University of Oxford

Hands-on tutorial for ChIP-seq data analysis

Silvia Salatino, PhD - 30.11.2018

 

Analysis of the GSE51274 dataset from GEO

Today we are going to use the following dataset, which has been downloaded from the Gene Expression Omnibus (GEO) repository of high-throughput arrays and sequencing data.

The raw data are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51274

...and the corresponding article can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021073/

Most of the analysis steps we will see today are usually run on a High-Performance Computing cluster (HPC), which enables parallelisation and therefore speeds up the whole analysis.

However, since learning how to use HPC is beyond the scope of this one-day course, the time-consuming analysis steps have been precomputed and the results can be found in the "FULL_DATASET" folder. The "PLAYGROUND" folder, instead, can be used for trying software on smaller extracts of the original dataset, which would take much less to run.

 

Instructions for Windows users:

You'll be assigned a user name (e.g. workshop01) and a corresponding password.

OPTION A)

  1. Download SmarTTY from http://smartty.sysprogs.com/
  2. Double-click on the downloaded ".msi" file and follow the steps in the setup wizard to install it
  3. Launch the program and click on "New SSH connection..."
  4. In the pop-up window, type:
  • Host Name: rescomp.well.ox.ac.uk
  • User Name: (your user name)
  • Password: (your password)
  1. Click "Connect" and then "Start with a regular Terminal"

OPTION B)

  1. Launch Xming, check only "Private networks" and click "Allow access"
  2. Launch SSH Secure Shell
  3. In the Secure Shell Client click on "Edit/Settings"
  4. In the "Settings" window on the left-hand side click on the + next to expand "Profile Settings"
  5. Click on "Tunneling"
  6. On the right-hand side of the window, make sure there is a check next to the words "Tunnel X11 Connections"
  7. On the left-hand side of the window, click on "Authentication". Make sure there is a check next to the words "Enable for SSH2 Connections"
  8. Click "OK"
  9. Click "File" and then "Save Settings"
  10. Press "Enter"
  11. In the pop-up window, type:
  • Host Name: rescomp.well.ox.ac.uk
  • User Name: (your user name)
  • Port Number: 22
  • Authentication Method: <Profile Settings>
  1. Click "Connect" and, when prompted, type your password

 

Instructions for GNU/Linux or Mac users

You'll be assigned a user name (e.g. workshop01) and a corresponding password.

  1. Open an X11 terminal (XQuartz from Mac) and type ssh -X <your_user_name>@rescomp.well.ox.ac.uk
  2. When prompted, type your password

 

ChIP-seq analysis exercises

There could be several solutions to the following exercises. Feel free to use your favourite programming language... or just copy-paste the code below ;)

#008D36 STEP 1 - Familiarise with the FASTQ format.

  • Exercise A: extract 1000 reads from a FASTQ file and compress the output file
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | head -4000 > PLAYGROUND/1000reads.fastq
gzip PLAYGROUND/1000reads.fastq
  • Exercise B: have a look at the structure of a FASTQ file
zless -S PLAYGROUND/1000reads.fastq.gz
  • Exercise C: count the number of reads in a FASTQ file from the full dataset
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | awk '{s++} END {print s/4}'

#008D36 STEP 2 - Use FastQC to check read qualities before pre-processing.

  • Exercise A: run FastQC on the 1000-reads FASTQ file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads.fastq.gz
  • Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_fastqc.html

#008D36 STEP 3 - Run Skewer to trim adapters from FASTQ read files.

  • Exercise A: launch Skewer on the 1000-reads FASTQ file
/apps/htseq/skewer/skewer -t 4 -m head -x AGATCGGAAGAGC -r 0.1 -d 0 -z -l 20 PLAYGROUND/1000reads.fastq.gz -o PLAYGROUND/1000reads
  • Exercise B: identify the trimmed reads from the output file
zcat PLAYGROUND/1000reads-trimmed.fastq.gz | awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 36) {print header, seq, qheader, qseq}}'

#008D36 STEP 4 - Map the reads with BWA-mem and familiarise with the BAM format.

  • Exercise A: launch BWA-mem on the 1000-reads FASTQ file
/apps/htseq/bwa-0.7.15/bwa mem -t 4 -c 1 /well/htseq/Genomes/BWA-0.7/hs37d5 PLAYGROUND/1000reads-trimmed.fastq.gz | /apps/well/samtools/1.1/bin/samtools view -bh -S - > PLAYGROUND/1000reads.bam
samtools view -h PLAYGROUND/1000reads.bam | less -S
samtools view -c -F 4 PLAYGROUND/1000reads.bam

#008D36 STEP 5 - Filter out poor quality reads.

  • Exercise A: launch Bamtools on the 1000-reads BAM file
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/apps/well/bamtools/2.3.0/lib/bamtools"
/apps/well/bamtools/2.3.0/bin/bamtools filter -mapQuality '>=20' -isMapped true -in PLAYGROUND/1000reads.bam -out PLAYGROUND/1000reads_filtered.bam
  • Exercise B: count how many reads are left after filtering
samtools view -c PLAYGROUND/1000reads_filtered.bam

#008D36 STEP 6 - Re-run FastQC to check read qualities after pre-processing.

  • Exercise A: run FastQC on the filtered 1000-reads BAM file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads_filtered.bam
  • Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_filtered_fastqc.html

#008D36 STEP 7 - Launch MACS2 to identify peaks from the filtered BAM files.

  • Exercise A: run MACS2 on a pair of treatment/control samples from the full dataset
module load python/2.7
/apps/well/python/2.7.11/bin/macs2 callpeak -t FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -c FULL_DATASET/BAM_FILTERED_files/MCF7_input_rep1.filtered.bam -f BAM -g hs --outdir PLAYGROUND --bdg --name MCF7_GATA3_rep1 --keep-dup 1
  • Exercise B: count the number of reads obtained
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
  • Exercise C: familiarise with the BED format (NOTE: a description of each field in the "narrowPeak" file format can be found here https://github.com/taoliu/MACS)
less -S PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak

#008D36 STEP 8 - Remove peaks in low-complexity or repeated regions and on contigs.

  • Exercise A: use Bedtools to filter out uninteresting peaks
/apps/well/bedtools/2.24.0/intersectBed -a PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak -b encode_blacklisted_regions.bed -v > PLAYGROUND/MCF7_GATA3_rep1_peaks.temp
  • Exercise B: remove all peaks located on contigs (i.e. not on chromosomes)
awk '!/^GL|hs/' PLAYGROUND/MCF7_GATA3_rep1_peaks.temp > PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered
  • Exercise C: check how many peaks were left after this filtering step
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered

#008D36 STEP 9 - Check the NSC and the RSC values for a given sample.

  • Exercise A: use SPP to calculate NSC and RSC for the IP data of the MCF7_GATA2_rep1 sample
/apps/well/R/3.2.2/bin/Rscript run_spp.R -c=FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -savp -out=PLAYGROUND/MCF7_GATA3_rep1.filtered.report -odir=PLAYGROUND/
  • Exercise B: open the output PDF file and explore the report file (help can be found in the SPP_instructions.txt file)
evince PLAYGROUND/MCF7_GATA3_rep1.filtered.pdf
less -S PLAYGROUND/MCF7_GATA3_rep1.filtered.report

#008D36 STEP 10 - Check the IDR across replicate peaksets.

  • Exercise A: calculate the IDR for the MCF7_GATA2_rep1 and _rep2 samples
module load R/3.3.1
Rscript batch-consistency-analysis.r FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep2_peaks.ExcludeBlackList.narrowPeak -1 PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2 0 F p.value 
  • Exercise B: generate a scatterplot to show the IDR values across the two replicates
gnuplot
set log x
set log y
set palette model RGB defined ( 0 'red', 1 'black' )
unset key
unset colorbox
plot 'PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt' u 5:9:($11 <= 0.01 ? 0 : 1 ) with points palette
exit
  • Exercise C: plot the cumulative distribution of the IDR
R
inp <- read.table('PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt', header=TRUE)
peaks_num <- cumsum(table(inp$IDR))
idr_values <- names(cumsum(table(inp$IDR)))
plot(peaks_num, idr_values, xlab='num of significant peaks', ylab='IDR')
quit()
module unload R/3.3.1

#008D36 STEP 11 - Use DiffBind to perform a differential binding analysis.

less -S diffbind_SampleInfo.csv
  • Exercise B: launch DiffBind analysis on the samples
module load R/3.2.2
export R_LIBS=/apps/htseq/lib/R/3.2.2/library/
R
library(DiffBind)
my_data = dba(sampleSheet="diffbind_SampleInfo.csv")
dba.plotVenn(my_data, my_data$masks$GATA3, main="Peaks overlap")
my_data = dba.count(my_data, minOverlap=2, bCorPlot=FALSE)
my_data = dba.contrast(my_data, minMembers=2, categories=DBA_TISSUE)
my_data  # to see the contrast in the DBA object
my_data = dba.analyze(my_data)
dba.report(my_data, th=1, ext="txt", file="sites", initString="PLAYGROUND/diffbind")
dba.plotMA(my_data)
dba.plotPCA(my_data, label=DBA_ID)
quit()
module unload R/3.2.2
export R_LIBS=''
  • Exercise C: have a look at the output file generated using DiffBind
less -S PLAYGROUND/diffbind_sites.txt

#008D36 STEP 12 - Use ChIPseeker to annotate peaks and do pathway enrichment analysis.

  • Exercise A: load MCF7_GATA3_rep1 peaks and plot their coverage along the genome
module load R/3.3.1
R
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ReactomePA)
peaks <- readPeakFile('FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak', header=F)
seqlevels(peaks) <- sapply(peaks@seqnames@values, function(x) paste('chr', x, sep=''))
covplot(peaks, weightCol="V5")
  • Exercise B: plot peak distances from their nearest TSSs
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
  • Exercise C: generate a peak-to-annotation piechart summarising the portion of peaks annotated to a given feature
peakAnno <- annotatePeak(peaks, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)  
  • Exercise D: plot the top 10 enriched terms from Reactome
genes <- seq2gene(peaks, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathways <- enrichPathway(genes)
barplot(pathways, showCategory=10)
quit()
module unload R/3.3.1

#008D36 STEP 13 - Use MEME to do motif search (both known and de novo).

  • Exercise A: explore the PWM file in MEME format
less -S jaspar_PWMs.meme
  • Exercise B: extract the top 100 peaks from the MCF7_GATA3_rep1 sample, then retrieve their sequences in FASTA format
samtools faidx hs37d5.fasta $(sort -k8nr FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak | head -100 | awk 'END{print ""};{printf $1":"$2"-"$3" "}') > PLAYGROUND/top100_peaks.fa
  • Exercise C: run FIMO on the top 100 peaks to find binding sites for known transcription factors; then, have a look at the results in the output file
/apps/well/meme/4.11.2_2/bin/fimo --skip-matched-sequence jaspar_PWMs.meme PLAYGROUND/top100_peaks.fa 2> PLAYGROUND/top100_peaks.fimo.log > PLAYGROUND/top100_peaks.fimo.TFBSs
less -S PLAYGROUND/top100_peaks.fimo.TFBSs
  • Exercise D: run MEME to perform a de novo motif search across the peakset (NOTE: no negative sequences are used for MEME, the positive sequences are shuffled to create the negative set)
. /apps/well/perl/5.16.3/apps-well-perl-5.16.3.sh
/apps/well/meme/4.11.2_2/bin/meme-chip -oc PLAYGROUND/MEME_RESULTS/ -seed 1234 -ccut 0 -meme-minw 6 -meme-maxw 20 -meme-nmotifs 1 -fimo-skip PLAYGROUND/top100_peaks.fa
  • Exercise E: explore the MEME output results
firefox --no-remote PLAYGROUND/MEME_RESULTS/index.html

#008D36 STEP 14 - Convert BEDGRAPH to BIGWIG format for data visualization.

  • Exercise A: use bedGraphToBigWig to generate a BIGWIG file
/apps/well/irap/git-gcc4.7.2/bin/bedGraphToBigWig PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bdg hg19_chrom_sizes.txt PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
  • Exercise B: explore the data generated so far (BIGWIG and BED files) in IGV
java -Xmx16G -jar /apps/well/igv/2.3.32/igv.jar

Then select:
- Genomes -> Load Genome from File... -> /well/workshop/workshop01/hs37d5.fasta
- File -> Load from File... -> /well/workshop/workshop01/hs37d5.refGene.txt.gz
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_GATA3_rep1.filtered.bam
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_input_rep1.filtered.bam

Finally, type PGR in the search box to focus on region 11:100,898,355-101,002,544 and compare it with Fig. 8 of the paper published using these data (see link above).

 

                                                                                                                      Thank you for attending this course, I hope you enjoyed it!