Hands-on tutorial for ChIP-seq data analysis

Silvia Salatino, PhD

Analysis of the GSE51274 dataset from GEO

Today we are going to use the following dataset, which has been downloaded from the Gene Expression Omnibus (GEO) repository of high-throughput arrays and sequencing data.

The raw data are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51274

...and the corresponding article can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021073/

Most of the analysis steps we will see today are usually run on a High-Performance Computing cluster (HPC), which enables parallelisation and therefore speeds up the whole analysis.

However, since learning how to use HPC is beyond the scope of this one-day course, the time-consuming analysis steps have been precomputed and the results can be found in the "FULL_DATASET" folder. The "PLAYGROUND" folder, instead, can be used for trying software on smaller extracts of the original dataset, which would take much less to run.

Instructions for Windows users:

You'll be assigned a user name (e.g. workshop01) and a corresponding password to access our rescomp servers.

Download SmarTTY from http://smartty.sysprogs.com/
Double-click on the downloaded ".msi" file and follow the steps in the setup wizard to install it
Launch the program and click on "New SSH connection..."
In the pop-up window, type:

Host Name: rescomp.well.ox.ac.uk
User Name: (your user name)
Password: (your password)

Click "Connect" and then "Start with a regular Terminal"

Instructions for GNU/Linux or Mac users

You'll be assigned a user name (e.g. workshop01) and a corresponding password to access our rescomp servers.

Open an X11 terminal (XQuartz from Mac) and type ssh -X <your_user_name>@rescomp.well.ox.ac.uk
When prompted, type your password

ChIP-seq analysis exercises

There could be several solutions to the following exercises. Feel free to use your favourite programming language... or just copy-paste the code below ;)

STEP 1 - Familiarise with the FASTQ format.

Exercise A: extract 1000 reads from a FASTQ file and compress the output file

zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | head -4000 > PLAYGROUND/1000reads.fastq
gzip PLAYGROUND/1000reads.fastq

Exercise B: have a look at the structure of a FASTQ file

zless -S PLAYGROUND/1000reads.fastq.gz

Exercise C: count the number of reads in a FASTQ file from the full dataset

zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | awk '{s++} END {print s/4}'

STEP 2 - Use FastQC to check read qualities before pre-processing.

Exercise A: run FastQC on the 1000-reads FASTQ file

/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads.fastq.gz

Exercise B: have a look at each metric in the ".html" file generated by FastQC

firefox --no-remote PLAYGROUND/1000reads_fastqc.html

STEP 3 - Run Skewer to trim adapters from FASTQ read files.

Exercise A: launch Skewer on the 1000-reads FASTQ file

/apps/htseq/skewer/skewer -t 4 -m head -x AGATCGGAAGAGC -r 0.1 -d 0 -z -l 20 PLAYGROUND/1000reads.fastq.gz -o PLAYGROUND/1000reads

Exercise B: identify the trimmed reads from the output file

zcat PLAYGROUND/1000reads-trimmed.fastq.gz | awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 36) {print header, seq, qheader, qseq}}'

STEP 4 - Map the reads with BWA-mem and familiarise with the BAM format.

Exercise A: launch BWA-mem on the 1000-reads FASTQ file

/apps/htseq/bwa-0.7.15/bwa mem -t 4 -c 1 /well/htseq/Genomes/BWA-0.7/hs37d5 PLAYGROUND/1000reads-trimmed.fastq.gz | /apps/well/samtools/1.1/bin/samtools view -bh -S - > PLAYGROUND/1000reads.bam

Exercise B: have a look at the output file of BWA-mem (NOTE: a description of each field in the "SAM/BAM" file format can be found here http://samtools.github.io/hts-specs/SAMv1.pdf)

samtools view -h PLAYGROUND/1000reads.bam | less -S

Exercise C: use the appropriate flags to count how many reads were mapped to the reference genome (have a look at https://broadinstitute.github.io/picard/explain-flags.html). An alternative approach is to simply use samtools flagstat

samtools view -c -F 4 PLAYGROUND/1000reads.bam

STEP 5 - Filter out poor quality reads.

Exercise A: launch Bamtools on the 1000-reads BAM file

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/apps/well/bamtools/2.3.0/lib/bamtools"
/apps/well/bamtools/2.3.0/bin/bamtools filter -mapQuality '>=20' -isMapped true -in PLAYGROUND/1000reads.bam -out PLAYGROUND/1000reads_filtered.bam

Exercise B: count how many reads are left after filtering

samtools view -c PLAYGROUND/1000reads_filtered.bam

STEP 6 - Re-run FastQC to check read qualities after pre-processing.

Exercise A: run FastQC on the filtered 1000-reads BAM file

/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads_filtered.bam

Exercise B: have a look at each metric in the ".html" file generated by FastQC

firefox --no-remote PLAYGROUND/1000reads_filtered_fastqc.html

STEP 7 - Launch MACS2 to identify peaks from the filtered BAM files.

Exercise A: run MACS2 on a pair of treatment/control samples from the full dataset

module load python/2.7
/apps/well/python/2.7.11/bin/macs2 callpeak -t FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -c FULL_DATASET/BAM_FILTERED_files/MCF7_input_rep1.filtered.bam -f BAM -g hs --outdir PLAYGROUND --bdg --name MCF7_GATA3_rep1 --keep-dup 1

Exercise B: count the number of peaks obtained (there should be 15307)

wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak

Exercise C: familiarise with the BED format (NOTE: a description of each field in the "narrowPeak" file format can be found here https://github.com/taoliu/MACS)

less -S PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak

STEP 8 - Remove peaks in low-complexity or repeated regions and on contigs.

Exercise A: use Bedtools to filter out uninteresting peaks

/apps/well/bedtools/2.24.0/intersectBed -a PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak -b encode_blacklisted_regions.bed -v > PLAYGROUND/MCF7_GATA3_rep1_peaks.temp

Exercise B: remove all peaks located on contigs (i.e. not on chromosomes)

awk '!/^GL|hs/' PLAYGROUND/MCF7_GATA3_rep1_peaks.temp > PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered

Exercise C: check how many peaks were left after this filtering step

wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered

STEP 9 - Check the NSC and the RSC values for a given sample.

Exercise A: use SPP to calculate NSC and RSC for the IP data of the MCF7_GATA2_rep1 sample

PATH=/apps/well/samtools/0.1.19/bin:$PATH
/apps/well/R/3.2.2/bin/Rscript run_spp.R -c=FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -savp -out=PLAYGROUND/MCF7_GATA3_rep1.filtered.report -odir=PLAYGROUND/

Exercise B: open the output PDF file and explore the report file (help can be found in the SPP_instructions.txt file)

evince PLAYGROUND/MCF7_GATA3_rep1.filtered.pdf
less -S PLAYGROUND/MCF7_GATA3_rep1.filtered.report

STEP 10 - Check the IDR across replicate peaksets.

Exercise A: calculate the IDR for the MCF7_GATA2_rep1 and _rep2 samples

module load R/3.3.1
Rscript batch-consistency-analysis.r FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep2_peaks.ExcludeBlackList.narrowPeak -1 PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2 0 F p.value

Exercise B: generate a scatterplot to show the IDR values across the two replicates

gnuplot

set log x
set log y
set palette model RGB defined ( 0 'red', 1 'black' )
unset key
unset colorbox
plot 'PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt' u 5:9:($11 <= 0.01 ? 0 : 1 ) with points palette
exit

Exercise C: plot the cumulative distribution of the IDR

inp <- read.table('PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt', header=TRUE)
peaks_num <- cumsum(table(inp$IDR))
idr_values <- names(cumsum(table(inp$IDR)))
plot(peaks_num, idr_values, xlab='num of significant peaks', ylab='IDR')
quit()

module unload R/3.3.1

STEP 11 - Use DiffBind to perform a differential binding analysis.

Exercise A: explore the config file (pre-made) with the help of the package manual (https://bioconductor.org/packages/release/bioc/manuals/DiffBind/man/DiffBind.pdf)

less -S diffbind_SampleInfo.csv

Exercise B: launch DiffBind analysis on the samples

module load R/3.2.2
export R_LIBS=/apps/htseq/lib/R/3.2.2/library/
R

library(DiffBind)
my_data = dba(sampleSheet="diffbind_SampleInfo.csv")
dba.plotVenn(my_data, my_data$masks$GATA3, main="Peaks overlap")
my_data = dba.count(my_data, minOverlap=2, bCorPlot=FALSE)
my_data = dba.contrast(my_data, minMembers=2, categories=DBA_TISSUE)
my_data  # to see the contrast in the DBA object
my_data = dba.analyze(my_data)
dba.report(my_data, th=1, ext="txt", file="sites", initString="PLAYGROUND/diffbind")
dba.plotMA(my_data)
dba.plotPCA(my_data, label=DBA_ID)
quit()

module unload R/3.2.2
export R_LIBS=''

Exercise C: have a look at the output file generated using DiffBind

less -S PLAYGROUND/diffbind_sites.txt

STEP 12 - Use ChIPseeker to annotate peaks and do pathway enrichment analysis.

Exercise A: load MCF7_GATA3_rep1 peaks and plot their coverage along the genome

module load R/3.3.1
R

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ReactomePA)
peaks <- readPeakFile('FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak', header=F)
seqlevels(peaks) <- sapply(peaks@seqnames@values, function(x) paste('chr', x, sep=''))
covplot(peaks, weightCol="V5")

Exercise B: plot peak distances from their nearest TSSs

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Exercise C: generate a peak-to-annotation piechart summarising the portion of peaks annotated to a given feature

peakAnno <- annotatePeak(peaks, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)

Exercise D: plot the top 10 enriched terms from Reactome

genes <- seq2gene(peaks, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathways <- enrichPathway(genes)
barplot(pathways, showCategory=10)
quit()

module unload R/3.3.1

STEP 13 - Use MEME to do motif search (both known and de novo).

Exercise A: explore the PWM file in MEME format

less -S jaspar_PWMs.meme

Exercise B: extract the top 100 peaks from the MCF7_GATA3_rep1 sample, then retrieve their sequences in FASTA format

samtools faidx hs37d5.fasta $(sort -k8nr FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak | head -100 | awk 'END{print ""};{printf $1":"$2"-"$3" "}') > PLAYGROUND/top100_peaks.fa

Exercise C: run FIMO on the top 100 peaks to find binding sites for known transcription factors; then, have a look at the results in the output file

/apps/well/meme/4.11.2_2/bin/fimo --skip-matched-sequence jaspar_PWMs.meme PLAYGROUND/top100_peaks.fa 2> PLAYGROUND/top100_peaks.fimo.log > PLAYGROUND/top100_peaks.fimo.TFBSs
less -S PLAYGROUND/top100_peaks.fimo.TFBSs

Exercise D: run MEME to perform a de novo motif search across the peakset (NOTE: no negative sequences are used for MEME, the positive sequences are shuffled to create the negative set)

. /apps/well/perl/5.16.3/apps-well-perl-5.16.3.sh
/apps/well/meme/4.11.2_2/bin/meme-chip -oc PLAYGROUND/MEME_RESULTS/ -seed 1234 -ccut 0 -meme-minw 6 -meme-maxw 20 -meme-nmotifs 1 -fimo-skip PLAYGROUND/top100_peaks.fa

Exercise E: explore the MEME output results

firefox --no-remote PLAYGROUND/MEME_RESULTS/index.html

STEP 14 - Convert BEDGRAPH to BIGWIG format for data visualization.

Exercise A: use bedGraphToBigWig to generate a BIGWIG file

/apps/well/irap/git-gcc4.7.2/bin/bedGraphToBigWig PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bdg hg19_chrom_sizes.txt PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw

Exercise B: explore the data generated so far (BIGWIG and BED files) in IGV

module load java/1.8.0_latest
java -Xmx16G -jar /apps/well/igv/2.3.32/igv.jar

Then select:
- Genomes -> Load Genome from File... -> /well/workshop/workshop01/hs37d5.fasta
- File -> Load from File... -> /well/workshop/workshop01/hs37d5.refGene.txt.gz
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_GATA3_rep1.filtered.bam
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_input_rep1.filtered.bam

Finally, type PGR in the search box to focus on region 11:100,898,355-101,002,544 and compare it with Fig. 8 of the paper published using these data (see link above). To visualise reads mapping to the forward or reverse strand, right click on each BAM track and select "Color alignments by read strand".

Thank you for attending this course, I hope you enjoyed it!

University of Oxford