Hands-on tutorial for ChIP-seq data analysis

Silvia Salatino, PhD - 30.11.2018

Analysis of the GSE51274 dataset from GEO

Today we are going to use the following dataset, which has been downloaded from the Gene Expression Omnibus (GEO) repository of high-throughput arrays and sequencing data.

The raw data are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51274

...and the corresponding article can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021073/

Most of the analysis steps we will see today are usually run on a High-Performance Computing cluster (HPC), which enables parallelisation and therefore speeds up the whole analysis.

However, since learning how to use HPC is beyond the scope of this one-day course, the time-consuming analysis steps have been precomputed and the results can be found in the "FULL_DATASET" folder. The "PLAYGROUND" folder, instead, can be used for trying software on smaller extracts of the original dataset, which would take much less to run.

Instructions for Windows users:

You'll be assigned a user name (e.g. workshop01) and a corresponding password.

OPTION A)

Download SmarTTY from http://smartty.sysprogs.com/
Double-click on the downloaded ".msi" file and follow the steps in the setup wizard to install it
Launch the program and click on "New SSH connection..."
In the pop-up window, type:

Host Name: rescomp.well.ox.ac.uk
User Name: (your user name)
Password: (your password)

Click "Connect" and then "Start with a regular Terminal"

OPTION B)

Launch Xming, check only "Private networks" and click "Allow access"
Launch SSH Secure Shell
In the Secure Shell Client click on "Edit/Settings"
In the "Settings" window on the left-hand side click on the + next to expand "Profile Settings"
Click on "Tunneling"
On the right-hand side of the window, make sure there is a check next to the words "Tunnel X11 Connections"
On the left-hand side of the window, click on "Authentication". Make sure there is a check next to the words "Enable for SSH2 Connections"
Click "OK"
Click "File" and then "Save Settings"
Press "Enter"
In the pop-up window, type:

Host Name: rescomp.well.ox.ac.uk
User Name: (your user name)
Port Number: 22
Authentication Method: <Profile Settings>

Click "Connect" and, when prompted, type your password

Instructions for GNU/Linux or Mac users

You'll be assigned a user name (e.g. workshop01) and a corresponding password.

Open an X11 terminal (XQuartz from Mac) and type ssh -X <your_user_name>@rescomp.well.ox.ac.uk
When prompted, type your password

ChIP-seq analysis exercises

There could be several solutions to the following exercises. Feel free to use your favourite programming language... or just copy-paste the code below ;)

STEP 1 - Familiarise with the FASTQ format.

Exercise A: extract 1000 reads from a FASTQ file and compress the output file

zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | head -4000 > PLAYGROUND/1000reads.fastq
gzip PLAYGROUND/1000reads.fastq

Exercise B: have a look at the structure of a FASTQ file

zless -S PLAYGROUND/1000reads.fastq.gz

Exercise C: count the number of reads in a FASTQ file from the full dataset

zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | awk '{s++} END {print s/4}'

STEP 2 - Use FastQC to check read qualities before pre-processing.

Exercise A: run FastQC on the 1000-reads FASTQ file

/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads.fastq.gz

Exercise B: have a look at each metric in the ".html" file generated by FastQC

firefox --no-remote PLAYGROUND/1000reads_fastqc.html

STEP 3 - Run Skewer to trim adapters from FASTQ read files.

Exercise A: launch Skewer on the 1000-reads FASTQ file

/apps/htseq/skewer/skewer -t 4 -m head -x AGATCGGAAGAGC -r 0.1 -d 0 -z -l 20 PLAYGROUND/1000reads.fastq.gz -o PLAYGROUND/1000reads

Exercise B: identify the trimmed reads from the output file

zcat PLAYGROUND/1000reads-trimmed.fastq.gz | awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 36) {print header, seq, qheader, qseq}}'

STEP 4 - Map the reads with BWA-mem and familiarise with the BAM format.

Exercise A: launch BWA-mem on the 1000-reads FASTQ file

/apps/htseq/bwa-0.7.15/bwa mem -t 4 -c 1 /well/htseq/Genomes/BWA-0.7/hs37d5 PLAYGROUND/1000reads-trimmed.fastq.gz | /apps/well/samtools/1.1/bin/samtools view -bh -S - > PLAYGROUND/1000reads.bam

Exercise B: have a look at the output file of BWA-mem (NOTE: a description of each field in the "SAM/BAM" file format can be found here http://samtools.github.io/hts-specs/SAMv1.pdf)

samtools view -h PLAYGROUND/1000reads.bam | less -S

Exercise C: use the appropriate flags to count how many reads were mapped to the reference genome (have a look at https://broadinstitute.github.io/picard/explain-flags.html). An alternative approach is to simply use samtools flagstat

samtools view -c -F 4 PLAYGROUND/1000reads.bam

STEP 5 - Filter out poor quality reads.

Exercise A: launch Bamtools on the 1000-reads BAM file

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/apps/well/bamtools/2.3.0/lib/bamtools"
/apps/well/bamtools/2.3.0/bin/bamtools filter -mapQuality '>=20' -isMapped true -in PLAYGROUND/1000reads.bam -out PLAYGROUND/1000reads_filtered.bam

Exercise B: count how many reads are left after filtering

samtools view -c PLAYGROUND/1000reads_filtered.bam

STEP 6 - Re-run FastQC to check read qualities after pre-processing.

Exercise A: run FastQC on the filtered 1000-reads BAM file

/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads_filtered.bam

Exercise B: have a look at each metric in the ".html" file generated by FastQC

firefox --no-remote PLAYGROUND/1000reads_filtered_fastqc.html

STEP 7 - Launch MACS2 to identify peaks from the filtered BAM files.

Exercise A: run MACS2 on a pair of treatment/control samples from the full dataset

module load python/2.7
/apps/well/python/2.7.11/bin/macs2 callpeak -t FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -c FULL_DATASET/BAM_FILTERED_files/MCF7_input_rep1.filtered.bam -f BAM -g hs --outdir PLAYGROUND --bdg --name MCF7_GATA3_rep1 --keep-dup 1

Exercise B: count the number of reads obtained

wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak

Exercise C: familiarise with the BED format (NOTE: a description of each field in the "narrowPeak" file format can be found here https://github.com/taoliu/MACS)

less -S PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak

STEP 8 - Remove peaks in low-complexity or repeated regions and on contigs.

Exercise A: use Bedtools to filter out uninteresting peaks

/apps/well/bedtools/2.24.0/intersectBed -a PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak -b encode_blacklisted_regions.bed -v > PLAYGROUND/MCF7_GATA3_rep1_peaks.temp

Exercise B: remove all peaks located on contigs (i.e. not on chromosomes)

awk '!/^GL|hs/' PLAYGROUND/MCF7_GATA3_rep1_peaks.temp > PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered

Exercise C: check how many peaks were left after this filtering step

wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered

STEP 9 - Check the NSC and the RSC values for a given sample.

Exercise A: use SPP to calculate NSC and RSC for the IP data of the MCF7_GATA2_rep1 sample

/apps/well/R/3.2.2/bin/Rscript run_spp.R -c=FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -savp -out=PLAYGROUND/MCF7_GATA3_rep1.filtered.report -odir=PLAYGROUND/

Exercise B: open the output PDF file and explore the report file (help can be found in the SPP_instructions.txt file)

evince PLAYGROUND/MCF7_GATA3_rep1.filtered.pdf
less -S PLAYGROUND/MCF7_GATA3_rep1.filtered.report

STEP 10 - Check the IDR across replicate peaksets.

Exercise A: calculate the IDR for the MCF7_GATA2_rep1 and _rep2 samples

module load R/3.3.1
Rscript batch-consistency-analysis.r FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep2_peaks.ExcludeBlackList.narrowPeak -1 PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2 0 F p.value

Exercise B: generate a scatterplot to show the IDR values across the two replicates

gnuplot

set log x
set log y
set palette model RGB defined ( 0 'red', 1 'black' )
unset key
unset colorbox
plot 'PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt' u 5:9:($11 <= 0.01 ? 0 : 1 ) with points palette
exit

Exercise C: plot the cumulative distribution of the IDR

inp <- read.table('PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt', header=TRUE)
peaks_num <- cumsum(table(inp$IDR))
idr_values <- names(cumsum(table(inp$IDR)))
plot(peaks_num, idr_values, xlab='num of significant peaks', ylab='IDR')
quit()

module unload R/3.3.1

STEP 11 - Use DiffBind to perform a differential binding analysis.

Exercise A: explore the config file (pre-made) with the help of the package manual (https://bioconductor.org/packages/release/bioc/manuals/DiffBind/man/DiffBind.pdf)

less -S diffbind_SampleInfo.csv

Exercise B: launch DiffBind analysis on the samples

module load R/3.2.2
export R_LIBS=/apps/htseq/lib/R/3.2.2/library/
R

library(DiffBind)
my_data = dba(sampleSheet="diffbind_SampleInfo.csv")
dba.plotVenn(my_data, my_data$masks$GATA3, main="Peaks overlap")
my_data = dba.count(my_data, minOverlap=2, bCorPlot=FALSE)
my_data = dba.contrast(my_data, minMembers=2, categories=DBA_TISSUE)
my_data  # to see the contrast in the DBA object
my_data = dba.analyze(my_data)
dba.report(my_data, th=1, ext="txt", file="sites", initString="PLAYGROUND/diffbind")
dba.plotMA(my_data)
dba.plotPCA(my_data, label=DBA_ID)
quit()

module unload R/3.2.2
export R_LIBS=''

Exercise C: have a look at the output file generated using DiffBind

less -S PLAYGROUND/diffbind_sites.txt

STEP 12 - Use ChIPseeker to annotate peaks and do pathway enrichment analysis.

Exercise A: load MCF7_GATA3_rep1 peaks and plot their coverage along the genome

module load R/3.3.1
R

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ReactomePA)
peaks <- readPeakFile('FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak', header=F)
seqlevels(peaks) <- sapply(peaks@seqnames@values, function(x) paste('chr', x, sep=''))
covplot(peaks, weightCol="V5")

Exercise B: plot peak distances from their nearest TSSs

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Exercise C: generate a peak-to-annotation piechart summarising the portion of peaks annotated to a given feature

peakAnno <- annotatePeak(peaks, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)

Exercise D: plot the top 10 enriched terms from Reactome

genes <- seq2gene(peaks, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathways <- enrichPathway(genes)
barplot(pathways, showCategory=10)
quit()

module unload R/3.3.1

STEP 13 - Use MEME to do motif search (both known and de novo).

Exercise A: explore the PWM file in MEME format

less -S jaspar_PWMs.meme

Exercise B: extract the top 100 peaks from the MCF7_GATA3_rep1 sample, then retrieve their sequences in FASTA format

samtools faidx hs37d5.fasta $(sort -k8nr FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak | head -100 | awk 'END{print ""};{printf $1":"$2"-"$3" "}') > PLAYGROUND/top100_peaks.fa

Exercise C: run FIMO on the top 100 peaks to find binding sites for known transcription factors; then, have a look at the results in the output file

/apps/well/meme/4.11.2_2/bin/fimo --skip-matched-sequence jaspar_PWMs.meme PLAYGROUND/top100_peaks.fa 2> PLAYGROUND/top100_peaks.fimo.log > PLAYGROUND/top100_peaks.fimo.TFBSs
less -S PLAYGROUND/top100_peaks.fimo.TFBSs

Exercise D: run MEME to perform a de novo motif search across the peakset (NOTE: no negative sequences are used for MEME, the positive sequences are shuffled to create the negative set)

. /apps/well/perl/5.16.3/apps-well-perl-5.16.3.sh
/apps/well/meme/4.11.2_2/bin/meme-chip -oc PLAYGROUND/MEME_RESULTS/ -seed 1234 -ccut 0 -meme-minw 6 -meme-maxw 20 -meme-nmotifs 1 -fimo-skip PLAYGROUND/top100_peaks.fa

Exercise E: explore the MEME output results

firefox --no-remote PLAYGROUND/MEME_RESULTS/index.html

STEP 14 - Convert BEDGRAPH to BIGWIG format for data visualization.

Exercise A: use bedGraphToBigWig to generate a BIGWIG file

/apps/well/irap/git-gcc4.7.2/bin/bedGraphToBigWig PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bdg hg19_chrom_sizes.txt PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw

Exercise B: explore the data generated so far (BIGWIG and BED files) in IGV

java -Xmx16G -jar /apps/well/igv/2.3.32/igv.jar

Then select:
- Genomes -> Load Genome from File... -> /well/workshop/workshop01/hs37d5.fasta
- File -> Load from File... -> /well/workshop/workshop01/hs37d5.refGene.txt.gz
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_GATA3_rep1.filtered.bam
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_input_rep1.filtered.bam

Finally, type PGR in the search box to focus on region 11:100,898,355-101,002,544 and compare it with Fig. 8 of the paper published using these data (see link above).

Thank you for attending this course, I hope you enjoyed it!

University of Oxford