Hands-on tutorial for ChIP-seq data analysis
Silvia Salatino, PhD
Analysis of the GSE51274 dataset from GEO
Today we are going to use the following dataset, which has been downloaded from the Gene Expression Omnibus (GEO) repository of high-throughput arrays and sequencing data.
The raw data are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51274
...and the corresponding article can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021073/
Most of the analysis steps we will see today are usually run on a High-Performance Computing cluster (HPC), which enables parallelisation and therefore speeds up the whole analysis.
However, since learning how to use HPC is beyond the scope of this one-day course, the time-consuming analysis steps have been precomputed and the results can be found in the "FULL_DATASET" folder. The "PLAYGROUND" folder, instead, can be used for trying software on smaller extracts of the original dataset, which would take much less to run.
Instructions for Windows users:
You'll be assigned a user name (e.g. workshop01) and a corresponding password to access our rescomp servers.
- Download SmarTTY from http://smartty.sysprogs.com/
- Double-click on the downloaded ".msi" file and follow the steps in the setup wizard to install it
- Launch the program and click on "New SSH connection..."
- In the pop-up window, type:
- Host Name: rescomp.well.ox.ac.uk
- User Name: (your user name)
- Password: (your password)
- Click "Connect" and then "Start with a regular Terminal"
Instructions for GNU/Linux or Mac users
You'll be assigned a user name (e.g. workshop01) and a corresponding password to access our rescomp servers.
- Open an X11 terminal (XQuartz from Mac) and type
ssh -X <your_user_name>@rescomp.well.ox.ac.uk - When prompted, type your password
ChIP-seq analysis exercises
There could be several solutions to the following exercises. Feel free to use your favourite programming language... or just copy-paste the code below ;)
STEP 1 - Familiarise with the FASTQ format.
- Exercise A: extract 1000 reads from a FASTQ file and compress the output file
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | head -4000 > PLAYGROUND/1000reads.fastq
gzip PLAYGROUND/1000reads.fastq- Exercise B: have a look at the structure of a FASTQ file
zless -S PLAYGROUND/1000reads.fastq.gz- Exercise C: count the number of reads in a FASTQ file from the full dataset
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | awk '{s++} END {print s/4}' STEP 2 - Use FastQC to check read qualities before pre-processing.
- Exercise A: run FastQC on the 1000-reads FASTQ file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads.fastq.gz- Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_fastqc.html STEP 3 - Run Skewer to trim adapters from FASTQ read files.
- Exercise A: launch Skewer on the 1000-reads FASTQ file
/apps/htseq/skewer/skewer -t 4 -m head -x AGATCGGAAGAGC -r 0.1 -d 0 -z -l 20 PLAYGROUND/1000reads.fastq.gz -o PLAYGROUND/1000reads- Exercise B: identify the trimmed reads from the output file
zcat PLAYGROUND/1000reads-trimmed.fastq.gz | awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 36) {print header, seq, qheader, qseq}}' STEP 4 - Map the reads with BWA-mem and familiarise with the BAM format.
- Exercise A: launch BWA-mem on the 1000-reads FASTQ file
/apps/htseq/bwa-0.7.15/bwa mem -t 4 -c 1 /well/htseq/Genomes/BWA-0.7/hs37d5 PLAYGROUND/1000reads-trimmed.fastq.gz | /apps/well/samtools/1.1/bin/samtools view -bh -S - > PLAYGROUND/1000reads.bam- Exercise B: have a look at the output file of BWA-mem (NOTE: a description of each field in the "SAM/BAM" file format can be found here http://samtools.github.io/hts-specs/SAMv1.pdf)
samtools view -h PLAYGROUND/1000reads.bam | less -S- Exercise C: use the appropriate flags to count how many reads were mapped to the reference genome (have a look at https://broadinstitute.github.io/picard/explain-flags.html). An alternative approach is to simply use
samtools flagstat
samtools view -c -F 4 PLAYGROUND/1000reads.bam STEP 5 - Filter out poor quality reads.
- Exercise A: launch Bamtools on the 1000-reads BAM file
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/apps/well/bamtools/2.3.0/lib/bamtools"
/apps/well/bamtools/2.3.0/bin/bamtools filter -mapQuality '>=20' -isMapped true -in PLAYGROUND/1000reads.bam -out PLAYGROUND/1000reads_filtered.bam- Exercise B: count how many reads are left after filtering
samtools view -c PLAYGROUND/1000reads_filtered.bam STEP 6 - Re-run FastQC to check read qualities after pre-processing.
- Exercise A: run FastQC on the filtered 1000-reads BAM file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads_filtered.bam- Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_filtered_fastqc.html STEP 7 - Launch MACS2 to identify peaks from the filtered BAM files.
- Exercise A: run MACS2 on a pair of treatment/control samples from the full dataset
module load python/2.7
/apps/well/python/2.7.11/bin/macs2 callpeak -t FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -c FULL_DATASET/BAM_FILTERED_files/MCF7_input_rep1.filtered.bam -f BAM -g hs --outdir PLAYGROUND --bdg --name MCF7_GATA3_rep1 --keep-dup 1- Exercise B: count the number of peaks obtained (there should be 15307)
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak- Exercise C: familiarise with the BED format (NOTE: a description of each field in the "narrowPeak" file format can be found here https://github.com/taoliu/MACS)
less -S PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak STEP 8 - Remove peaks in low-complexity or repeated regions and on contigs.
- Exercise A: use Bedtools to filter out uninteresting peaks
/apps/well/bedtools/2.24.0/intersectBed -a PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak -b encode_blacklisted_regions.bed -v > PLAYGROUND/MCF7_GATA3_rep1_peaks.temp- Exercise B: remove all peaks located on contigs (i.e. not on chromosomes)
awk '!/^GL|hs/' PLAYGROUND/MCF7_GATA3_rep1_peaks.temp > PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered- Exercise C: check how many peaks were left after this filtering step
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered STEP 9 - Check the NSC and the RSC values for a given sample.
- Exercise A: use SPP to calculate NSC and RSC for the IP data of the MCF7_GATA2_rep1 sample
PATH=/apps/well/samtools/0.1.19/bin:$PATH
/apps/well/R/3.2.2/bin/Rscript run_spp.R -c=FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -savp -out=PLAYGROUND/MCF7_GATA3_rep1.filtered.report -odir=PLAYGROUND/- Exercise B: open the output PDF file and explore the report file (help can be found in the SPP_instructions.txt file)
evince PLAYGROUND/MCF7_GATA3_rep1.filtered.pdf
less -S PLAYGROUND/MCF7_GATA3_rep1.filtered.report STEP 10 - Check the IDR across replicate peaksets.
- Exercise A: calculate the IDR for the MCF7_GATA2_rep1 and _rep2 samples
module load R/3.3.1
Rscript batch-consistency-analysis.r FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep2_peaks.ExcludeBlackList.narrowPeak -1 PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2 0 F p.value - Exercise B: generate a scatterplot to show the IDR values across the two replicates
gnuplotset log x
set log y
set palette model RGB defined ( 0 'red', 1 'black' )
unset key
unset colorbox
plot 'PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt' u 5:9:($11 <= 0.01 ? 0 : 1 ) with points palette
exit- Exercise C: plot the cumulative distribution of the IDR
Rinp <- read.table('PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt', header=TRUE)
peaks_num <- cumsum(table(inp$IDR))
idr_values <- names(cumsum(table(inp$IDR)))
plot(peaks_num, idr_values, xlab='num of significant peaks', ylab='IDR')
quit()module unload R/3.3.1 STEP 11 - Use DiffBind to perform a differential binding analysis.
- Exercise A: explore the config file (pre-made) with the help of the package manual (https://bioconductor.org/packages/release/bioc/manuals/DiffBind/man/DiffBind.pdf)
less -S diffbind_SampleInfo.csv- Exercise B: launch DiffBind analysis on the samples
module load R/3.2.2
export R_LIBS=/apps/htseq/lib/R/3.2.2/library/
Rlibrary(DiffBind)
my_data = dba(sampleSheet="diffbind_SampleInfo.csv")
dba.plotVenn(my_data, my_data$masks$GATA3, main="Peaks overlap")
my_data = dba.count(my_data, minOverlap=2, bCorPlot=FALSE)
my_data = dba.contrast(my_data, minMembers=2, categories=DBA_TISSUE)
my_data # to see the contrast in the DBA object
my_data = dba.analyze(my_data)
dba.report(my_data, th=1, ext="txt", file="sites", initString="PLAYGROUND/diffbind")
dba.plotMA(my_data)
dba.plotPCA(my_data, label=DBA_ID)
quit()module unload R/3.2.2
export R_LIBS=''- Exercise C: have a look at the output file generated using DiffBind
less -S PLAYGROUND/diffbind_sites.txt STEP 12 - Use ChIPseeker to annotate peaks and do pathway enrichment analysis.
- Exercise A: load MCF7_GATA3_rep1 peaks and plot their coverage along the genome
module load R/3.3.1
Rlibrary(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ReactomePA)
peaks <- readPeakFile('FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak', header=F)
seqlevels(peaks) <- sapply(peaks@seqnames@values, function(x) paste('chr', x, sep=''))
covplot(peaks, weightCol="V5")- Exercise B: plot peak distances from their nearest TSSs
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")- Exercise C: generate a peak-to-annotation piechart summarising the portion of peaks annotated to a given feature
peakAnno <- annotatePeak(peaks, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno) - Exercise D: plot the top 10 enriched terms from Reactome
genes <- seq2gene(peaks, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathways <- enrichPathway(genes)
barplot(pathways, showCategory=10)
quit()module unload R/3.3.1 STEP 13 - Use MEME to do motif search (both known and de novo).
- Exercise A: explore the PWM file in MEME format
less -S jaspar_PWMs.meme- Exercise B: extract the top 100 peaks from the MCF7_GATA3_rep1 sample, then retrieve their sequences in FASTA format
samtools faidx hs37d5.fasta $(sort -k8nr FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak | head -100 | awk 'END{print ""};{printf $1":"$2"-"$3" "}') > PLAYGROUND/top100_peaks.fa- Exercise C: run FIMO on the top 100 peaks to find binding sites for known transcription factors; then, have a look at the results in the output file
/apps/well/meme/4.11.2_2/bin/fimo --skip-matched-sequence jaspar_PWMs.meme PLAYGROUND/top100_peaks.fa 2> PLAYGROUND/top100_peaks.fimo.log > PLAYGROUND/top100_peaks.fimo.TFBSs
less -S PLAYGROUND/top100_peaks.fimo.TFBSs- Exercise D: run MEME to perform a de novo motif search across the peakset (NOTE: no negative sequences are used for MEME, the positive sequences are shuffled to create the negative set)
. /apps/well/perl/5.16.3/apps-well-perl-5.16.3.sh
/apps/well/meme/4.11.2_2/bin/meme-chip -oc PLAYGROUND/MEME_RESULTS/ -seed 1234 -ccut 0 -meme-minw 6 -meme-maxw 20 -meme-nmotifs 1 -fimo-skip PLAYGROUND/top100_peaks.fa- Exercise E: explore the MEME output results
firefox --no-remote PLAYGROUND/MEME_RESULTS/index.html STEP 14 - Convert BEDGRAPH to BIGWIG format for data visualization.
- Exercise A: use bedGraphToBigWig to generate a BIGWIG file
/apps/well/irap/git-gcc4.7.2/bin/bedGraphToBigWig PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bdg hg19_chrom_sizes.txt PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw- Exercise B: explore the data generated so far (BIGWIG and BED files) in IGV
module load java/1.8.0_latest
java -Xmx16G -jar /apps/well/igv/2.3.32/igv.jarThen select:
- Genomes -> Load Genome from File... -> /well/workshop/workshop01/hs37d5.fasta
- File -> Load from File... -> /well/workshop/workshop01/hs37d5.refGene.txt.gz
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_GATA3_rep1.filtered.bam
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_input_rep1.filtered.bam
Finally, type PGR in the search box to focus on region 11:100,898,355-101,002,544 and compare it with Fig. 8 of the paper published using these data (see link above). To visualise reads mapping to the forward or reverse strand, right click on each BAM track and select "Color alignments by read strand".
Thank you for attending this course, I hope you enjoyed it!
