Hands-on tutorial for ChIP-seq data analysis
Silvia Salatino, PhD - 30.11.2018
Analysis of the GSE51274 dataset from GEO
Today we are going to use the following dataset, which has been downloaded from the Gene Expression Omnibus (GEO) repository of high-throughput arrays and sequencing data.
The raw data are available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51274
...and the corresponding article can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021073/
Most of the analysis steps we will see today are usually run on a High-Performance Computing cluster (HPC), which enables parallelisation and therefore speeds up the whole analysis.
However, since learning how to use HPC is beyond the scope of this one-day course, the time-consuming analysis steps have been precomputed and the results can be found in the "FULL_DATASET" folder. The "PLAYGROUND" folder, instead, can be used for trying software on smaller extracts of the original dataset, which would take much less to run.
Instructions for Windows users:
You'll be assigned a user name (e.g. workshop01) and a corresponding password.
OPTION A)
- Download SmarTTY from http://smartty.sysprogs.com/
- Double-click on the downloaded ".msi" file and follow the steps in the setup wizard to install it
- Launch the program and click on "New SSH connection..."
- In the pop-up window, type:
- Host Name: rescomp.well.ox.ac.uk
- User Name: (your user name)
- Password: (your password)
- Click "Connect" and then "Start with a regular Terminal"
OPTION B)
- Launch Xming, check only "Private networks" and click "Allow access"
- Launch SSH Secure Shell
- In the Secure Shell Client click on "Edit/Settings"
- In the "Settings" window on the left-hand side click on the + next to expand "Profile Settings"
- Click on "Tunneling"
- On the right-hand side of the window, make sure there is a check next to the words "Tunnel X11 Connections"
- On the left-hand side of the window, click on "Authentication". Make sure there is a check next to the words "Enable for SSH2 Connections"
- Click "OK"
- Click "File" and then "Save Settings"
- Press "Enter"
- In the pop-up window, type:
- Host Name: rescomp.well.ox.ac.uk
- User Name: (your user name)
- Port Number: 22
- Authentication Method: <Profile Settings>
- Click "Connect" and, when prompted, type your password
Instructions for GNU/Linux or Mac users
You'll be assigned a user name (e.g. workshop01) and a corresponding password.
- Open an X11 terminal (XQuartz from Mac) and type
ssh -X <your_user_name>@rescomp.well.ox.ac.uk - When prompted, type your password
ChIP-seq analysis exercises
There could be several solutions to the following exercises. Feel free to use your favourite programming language... or just copy-paste the code below ;)
STEP 1 - Familiarise with the FASTQ format.
- Exercise A: extract 1000 reads from a FASTQ file and compress the output file
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | head -4000 > PLAYGROUND/1000reads.fastq
gzip PLAYGROUND/1000reads.fastq- Exercise B: have a look at the structure of a FASTQ file
zless -S PLAYGROUND/1000reads.fastq.gz- Exercise C: count the number of reads in a FASTQ file from the full dataset
zcat FULL_DATASET/FASTQ_files/MCF7_GATA3_rep1.fastq.gz | awk '{s++} END {print s/4}' STEP 2 - Use FastQC to check read qualities before pre-processing.
- Exercise A: run FastQC on the 1000-reads FASTQ file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads.fastq.gz- Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_fastqc.html STEP 3 - Run Skewer to trim adapters from FASTQ read files.
- Exercise A: launch Skewer on the 1000-reads FASTQ file
/apps/htseq/skewer/skewer -t 4 -m head -x AGATCGGAAGAGC -r 0.1 -d 0 -z -l 20 PLAYGROUND/1000reads.fastq.gz -o PLAYGROUND/1000reads- Exercise B: identify the trimmed reads from the output file
zcat PLAYGROUND/1000reads-trimmed.fastq.gz | awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) < 36) {print header, seq, qheader, qseq}}' STEP 4 - Map the reads with BWA-mem and familiarise with the BAM format.
- Exercise A: launch BWA-mem on the 1000-reads FASTQ file
/apps/htseq/bwa-0.7.15/bwa mem -t 4 -c 1 /well/htseq/Genomes/BWA-0.7/hs37d5 PLAYGROUND/1000reads-trimmed.fastq.gz | /apps/well/samtools/1.1/bin/samtools view -bh -S - > PLAYGROUND/1000reads.bam- Exercise B: have a look at the output file of BWA-mem (NOTE: a description of each field in the "SAM/BAM" file format can be found here http://samtools.github.io/hts-specs/SAMv1.pdf)
samtools view -h PLAYGROUND/1000reads.bam | less -S- Exercise C: use the appropriate flags to count how many reads were mapped to the reference genome (have a look at https://broadinstitute.github.io/picard/explain-flags.html). An alternative approach is to simply use
samtools flagstat
samtools view -c -F 4 PLAYGROUND/1000reads.bam STEP 5 - Filter out poor quality reads.
- Exercise A: launch Bamtools on the 1000-reads BAM file
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/apps/well/bamtools/2.3.0/lib/bamtools"
/apps/well/bamtools/2.3.0/bin/bamtools filter -mapQuality '>=20' -isMapped true -in PLAYGROUND/1000reads.bam -out PLAYGROUND/1000reads_filtered.bam- Exercise B: count how many reads are left after filtering
samtools view -c PLAYGROUND/1000reads_filtered.bam STEP 6 - Re-run FastQC to check read qualities after pre-processing.
- Exercise A: run FastQC on the filtered 1000-reads BAM file
/apps/htseq/FastQC/fastqc -t 4 --noextract -o PLAYGROUND PLAYGROUND/1000reads_filtered.bam- Exercise B: have a look at each metric in the ".html" file generated by FastQC
firefox --no-remote PLAYGROUND/1000reads_filtered_fastqc.html STEP 7 - Launch MACS2 to identify peaks from the filtered BAM files.
- Exercise A: run MACS2 on a pair of treatment/control samples from the full dataset
module load python/2.7
/apps/well/python/2.7.11/bin/macs2 callpeak -t FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -c FULL_DATASET/BAM_FILTERED_files/MCF7_input_rep1.filtered.bam -f BAM -g hs --outdir PLAYGROUND --bdg --name MCF7_GATA3_rep1 --keep-dup 1- Exercise B: count the number of reads obtained
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak- Exercise C: familiarise with the BED format (NOTE: a description of each field in the "narrowPeak" file format can be found here https://github.com/taoliu/MACS)
less -S PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak STEP 8 - Remove peaks in low-complexity or repeated regions and on contigs.
- Exercise A: use Bedtools to filter out uninteresting peaks
/apps/well/bedtools/2.24.0/intersectBed -a PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak -b encode_blacklisted_regions.bed -v > PLAYGROUND/MCF7_GATA3_rep1_peaks.temp- Exercise B: remove all peaks located on contigs (i.e. not on chromosomes)
awk '!/^GL|hs/' PLAYGROUND/MCF7_GATA3_rep1_peaks.temp > PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered- Exercise C: check how many peaks were left after this filtering step
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
wc -l PLAYGROUND/MCF7_GATA3_rep1_peaks.filtered STEP 9 - Check the NSC and the RSC values for a given sample.
- Exercise A: use SPP to calculate NSC and RSC for the IP data of the MCF7_GATA2_rep1 sample
/apps/well/R/3.2.2/bin/Rscript run_spp.R -c=FULL_DATASET/BAM_FILTERED_files/MCF7_GATA3_rep1.filtered.bam -savp -out=PLAYGROUND/MCF7_GATA3_rep1.filtered.report -odir=PLAYGROUND/- Exercise B: open the output PDF file and explore the report file (help can be found in the SPP_instructions.txt file)
evince PLAYGROUND/MCF7_GATA3_rep1.filtered.pdf
less -S PLAYGROUND/MCF7_GATA3_rep1.filtered.report STEP 10 - Check the IDR across replicate peaksets.
- Exercise A: calculate the IDR for the MCF7_GATA2_rep1 and _rep2 samples
module load R/3.3.1
Rscript batch-consistency-analysis.r FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep2_peaks.ExcludeBlackList.narrowPeak -1 PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2 0 F p.value - Exercise B: generate a scatterplot to show the IDR values across the two replicates
gnuplotset log x
set log y
set palette model RGB defined ( 0 'red', 1 'black' )
unset key
unset colorbox
plot 'PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt' u 5:9:($11 <= 0.01 ? 0 : 1 ) with points palette
exit- Exercise C: plot the cumulative distribution of the IDR
Rinp <- read.table('PLAYGROUND/MCF7_GATA3_rep1_vs_MCF7_GATA3_rep2-overlapped-peaks.txt', header=TRUE)
peaks_num <- cumsum(table(inp$IDR))
idr_values <- names(cumsum(table(inp$IDR)))
plot(peaks_num, idr_values, xlab='num of significant peaks', ylab='IDR')
quit()module unload R/3.3.1 STEP 11 - Use DiffBind to perform a differential binding analysis.
- Exercise A: explore the config file (pre-made) with the help of the package manual (https://bioconductor.org/packages/release/bioc/manuals/DiffBind/man/DiffBind.pdf)
less -S diffbind_SampleInfo.csv- Exercise B: launch DiffBind analysis on the samples
module load R/3.2.2
export R_LIBS=/apps/htseq/lib/R/3.2.2/library/
Rlibrary(DiffBind)
my_data = dba(sampleSheet="diffbind_SampleInfo.csv")
dba.plotVenn(my_data, my_data$masks$GATA3, main="Peaks overlap")
my_data = dba.count(my_data, minOverlap=2, bCorPlot=FALSE)
my_data = dba.contrast(my_data, minMembers=2, categories=DBA_TISSUE)
my_data # to see the contrast in the DBA object
my_data = dba.analyze(my_data)
dba.report(my_data, th=1, ext="txt", file="sites", initString="PLAYGROUND/diffbind")
dba.plotMA(my_data)
dba.plotPCA(my_data, label=DBA_ID)
quit()module unload R/3.2.2
export R_LIBS=''- Exercise C: have a look at the output file generated using DiffBind
less -S PLAYGROUND/diffbind_sites.txt STEP 12 - Use ChIPseeker to annotate peaks and do pathway enrichment analysis.
- Exercise A: load MCF7_GATA3_rep1 peaks and plot their coverage along the genome
module load R/3.3.1
Rlibrary(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(ReactomePA)
peaks <- readPeakFile('FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak', header=F)
seqlevels(peaks) <- sapply(peaks@seqnames@values, function(x) paste('chr', x, sep=''))
covplot(peaks, weightCol="V5")- Exercise B: plot peak distances from their nearest TSSs
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(peaks, windows=promoter)
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")- Exercise C: generate a peak-to-annotation piechart summarising the portion of peaks annotated to a given feature
peakAnno <- annotatePeak(peaks, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno) - Exercise D: plot the top 10 enriched terms from Reactome
genes <- seq2gene(peaks, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathways <- enrichPathway(genes)
barplot(pathways, showCategory=10)
quit()module unload R/3.3.1 STEP 13 - Use MEME to do motif search (both known and de novo).
- Exercise A: explore the PWM file in MEME format
less -S jaspar_PWMs.meme- Exercise B: extract the top 100 peaks from the MCF7_GATA3_rep1 sample, then retrieve their sequences in FASTA format
samtools faidx hs37d5.fasta $(sort -k8nr FULL_DATASET/HQ_PEAK_files/MCF7_GATA3_rep1_peaks.ExcludeBlackList.narrowPeak | head -100 | awk 'END{print ""};{printf $1":"$2"-"$3" "}') > PLAYGROUND/top100_peaks.fa- Exercise C: run FIMO on the top 100 peaks to find binding sites for known transcription factors; then, have a look at the results in the output file
/apps/well/meme/4.11.2_2/bin/fimo --skip-matched-sequence jaspar_PWMs.meme PLAYGROUND/top100_peaks.fa 2> PLAYGROUND/top100_peaks.fimo.log > PLAYGROUND/top100_peaks.fimo.TFBSs
less -S PLAYGROUND/top100_peaks.fimo.TFBSs- Exercise D: run MEME to perform a de novo motif search across the peakset (NOTE: no negative sequences are used for MEME, the positive sequences are shuffled to create the negative set)
. /apps/well/perl/5.16.3/apps-well-perl-5.16.3.sh
/apps/well/meme/4.11.2_2/bin/meme-chip -oc PLAYGROUND/MEME_RESULTS/ -seed 1234 -ccut 0 -meme-minw 6 -meme-maxw 20 -meme-nmotifs 1 -fimo-skip PLAYGROUND/top100_peaks.fa- Exercise E: explore the MEME output results
firefox --no-remote PLAYGROUND/MEME_RESULTS/index.html STEP 14 - Convert BEDGRAPH to BIGWIG format for data visualization.
- Exercise A: use bedGraphToBigWig to generate a BIGWIG file
/apps/well/irap/git-gcc4.7.2/bin/bedGraphToBigWig PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bdg hg19_chrom_sizes.txt PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw- Exercise B: explore the data generated so far (BIGWIG and BED files) in IGV
java -Xmx16G -jar /apps/well/igv/2.3.32/igv.jarThen select:
- Genomes -> Load Genome from File... -> /well/workshop/workshop01/hs37d5.fasta
- File -> Load from File... -> /well/workshop/workshop01/hs37d5.refGene.txt.gz
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_treat_pileup.bw
- File -> Load from File... -> /well/workshop/workshop01/PLAYGROUND/MCF7_GATA3_rep1_peaks.narrowPeak
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_GATA3_rep1.filtered.bam
- File -> Load from File... -> /well/workshop/workshop01/FULL_DATASET/BAM_FILTERED_file/MCF7_input_rep1.filtered.bam
Finally, type PGR in the search box to focus on region 11:100,898,355-101,002,544 and compare it with Fig. 8 of the paper published using these data (see link above).
Thank you for attending this course, I hope you enjoyed it!
