This is an RNAseq tutorial introducing the basics of RNAseq data analysis. For this exercise, we will use a dataset generated at the Oxford Genomics Centre (OGC). This is currently an unpublished dataset and belongs to OGC, Wellcome Centre for Human Genetics (WCHG).
In this tutorial we will -
The dataset comprises RNAseq data from 12 samples of spike-ins data. These are the commonly used 92 ERCC spike-ins, generally used for gene expression quantification, and 7 SIRV spike-in genes, used for isoform level quantification of 68 isoforms.
You’ll be assigned a user name (e.g. workshop01) and a password. username: workshop## password: bsgrnaseq
You’ll be assigned a user name (e.g. workshop01) and a password. username: workshop## password: bsgrnaseq
ssh <your_user_name@rescomp.well.ox.ac.uk>Copy-paste or type the following commands to explore your home directory.
Note: The text after the hash(#) is explanatory text for the command. Don’t copy-paste these lines.
ls
# Lists all files and folders in the home directory
cd /well/workshop/workshop##/
# Change directory to go to your workshop folder on /well
ls -l FASTQ/
# Lists all files and folders in the FASTQ/ directory
ls -l GTF/
# Lists the gene annotation file in the GTF/ directory
ls -l genome/
# Lists the fasta file in the genome/ directoryThe data alignment and BAM sorting steps are generally run on a High-Performance Computing cluster (HPC). Use of a computing cluster gives access to multiple high-performance processors thereby enabling parallel processing for rapid analysis. As we will be unable to use the cluster today, the sorted alignment files have already been prepared and can be found in the processed_dataset/alignment_files/ folders. The alignment_files/ folder contains 12 folders with the samplenames.
Similarly the count files of each counting tool are available in processed_dataset/featureCount_output/, processed_dataset/salmon_output/, processed_dataset/stringtie_output/
We will make a playground/ folder with a small extract of the original dataset, which can be used for trying the different commands.
Check if the following commands work:
hisat2 --version
/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -v
samtools --version
# /apps/htseq/samtools-1.4/samtools --version
bedtools --version
#/apps/htseq/bedtools-2.26.0/bin/bedtools --version
picard-Xmx8g CollectRnaSeqMetrics --version
# /apps/well/picard-tools/2.0.1/picard-Xmx8g CollectRnaSeqMetrics --version
salmon --version
stringtie --versioncat FASTQ/MAS5283A10_1.fastq | head -8
cat FASTQ/MAS5283A10_2.fastq | head -8
# Each read record is stored in 4 linesgrep '^@' FASTQ/MAS5283A10_1.fastq | head
# All readnames begin with **@**
grep '^@' FASTQ/MAS5283A10_2.fastq | head
# Note /1 and /2 at the end of readnames. These represent read1 and read2 respectively.grep '^@' FASTQ/MAS5283A10_1.fastq | wc
# Number of lines that begin with '@' in the file
cat FASTQ/MAS5283A10_1.fastq | awk '{s++} END {print s/4}'
# Another method to count all lines and output number of readsmkdir playground
cat FASTQ/MAS5283A10_1.fastq | head -4000 > playground/MAS5283A10_1000_read1.fastq
cat FASTQ/MAS5283A10_2.fastq | head -4000 > playground/MAS5283A10_1000_read2.fastq As we will use hisat2 for alignment, here we build a reference hisat2 index of the spike-in genome
mkdir reference/
# Make a new directory to write the reference indexes in
hisat2-build genome/ERCC_SIRV_151124a.fa reference/ERCC_SIRV_151124a
# Build the reference index for reference fasta file
cd playground
# Change directory to **/well/workshop/workshop##/playground**
hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt
# By default hisat2 writes a SAM file to the screen
hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt | /apps/htseq/samtools-1.4/samtools view -bh -o MAS5283A10_1000_hisat_unsorted.bam -S -
# Align 1000 read FASTQ dataset to reference
head MAS5283A10_1000_align_summary.txt
# View the alignment summary report
samtools sort -o MAS5283A10_1000_hisat_sorted.bam MAS5283A10_1000_hisat_unsorted.bam
samtools index MAS5283A10_1000_hisat_sorted.bam
samtools flagstat MAS5283A10_1000_hisat_sorted.bam
# Check overall read alignment metrics
samtools idxstats MAS5283A10_1000_hisat_sorted.bam
# Check alignment metrics to each "chromosome" in the reference
mkdir alignment_metrics
picard-Xmx8g CollectRnaSeqMetrics REF_FLAT=../GTF/ERCC_SIRV_151124a.refFLAT INPUT=MAS5283A10_1000_hisat_sorted.bam STRAND_SPECIFICITY=NONE OUTPUT=alignment_metrics/CollectRnaSeqMetrics.txt 2> alignment_metrics/CollectRnaSeqMetrics.err
head alignment_metrics/CollectRnaSeqMetrics.txt/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -T 4 -p -F GTF -a ../GTF/ERCC_SIRV_151124a.gtf -o MAS5283A10_1000_featureCount.txt MAS5283A10_1000_hisat_unsorted.bam > MAS5283A10_1000_featureCount.log 2>&1
head MAS5283A10_1000_featureCount.log
head MAS5283A10_1000_featureCount.txt.summary
head MAS5283A10_1000_featureCount.txt
mkdir -p stringtie_output/MAS5283A10_1000
stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -e -C stringtie_output/MAS5283A10_1000/cov_refs.gtf -A stringtie_output/MAS5283A10_1000/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000
head stringtie_output/MAS5283A10_1000/stringtie.gtf
salmon index -t ../genome/ERCC_SIRV_C_151124a.transcripts.fa -i ../reference/ERCC_SIRV_C_151124a.transcripts --type quasi -k 31
# Build transcriptome index
salmon quant -i ../reference/ERCC_SIRV_C_151124a.transcripts -l A -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq -o salmon_output
# Quantify transcript expression
head salmon_output/quant.sf
head salmon_output/logs/salmon_quant.logcd ..
pwd
# you should be in your main working directory /well/workshop/workshop##
head /well/workshop/examples/Alignment_count_script.sh
/well/workshop/examples/Alignment_count_script.sh single_sample.txt
/well/workshop/examples/Alignment_count_script.sh samplelist.txt| sample | Hisat | Salmon | Total_reads | Hisat_MultiMapping | Assigned_Exon_Features | Assigned_NoFeatures |
|---|---|---|---|---|---|---|
| MAS5283A1 | 96.01% | 97.13% | 786973 | 31046 | 761926 | 12 |
| MAS5283A10 | 95.54% | 97.24% | 560856 | 32296 | 832325 | 21 |
| MAS5283A11 | 95.46% | 97.05% | 873195 | 28597 | 702773 | 15 |
| MAS5283A12 | 95.40% | 97.25% | 821839 | 25737 | 675273 | 18 |
| MAS5283A2 | 95.74% | 97.70% | 814672 | 18173 | 544431 | 7 |
| MAS5283A3 | 95.51% | 97.21% | 834561 | 32837 | 840650 | 7 |
| MAS5283A4 | 95.37% | 97.49% | 615122 | 29661 | 790134 | 12 |
| MAS5283A5 | 95.68% | 97.59% | 714988 | 28973 | 788111 | 14 |
| MAS5283A6 | 95.66% | 97.18% | 880733 | 32144 | 807580 | 13 |
| MAS5283A7 | 95.70% | 96.68% | 862366 | 27235 | 594189 | 7 |
| MAS5283A8 | 95.71% | 97.17% | 726273 | 25989 | 691615 | 12 |
| MAS5283A9 | 95.39% | 96.92% | 697368 | 31431 | 847161 | 14 |
Thanks to Silvia Salatino, Jon Diprose and John Broxholme for help in preparing the pipeline and course materials.
“And that’s a wrap for this session”