This is an RNAseq tutorial introducing the basics of RNAseq data analysis. For this exercise, we will use a dataset generated at the Oxford Genomics Centre (OGC). This is currently an unpublished dataset and belongs to OGC, Wellcome Centre for Human Genetics (WHG).
In this tutorial we will -
The dataset comprises RNAseq data from 12 samples of spike-ins data. These are the commonly used 92 ERCC spike-ins, generally used for gene expression quantification, and 7 SIRV spike-in genes, used for isoform level quantification of 68 isoforms.
You’ll be assigned a user name (e.g. workshop01) and a password.
username: workshop##
password: RNASeqNov2018
note: You might have to type-in the password if Copy+Paste (Ctrl+C and Ctrl+V) do not work.
please change the password to something you remember, ideally like Workshop01&
You’ll be assigned a user name (e.g. workshop01) and a password.
username: workshop##
password: RNASeqNov2018
ssh <your_user_name@rescomp.well.ox.ac.uk>
ssh <your_user_name@rescomp.well.ox.ac.uk>
Copy-paste or type the following commands to explore your home directory.
Note: The text after the hash(#) is explanatory text for the command. Don’t copy-paste these lines.
ls
# Lists all files and folders in the home directory
cd /well/workshop/workshop##/
# Change directory to go to your workshop folder on /well
ls -l FASTQ/
# Lists all files and folders in the FASTQ/ directory
ls -l GTF/
# Lists the gene annotation file in the GTF/ directory
ls -l genome/
# Lists the fasta file in the genome/ directory
The data alignment and BAM sorting steps are generally run on a High-Performance Computing cluster (HPC). Use of a computing cluster gives access to multiple high-performance processors thereby enabling parallel processing for rapid analysis. As we will be unable to use the cluster today, the sorted alignment files have already been prepared and can be found in the processed_dataset/alignment_files/ folders. The alignment_files/
folder contains 12 folders with the samplenames.
Similarly the count files of each counting tool are available in processed_dataset/featureCount_output/, processed_dataset/salmon_output/, processed_dataset/stringtie_output/
We will make a playground/ folder with a small extract of the original dataset, which can be used for trying the different commands.
Check if the following commands work:
hisat2 --version
/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -v
samtools --version
# /apps/htseq/samtools-1.4/samtools --version
bedtools --version
#/apps/htseq/bedtools-2.26.0/bin/bedtools --version
picard-Xmx8g CollectRnaSeqMetrics --version
# /apps/well/picard-tools/2.0.1/picard-Xmx8g CollectRnaSeqMetrics --version
salmon --version
stringtie --version
cat FASTQ/MAS5283A10_1.fastq | head -8
cat FASTQ/MAS5283A10_2.fastq | head -8
# Each read record is stored in 4 lines
grep '^@' FASTQ/MAS5283A10_1.fastq | head
# All readnames begin with **@**
grep '^@' FASTQ/MAS5283A10_2.fastq | head
# Note /1 and /2 at the end of readnames. These represent read1 and read2 respectively.
grep '^@' FASTQ/MAS5283A10_1.fastq | wc
# Number of lines that begin with '@' in the file
cat FASTQ/MAS5283A10_1.fastq | awk '{s++} END {print s/4}'
# Another method to count all lines and output number of reads
mkdir playground
cat FASTQ/MAS5283A10_1.fastq | head -4000 > playground/MAS5283A10_1000_read1.fastq
cat FASTQ/MAS5283A10_2.fastq | head -4000 > playground/MAS5283A10_1000_read2.fastq
As we will use hisat2 for alignment, here we build a reference hisat2 index of the spike-in genome
mkdir reference/
# Make a new directory to write the reference indexes in
hisat2-build genome/ERCC_SIRV_151124a.fa reference/ERCC_SIRV_151124a
# Build the reference index for reference fasta file
cd playground
# Change directory to **/well/workshop/workshop##/playground**
hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt
# By default hisat2 writes a SAM file to the screen
hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt | /apps/htseq/samtools-1.4/samtools view -bh -o MAS5283A10_1000_hisat_unsorted.bam -S -
# Align 1000 read FASTQ dataset to reference
head -n20 MAS5283A10_1000_align_summary.txt
# View the alignment summary report
samtools sort -o MAS5283A10_1000_hisat_sorted.bam MAS5283A10_1000_hisat_unsorted.bam
samtools index MAS5283A10_1000_hisat_sorted.bam
samtools flagstat MAS5283A10_1000_hisat_sorted.bam
# Check overall read alignment metrics
samtools idxstats MAS5283A10_1000_hisat_sorted.bam
# Check alignment metrics to each "chromosome" in the reference
mkdir alignment_metrics
# Make a new directory to write alignment metrics
picard-Xmx8g CollectRnaSeqMetrics REF_FLAT=../GTF/ERCC_SIRV_151124a.refFLAT INPUT=MAS5283A10_1000_hisat_sorted.bam STRAND_SPECIFICITY=NONE OUTPUT=alignment_metrics/CollectRnaSeqMetrics.txt 2> alignment_metrics/CollectRnaSeqMetrics.err
# Run Picard CollectRnaSeqMetrics and write output to alignment_metrics/CollectRnaSeqMetrics.txt
head -n20 alignment_metrics/CollectRnaSeqMetrics.txt
# View the first 20 lines of alignment_metrics/CollectRnaSeqMetrics.txt
less -S alignment_metrics/CollectRnaSeqMetrics.txt
# Open the file alignment_metrics/CollectRnaSeqMetrics.txt with less pager
:q
#Exit the "less" pager with "q" or "ZZ"
samtools idxstats MAS5283A10_1000_hisat_sorted.bam >
less -S alignment_metrics/Samtools_idxstats_Metrics.txt
/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -T 4 -p -F GTF -a ../GTF/ERCC_SIRV_151124a.gtf -o MAS5283A10_1000_featureCount.txt MAS5283A10_1000_hisat_unsorted.bam > MAS5283A10_1000_featureCount.log 2>&1
#Count reads aligning to exon features with featureCounts
less -S MAS5283A10_1000_featureCount.log
#View standard output of featureCounts
:q
#Exit the "less" pager with "q" or "ZZ"
less -S MAS5283A10_1000_featureCount.txt.summary
#View summary statistics of featureCounts read assignment
:q
#Exit the "less" pager with "q" or "ZZ"
less -S MAS5283A10_1000_featureCount.txt
#View featureCounts results
:q
#Exit the "less" pager with "q" or "ZZ"
mkdir -p stringtie_output/MAS5283A10_1000_noAssembly
#Make a new directory to write the output of StringTie
stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000_noAssembly/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -e -C stringtie_output/MAS5283A10_1000_noAssembly/cov_refs.gtf -A stringtie_output/MAS5283A10_1000_noAssembly/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000_noAssembly
#Run StringTie quantification (without assembly)
less -S stringtie_output/MAS5283A10_1000_noAssembly/stringtie.gtf
#View StringTie GTF file (without assembly)
:q
#Exit the "less" pager with "q" or "ZZ"
less -S stringtie_output/MAS5283A10_1000_noAssembly/gene_abund.tab
#View StringTie gene abundance file (without assembly)
:q
#Exit the "less" pager with "q" or "ZZ"
less -S stringtie_output/MAS5283A10_1000_noAssembly/t_data.ctab
#View StringTie transcript abundance file (without assembly)
:q
#Exit the "less" pager with "q" or "ZZ"
######## Run StringTie in Assembly mode
mkdir -p stringtie_output/MAS5283A10_1000_Assembly
#Make a new directory to write the output of StringTie
stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000_Assembly/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -C stringtie_output/MAS5283A10_1000_Assembly/cov_refs.gtf -A stringtie_output/MAS5283A10_1000_Assembly/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000_Assembly
#Run StringTie quantification (with assembly)
less -S stringtie_output/MAS5283A10_1000_Assembly/stringtie.gtf
#View StringTie GTF file (with assembly)
less -S stringtie_output/MAS5283A10_1000_Assembly/gene_abund.tab
#View StringTie gene abundance file (with assembly)
:q
#Exit the "less" pager with "q" or "ZZ"
less -S stringtie_output/MAS5283A10_1000_Assembly/t_data.ctab
#View StringTie transcript abundance file (with assembly)
:q
#Exit the "less" pager with "q" or "ZZ"
salmon index -t ../genome/ERCC_SIRV_C_151124a.transcripts.fa -i ../reference/ERCC_SIRV_C_151124a.transcripts --type quasi -k 31
# Build transcriptome index
salmon quant -i ../reference/ERCC_SIRV_C_151124a.transcripts -l A -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq -o salmon_output
# Quantify transcript expression
less -S salmon_output/quant.sf
#View Salmon transcript abundance file
:q
#Exit the "less" pager with "q" or "ZZ"
head salmon_output/logs/salmon_quant.log
#View Salmon transcript log file
:q
#Exit the "less" pager with "q" or "ZZ"
cd ..
pwd
# you should be in your main working directory /well/workshop/workshop##
head /well/workshop/examples/Alignment_count_script.sh
/well/workshop/examples/Alignment_count_script.sh single_sample.txt
/well/workshop/examples/Alignment_count_script.sh samplelist.txt
sample | Hisat | Salmon | Total_reads | Hisat_MultiMapping | Assigned_Exon_Features | Assigned_NoFeatures |
---|---|---|---|---|---|---|
MAS5283A1 | 96.01% | 97.13% | 786973 | 31046 | 761926 | 12 |
MAS5283A10 | 95.54% | 97.24% | 560856 | 32296 | 832325 | 21 |
MAS5283A11 | 95.46% | 97.05% | 873195 | 28597 | 702773 | 15 |
MAS5283A12 | 95.40% | 97.25% | 821839 | 25737 | 675273 | 18 |
MAS5283A2 | 95.74% | 97.70% | 814672 | 18173 | 544431 | 7 |
MAS5283A3 | 95.51% | 97.21% | 834561 | 32837 | 840650 | 7 |
MAS5283A4 | 95.37% | 97.49% | 615122 | 29661 | 790134 | 12 |
MAS5283A5 | 95.68% | 97.59% | 714988 | 28973 | 788111 | 14 |
MAS5283A6 | 95.66% | 97.18% | 880733 | 32144 | 807580 | 13 |
MAS5283A7 | 95.70% | 96.68% | 862366 | 27235 | 594189 | 7 |
MAS5283A8 | 95.71% | 97.17% | 726273 | 25989 | 691615 | 12 |
MAS5283A9 | 95.39% | 96.92% | 697368 | 31431 | 847161 | 14 |
Thanks to Jon Diprose, Silvia Salatino and John Broxholme for help in preparing the pipeline and course materials.
“And that’s a wrap for this session”