RNASeq Tutorial

Introduction

This is an RNAseq tutorial introducing the basics of RNAseq data analysis. For this exercise, we will use a dataset generated at the Oxford Genomics Centre (OGC). This is currently an unpublished dataset and belongs to OGC, Wellcome Centre for Human Genetics (WCHG).

In this tutorial we will -

Familiarize ourselves with FASTQ and BAM files
Align FASTQ files to a reference genome, using splice-aware RNASeq aligner, Hisat2
Sort BAM files by read names or genomic coordinates, Samtools
QC the BAM alignment data, Picard CollectRnaSeqMetrics
Generate Count tables from BAM alignment, featureCounts
EM based quantification of genes and transcripts, StringTie
Alignment-free quantification using transcriptome reference, Salmon

Dataset description

The dataset comprises RNAseq data from 12 samples of spike-ins data. These are the commonly used 92 ERCC spike-ins, generally used for gene expression quantification, and 7 SIRV spike-in genes, used for isoform level quantification of 68 isoforms.

Instructions for Windows users

You’ll be assigned a user name (e.g. workshop01) and a password. username: workshop## password: bsgrnaseq

Download and install SmarTTY
Launch SmarTTY terminal
Click on New SSH Connection
In the pop-up window, type:
- Host Name: rescomp.well.ox.ac.uk
- User Name: (your user name:workshop##)
- Authentication Method
- Password: bsgrnaseq
Click Connect
You will be asked to change the password.
- LDAP: bsgrnaseq
- New Password:
- Confirm New password: please change the password to something you remember, ideally like Workshop01&

Instructions for Mac/Linux users

You’ll be assigned a user name (e.g. workshop01) and a password. username: workshop## password: bsgrnaseq

Open a terminal and type ssh <your_user_name@rescomp.well.ox.ac.uk>
When asked for password, type:
- Password: bsgrnaseq
Enter
You will be asked to change the password.
- LDAP: bsgrnaseq
- New Password:
- Confirm New password: please change the password to something you remember, ideally like Workshop01&

Getting familiar with your home space

Copy-paste or type the following commands to explore your home directory.

Note: The text after the hash(#) is explanatory text for the command. Don’t copy-paste these lines.

ls
# Lists all files and folders in the home directory

cd /well/workshop/workshop##/
# Change directory to go to your workshop folder on /well

ls -l FASTQ/
# Lists all files and folders in the FASTQ/ directory

ls -l GTF/
# Lists the gene annotation file in the GTF/ directory

ls -l genome/
# Lists the fasta file in the genome/ directory

The data alignment and BAM sorting steps are generally run on a High-Performance Computing cluster (HPC). Use of a computing cluster gives access to multiple high-performance processors thereby enabling parallel processing for rapid analysis. As we will be unable to use the cluster today, the sorted alignment files have already been prepared and can be found in the processed_dataset/alignment_files/ folders. The alignment_files/ folder contains 12 folders with the samplenames.

Similarly the count files of each counting tool are available in processed_dataset/featureCount_output/, processed_dataset/salmon_output/, processed_dataset/stringtie_output/

We will make a playground/ folder with a small extract of the original dataset, which can be used for trying the different commands.

Prerequisites

Check if the following commands work:

hisat2 --version

/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -v

samtools --version
# /apps/htseq/samtools-1.4/samtools --version

bedtools --version
#/apps/htseq/bedtools-2.26.0/bin/bedtools --version

picard-Xmx8g CollectRnaSeqMetrics --version
# /apps/well/picard-tools/2.0.1/picard-Xmx8g CollectRnaSeqMetrics --version

salmon --version

stringtie --version

Step 1: Getting familiar with the FASTQ file format

Exercise A : Look at the first few lines of a FASTQ file

cat FASTQ/MAS5283A10_1.fastq | head -8
cat FASTQ/MAS5283A10_2.fastq | head -8
# Each read record is stored in 4 lines

Exercise B : Extract readnames from the file

grep '^@' FASTQ/MAS5283A10_1.fastq | head
# All readnames begin with **@**
grep '^@' FASTQ/MAS5283A10_2.fastq | head
# Note /1 and /2 at the end of readnames. These represent read1 and read2 respectively.

Exercise C : Count number of reads in a FASTQ file

grep '^@' FASTQ/MAS5283A10_1.fastq | wc
# Number of lines that begin with '@' in the file
cat FASTQ/MAS5283A10_1.fastq | awk '{s++} END {print s/4}'
# Another method to count all lines and output number of reads

Exercise D : Extract 1000 reads from a full FASTQ file and write in playground

mkdir playground
cat FASTQ/MAS5283A10_1.fastq | head -4000 > playground/MAS5283A10_1000_read1.fastq 
cat FASTQ/MAS5283A10_2.fastq | head -4000 > playground/MAS5283A10_1000_read2.fastq

Step 2: Build the Genome Index

As we will use hisat2 for alignment, here we build a reference hisat2 index of the spike-in genome

mkdir reference/
# Make a new directory to write the reference indexes in

hisat2-build genome/ERCC_SIRV_151124a.fa reference/ERCC_SIRV_151124a
# Build the reference index for reference fasta file

Step 3: Align FASTQ reads to reference Genome


cd playground
# Change directory to **/well/workshop/workshop##/playground**

hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt
# By default hisat2 writes a SAM file to the screen

hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt | /apps/htseq/samtools-1.4/samtools view -bh -o MAS5283A10_1000_hisat_unsorted.bam -S -
# Align 1000 read FASTQ dataset to reference

head MAS5283A10_1000_align_summary.txt
# View the alignment summary report

Step 4: Sort and index the 1000 read bam alignment file


samtools sort -o MAS5283A10_1000_hisat_sorted.bam MAS5283A10_1000_hisat_unsorted.bam

samtools index MAS5283A10_1000_hisat_sorted.bam

Step 5: QC the 1000 read bam alignment file


samtools flagstat MAS5283A10_1000_hisat_sorted.bam
# Check overall read alignment metrics

samtools idxstats MAS5283A10_1000_hisat_sorted.bam
# Check alignment metrics to each "chromosome" in the reference

mkdir alignment_metrics

picard-Xmx8g CollectRnaSeqMetrics REF_FLAT=../GTF/ERCC_SIRV_151124a.refFLAT INPUT=MAS5283A10_1000_hisat_sorted.bam STRAND_SPECIFICITY=NONE OUTPUT=alignment_metrics/CollectRnaSeqMetrics.txt 2> alignment_metrics/CollectRnaSeqMetrics.err

head alignment_metrics/CollectRnaSeqMetrics.txt

Step 6: Count reads aligning to gene features, featureCounts

/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -T 4 -p -F GTF -a ../GTF/ERCC_SIRV_151124a.gtf -o MAS5283A10_1000_featureCount.txt MAS5283A10_1000_hisat_unsorted.bam > MAS5283A10_1000_featureCount.log 2>&1

head MAS5283A10_1000_featureCount.log

head MAS5283A10_1000_featureCount.txt.summary

head MAS5283A10_1000_featureCount.txt

Step 7: Isoform quantification with stringtie


mkdir -p stringtie_output/MAS5283A10_1000

stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -e -C stringtie_output/MAS5283A10_1000/cov_refs.gtf -A stringtie_output/MAS5283A10_1000/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000 

head stringtie_output/MAS5283A10_1000/stringtie.gtf

Step 8: Alignment free quantification with Salmon


salmon index -t ../genome/ERCC_SIRV_C_151124a.transcripts.fa -i ../reference/ERCC_SIRV_C_151124a.transcripts --type quasi -k 31 
# Build transcriptome index

salmon quant -i ../reference/ERCC_SIRV_C_151124a.transcripts -l A -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq -o salmon_output
# Quantify transcript expression

head salmon_output/quant.sf 
head salmon_output/logs/salmon_quant.log

Step 9: Run all steps sequentially for a full sample FASTQ file (~ 1-2 minutes)

cd ..
pwd
# you should be in your main working directory /well/workshop/workshop##

head /well/workshop/examples/Alignment_count_script.sh

/well/workshop/examples/Alignment_count_script.sh single_sample.txt

Step 10: (Optional - Do this later) You can also run all the steps for the full dataset (~ 20 minutes)


/well/workshop/examples/Alignment_count_script.sh samplelist.txt

Step 11: Summary statistics for Alignment and Read distribution

sample	Hisat	Salmon	Total_reads	Hisat_MultiMapping	Assigned_Exon_Features	Assigned_NoFeatures
MAS5283A1	96.01%	97.13%	786973	31046	761926	12
MAS5283A10	95.54%	97.24%	560856	32296	832325	21
MAS5283A11	95.46%	97.05%	873195	28597	702773	15
MAS5283A12	95.40%	97.25%	821839	25737	675273	18
MAS5283A2	95.74%	97.70%	814672	18173	544431	7
MAS5283A3	95.51%	97.21%	834561	32837	840650	7
MAS5283A4	95.37%	97.49%	615122	29661	790134	12
MAS5283A5	95.68%	97.59%	714988	28973	788111	14
MAS5283A6	95.66%	97.18%	880733	32144	807580	13
MAS5283A7	95.70%	96.68%	862366	27235	594189	7
MAS5283A8	95.71%	97.17%	726273	25989	691615	12
MAS5283A9	95.39%	96.92%	697368	31431	847161	14