RNASeq Tutorial

Eshita Sharma, PhD

2018-11-26

Introduction

This is an RNAseq tutorial introducing the basics of RNAseq data analysis. For this exercise, we will use a dataset generated at the Oxford Genomics Centre (OGC). This is currently an unpublished dataset and belongs to OGC, Wellcome Centre for Human Genetics (WHG).

In this tutorial we will -

Dataset description

The dataset comprises RNAseq data from 12 samples of spike-ins data. These are the commonly used 92 ERCC spike-ins, generally used for gene expression quantification, and 7 SIRV spike-in genes, used for isoform level quantification of 68 isoforms.

Instructions for Windows users

You’ll be assigned a user name (e.g. workshop01) and a password.

username: workshop##

password: RNASeqNov2018

note: You might have to type-in the password if Copy+Paste (Ctrl+C and Ctrl+V) do not work.

  1. Download and install SmarTTY
  2. Launch SmarTTY terminal
  3. Click on New SSH Connection
  4. In the pop-up window, type:
    • Host Name: rescomp.well.ox.ac.uk
    • User Name: (your user name:workshop##)
    • Authentication Method
    • Password: RNASeqNov2018
  5. Click Connect
  6. You will be asked to change the password.
    • LDAP: RNASeqNov2018
    • New Password:
    • Confirm New password:

please change the password to something you remember, ideally like Workshop01&

  1. You will be logged out.
  2. Connect again with same host name and user name
    • Host Name: rescomp.well.ox.ac.uk
    • User Name: (your user name:workshop##)
  3. Authentication Method, type your new password:
    • Password: new password

Instructions for Mac/Linux users

You’ll be assigned a user name (e.g. workshop01) and a password.

username: workshop##

password: RNASeqNov2018

  1. Open a terminal and type ssh <your_user_name@rescomp.well.ox.ac.uk>
  1. When asked for password, type:
    • Password: RNASeqNov2018
  2. Enter
  1. You will be asked to change the password.
    • LDAP: RNASeqNov2018
    • New Password:
    • Retype new password: please change the password to something you remember, ideally like Workshop01&
  1. You will be logged out.
  2. Login again with ssh <your_user_name@rescomp.well.ox.ac.uk>
  3. When asked for password, type your new password:
    • Password:

Getting familiar with your home space

Copy-paste or type the following commands to explore your home directory.

Note: The text after the hash(#) is explanatory text for the command. Don’t copy-paste these lines.

ls
# Lists all files and folders in the home directory

cd /well/workshop/workshop##/
# Change directory to go to your workshop folder on /well

ls -l FASTQ/
# Lists all files and folders in the FASTQ/ directory

ls -l GTF/
# Lists the gene annotation file in the GTF/ directory

ls -l genome/
# Lists the fasta file in the genome/ directory

The data alignment and BAM sorting steps are generally run on a High-Performance Computing cluster (HPC). Use of a computing cluster gives access to multiple high-performance processors thereby enabling parallel processing for rapid analysis. As we will be unable to use the cluster today, the sorted alignment files have already been prepared and can be found in the processed_dataset/alignment_files/ folders. The alignment_files/ folder contains 12 folders with the samplenames.

Similarly the count files of each counting tool are available in processed_dataset/featureCount_output/, processed_dataset/salmon_output/, processed_dataset/stringtie_output/

We will make a playground/ folder with a small extract of the original dataset, which can be used for trying the different commands.

Prerequisites

Check if the following commands work:

hisat2 --version

/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -v

samtools --version
# /apps/htseq/samtools-1.4/samtools --version

bedtools --version
#/apps/htseq/bedtools-2.26.0/bin/bedtools --version

picard-Xmx8g CollectRnaSeqMetrics --version
# /apps/well/picard-tools/2.0.1/picard-Xmx8g CollectRnaSeqMetrics --version

salmon --version

stringtie --version

Step 1: Getting familiar with the FASTQ file format

Exercise A : Look at the first few lines of a FASTQ file
cat FASTQ/MAS5283A10_1.fastq | head -8
cat FASTQ/MAS5283A10_2.fastq | head -8
# Each read record is stored in 4 lines
Exercise B : Extract readnames from the file
grep '^@' FASTQ/MAS5283A10_1.fastq | head
# All readnames begin with **@**
grep '^@' FASTQ/MAS5283A10_2.fastq | head
# Note /1 and /2 at the end of readnames. These represent read1 and read2 respectively.
Exercise C : Count number of reads in a FASTQ file
grep '^@' FASTQ/MAS5283A10_1.fastq | wc
# Number of lines that begin with '@' in the file
cat FASTQ/MAS5283A10_1.fastq | awk '{s++} END {print s/4}'
# Another method to count all lines and output number of reads
Exercise D : Extract 1000 reads from a full FASTQ file and write in playground
mkdir playground
cat FASTQ/MAS5283A10_1.fastq | head -4000 > playground/MAS5283A10_1000_read1.fastq 
cat FASTQ/MAS5283A10_2.fastq | head -4000 > playground/MAS5283A10_1000_read2.fastq 

Step 2: Build the Genome Index

As we will use hisat2 for alignment, here we build a reference hisat2 index of the spike-in genome

mkdir reference/
# Make a new directory to write the reference indexes in

hisat2-build genome/ERCC_SIRV_151124a.fa reference/ERCC_SIRV_151124a
# Build the reference index for reference fasta file

Step 3: Align FASTQ reads to reference Genome


cd playground
# Change directory to **/well/workshop/workshop##/playground**

hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt
# By default hisat2 writes a SAM file to the screen

hisat2 -x ../reference/ERCC_SIRV_151124a -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq 2> MAS5283A10_1000_align_summary.txt | /apps/htseq/samtools-1.4/samtools view -bh -o MAS5283A10_1000_hisat_unsorted.bam -S -
# Align 1000 read FASTQ dataset to reference

head -n20 MAS5283A10_1000_align_summary.txt
# View the alignment summary report

Step 4: Sort and index the 1000 read bam alignment file


samtools sort -o MAS5283A10_1000_hisat_sorted.bam MAS5283A10_1000_hisat_unsorted.bam

samtools index MAS5283A10_1000_hisat_sorted.bam

Step 5: QC the 1000 read bam alignment file


samtools flagstat MAS5283A10_1000_hisat_sorted.bam
# Check overall read alignment metrics

samtools idxstats MAS5283A10_1000_hisat_sorted.bam
# Check alignment metrics to each "chromosome" in the reference

mkdir alignment_metrics
# Make a new directory to write alignment metrics

picard-Xmx8g CollectRnaSeqMetrics REF_FLAT=../GTF/ERCC_SIRV_151124a.refFLAT INPUT=MAS5283A10_1000_hisat_sorted.bam STRAND_SPECIFICITY=NONE OUTPUT=alignment_metrics/CollectRnaSeqMetrics.txt 2> alignment_metrics/CollectRnaSeqMetrics.err
# Run Picard CollectRnaSeqMetrics and write output to alignment_metrics/CollectRnaSeqMetrics.txt

head -n20 alignment_metrics/CollectRnaSeqMetrics.txt
# View the first 20 lines of alignment_metrics/CollectRnaSeqMetrics.txt

less -S alignment_metrics/CollectRnaSeqMetrics.txt
# Open the file alignment_metrics/CollectRnaSeqMetrics.txt with less pager

:q
#Exit the "less" pager with "q" or "ZZ"

samtools idxstats MAS5283A10_1000_hisat_sorted.bam > 
less -S alignment_metrics/Samtools_idxstats_Metrics.txt

Step 6: Count reads aligning to gene features, featureCounts

/apps/htseq/subread-1.5.2-Linux-x86_64/bin/featureCounts -T 4 -p -F GTF -a ../GTF/ERCC_SIRV_151124a.gtf -o MAS5283A10_1000_featureCount.txt MAS5283A10_1000_hisat_unsorted.bam > MAS5283A10_1000_featureCount.log 2>&1
#Count reads aligning to exon features with featureCounts

less -S MAS5283A10_1000_featureCount.log
#View standard output of featureCounts

:q
#Exit the "less" pager with "q" or "ZZ"

less -S MAS5283A10_1000_featureCount.txt.summary
#View summary statistics of featureCounts read assignment

:q
#Exit the "less" pager with "q" or "ZZ"

less -S MAS5283A10_1000_featureCount.txt
#View featureCounts results 

:q
#Exit the "less" pager with "q" or "ZZ"

Step 7: Isoform quantification with StringTie


mkdir -p stringtie_output/MAS5283A10_1000_noAssembly
#Make a new directory to write the output of StringTie

stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000_noAssembly/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -e -C stringtie_output/MAS5283A10_1000_noAssembly/cov_refs.gtf -A stringtie_output/MAS5283A10_1000_noAssembly/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000_noAssembly 
#Run StringTie quantification (without assembly)

less -S stringtie_output/MAS5283A10_1000_noAssembly/stringtie.gtf
#View StringTie GTF file (without assembly)

:q
#Exit the "less" pager with "q" or "ZZ"

less -S stringtie_output/MAS5283A10_1000_noAssembly/gene_abund.tab
#View StringTie gene abundance file (without assembly)

:q
#Exit the "less" pager with "q" or "ZZ"

less -S stringtie_output/MAS5283A10_1000_noAssembly/t_data.ctab
#View StringTie transcript abundance file (without assembly)

:q
#Exit the "less" pager with "q" or "ZZ"


######## Run StringTie in Assembly mode
mkdir -p stringtie_output/MAS5283A10_1000_Assembly
#Make a new directory to write the output of StringTie

stringtie MAS5283A10_1000_hisat_sorted.bam -o stringtie_output/MAS5283A10_1000_Assembly/stringtie.gtf -p 4 --rf -G ../GTF/ERCC_SIRV_151124a.gtf -C stringtie_output/MAS5283A10_1000_Assembly/cov_refs.gtf -A stringtie_output/MAS5283A10_1000_Assembly/gene_abund.tab -c 1 -b stringtie_output/MAS5283A10_1000_Assembly 
#Run StringTie quantification (with assembly)

less -S stringtie_output/MAS5283A10_1000_Assembly/stringtie.gtf
#View StringTie GTF file (with assembly)

less -S stringtie_output/MAS5283A10_1000_Assembly/gene_abund.tab
#View StringTie gene abundance file (with assembly)

:q
#Exit the "less" pager with "q" or "ZZ"

less -S stringtie_output/MAS5283A10_1000_Assembly/t_data.ctab
#View StringTie transcript abundance file (with assembly)

:q
#Exit the "less" pager with "q" or "ZZ"

Step 8: Alignment free quantification with Salmon


salmon index -t ../genome/ERCC_SIRV_C_151124a.transcripts.fa -i ../reference/ERCC_SIRV_C_151124a.transcripts --type quasi -k 31 
# Build transcriptome index

salmon quant -i ../reference/ERCC_SIRV_C_151124a.transcripts -l A -1 MAS5283A10_1000_read1.fastq -2 MAS5283A10_1000_read2.fastq -o salmon_output
# Quantify transcript expression

less -S salmon_output/quant.sf 
#View Salmon transcript abundance file

:q
#Exit the "less" pager with "q" or "ZZ"

head salmon_output/logs/salmon_quant.log
#View Salmon transcript log file

:q
#Exit the "less" pager with "q" or "ZZ"

Step 9: Run all steps sequentially for a full sample FASTQ file (~ 1-2 minutes)

cd ..
pwd
# you should be in your main working directory /well/workshop/workshop##

head /well/workshop/examples/Alignment_count_script.sh

/well/workshop/examples/Alignment_count_script.sh single_sample.txt

Step 10: (Optional - Do this later) You can also run all the steps for the full dataset (~ 20 minutes)


/well/workshop/examples/Alignment_count_script.sh samplelist.txt

Step 11: Summary statistics for Alignment and Read distribution

sample Hisat Salmon Total_reads Hisat_MultiMapping Assigned_Exon_Features Assigned_NoFeatures
MAS5283A1 96.01% 97.13% 786973 31046 761926 12
MAS5283A10 95.54% 97.24% 560856 32296 832325 21
MAS5283A11 95.46% 97.05% 873195 28597 702773 15
MAS5283A12 95.40% 97.25% 821839 25737 675273 18
MAS5283A2 95.74% 97.70% 814672 18173 544431 7
MAS5283A3 95.51% 97.21% 834561 32837 840650 7
MAS5283A4 95.37% 97.49% 615122 29661 790134 12
MAS5283A5 95.68% 97.59% 714988 28973 788111 14
MAS5283A6 95.66% 97.18% 880733 32144 807580 13
MAS5283A7 95.70% 96.68% 862366 27235 594189 7
MAS5283A8 95.71% 97.17% 726273 25989 691615 12
MAS5283A9 95.39% 96.92% 697368 31431 847161 14

Acknowledgements

Thanks to Jon Diprose, Silvia Salatino and John Broxholme for help in preparing the pipeline and course materials.

“And that’s a wrap for this session”