Introduction to Next-Generation Sequencing: from sequencing to variant calling

Performing an analysis on whole-genome or exome sequencing data requires a series of steps that almost systematically include quality control, sequence alignment and some kind of variant calling and annotation. For each of these steps, a number of well-established and regularly-updated tools exist, with new ones being released all the time. It is left up to the researcher to explore each of these and find the best fit for a given analysis. Today we will use  BWA  for alignment,  Samtools  for variant calling and  Picard  for duplicate removal. Filtering will be performed using the online version of  ANNOVAR  which will allow us to extract valuable data from various large databases without needing to download any of them. It is important to stress we could have also used  Bowtie2  for alignment,  GATK HaplotypeCaller  for variant calling and  GEMINI  for annotation. We strongly encourage participants to evaluate other existing options in their own time.

Downloading the Human Genome Reference
(and preparing it for fast read alignment this afternoon)

The Human Genome Reference is a crucial part of any human-centred sequencing analysis. It is required for alignment and variant calling and (to some degree) annotation. In this class, we will use the latest version: HG38.
(note: many existing studies still use HG19. If working from existing data, always check the version against which reads were aligned)
The human genome is a relatively big file, so let's download it now!

Instructions:

Go to: UCSC Genome Browser portal downloads > Humans > hg38 > Full data set > hg38.fa.gz
Save the genome reference in your working folder today for ease of access. You can use  wget [LINK]  to download it directly from the command line.
(note: the file extension should be '.fa.gz', you can unzip the fasta file using  gunzip   but if you merely wish to inspect it,  less   will do!)

Later, we will need the human genome reference for sequence alignment (more on that later...) which requires us to compress the reference into an FM-index. This takes a while (>40 mins on a personal computer), so it is better to start now...

...and for this, we need to run BWA

Question:

Only underline the blue box once you have tried to work out an answer yourself and want to check it. When you are ready, run  bwa 


FASTA/FASTQ format

For this step, we will download a pair of 'small' FASTQ files, each containing either read of various pairs (i.e. paired-end reads with IDs ending in :1 and :2).

Question:

  1. On opening the file and looking at the first few lines, can we ascertain whether the Phred quality score encoding is Phred+33 or Phred+64? If so, how?
  2. Can you convert the first five quality score of the first entry back into numbers (you would never need to do this by hand, but it helps to understand).

Here is the guide mentioned earlier:

Phred conversion table from wiki

If you're still stuck, consider this: do Phred+33 and Phred+64 strictly use the same ASCII characters?

First step: Quality Control

We've previously mentioned FastQC and PRINSEQ. Another tool that provides NGS quality control is NGS QC toolkit. How does it compare to FastQC? Let's take a look.

Instructions

Question:

  1. Which metrics are similar between tools?
  2. Which metrics are different between tools?

Do not worry if you are not able to complete this section before the end of the morning session. It is more important that you start the next section if the afternoon session has begun. You can always return to this section later.


Sequence alignment with BWA

Various tools exist for fast alignment, each with its own strengths and weaknesses. For this practical we will continue to use  bwa  (Burrows-Wheeler Aligner).
We already downloaded the HG38 reference this morning and created a compressed version of it (an FM index, notice the  .amp ,  .ann ,  .bwt ,  .pac ,  .sa  files in your work directory).

Now we are ready to run BWA once more!

Questions:

Only underline the blue box once you have tried to work out an answer yourself and want to check it. When you are ready, run  bwa 

  1. Write a command to generate an alignment in SAM format from paired-end reads
  2.  bwa mem hg38.fa.gz HG02621.mini.1.fq HG02621.mini.2.fq > HG02621.sam

Variant calling and preprocessing with Samtools and Picard

In the previous section, we generated a SAM file, which is an alignment (to HG38) of our reads. We now want to compress our file to save up hard disk space and prepare it for variant calling. For this, we use Samtools .

Questions:

For this afternoon session, answers will not be provided, although hints can be given. The best way to find out if your answer is correct is to run  samtools/picard  and see what happens. I will gladly provide answers by e-mail after the end of the course if needed: miossec@well.ox.ac.uk

  1. Which reference sequence should we use in this step? Would any version do? Why?
  2. Using Samtools most recent documentation (or by typing  samtools  in the command line), write a command to sort the reads in your SAM file. Additionally, you must:
  3. If you attempt to run the  samtools index  command on the previous file, you will get an error. Why? What do you need to change?
  4. Make the needed change above and run  samtools index 
  5. Now using  picard  (See documentation), mark all duplicates so that they can be ignored in the next steps   picard   
  6. Returning to  samtools and generate a VCF file

You now have a VCF file that you can run through wANNOVAR for further annotation and filtering. Using the default parameters, submit your VCF file for annotation.

The result should be a .csv type file which you can open in Excel or equivalent or in your terminal if you're comfortable with awk, grep and/or sed.
You may now explore how different options leads to shorter manageable lists from which we can derive hypotheses. Use the remaining time today (if any) to explore some of the databases used by wANNOVAR and touched upon briefly this afternoon.

This concludes the practical part of this course. Good luck with any of your future NGS analyses!