Introduction to Next-Generation Sequencing: from sequencing to variant calling

Performing an analysis on whole-genome or exome sequencing data requires a series of steps that almost systematically include quality control, sequence alignment and some kind of variant calling and annotation. For each of these steps, a number of well-established and regularly-updated tools exist, with new ones being released all the time. It is left up to the researcher to explore each of these and find the best fit for a given analysis. Today we will use BWA for alignment, Samtools for variant calling and Picard for duplicate removal. Filtering will be performed using the online version of ANNOVAR which will allow us to extract valuable data from various large databases without needing to download any of them. It is important to stress we could have also used Bowtie2 for alignment, GATK HaplotypeCaller for variant calling and GEMINI for annotation. We strongly encourage participants to evaluate other existing options in their own time.

Downloading the Human Genome Reference
(and preparing it for fast read alignment this afternoon)

The Human Genome Reference is a crucial part of any human-centred sequencing analysis. It is required for alignment and variant calling and (to some degree) annotation. In this class, we will use the latest version: HG38.
(note: many existing studies still use HG19. If working from existing data, always check the version against which reads were aligned)
The human genome is a relatively big file, so let's download it now!

Instructions:

Go to: UCSC Genome Browser portal downloads > Humans > hg38 > Full data set > hg38.fa.gz
Save the genome reference in your working folder today for ease of access. You can use wget [LINK] to download it directly from the command line.
(note: the file extension should be '.fa.gz', you can unzip the fasta file using gunzip but if you merely wish to inspect it, less will do!)

Later, we will need the human genome reference for sequence alignment (more on that later...) which requires us to compress the reference into an FM-index. This takes a while (>40 mins on a personal computer), so it is better to start now...

...and for this, we need to run BWA

Question:

Only underline the blue box once you have tried to work out an answer yourself and want to check it. When you are ready, run bwa

Which bwa command can you write to create an FM-index of the genome reference? (note: You would only ever need to run this once for a given reference, not for every analysis)

bwa index hg38.fa.gz
(hint: for available bwa commands, check the documentation or type bwa in the command line, followed by nothing)

FASTA/FASTQ format

For this step, we will download a pair of 'small' FASTQ files, each containing either read of various pairs (i.e. paired-end reads with IDs ending in :1 and :2).

Question:

On opening the file and looking at the first few lines, can we ascertain whether the Phred quality score encoding is Phred+33 or Phred+64? If so, how?
Can you convert the first five quality score of the first entry back into numbers (you would never need to do this by hand, but it helps to understand).

Here is the guide mentioned earlier:

Phred conversion table from wiki

If you're still stuck, consider this: do Phred+33 and Phred+64 strictly use the same ASCII characters?

First step: Quality Control

We've previously mentioned FastQC and PRINSEQ. Another tool that provides NGS quality control is NGS QC toolkit. How does it compare to FastQC? Let's take a look.

Instructions

NGS QC toolkit
FastQC

Question:

Which metrics are similar between tools?
Which metrics are different between tools?

Do not worry if you are not able to complete this section before the end of the morning session. It is more important that you start the next section if the afternoon session has begun. You can always return to this section later.

Sequence alignment with BWA

Various tools exist for fast alignment, each with its own strengths and weaknesses. For this practical we will continue to use bwa (Burrows-Wheeler Aligner).
We already downloaded the HG38 reference this morning and created a compressed version of it (an FM index, notice the .amp , .ann , .bwt , .pac , .sa files in your work directory).

Now we are ready to run BWA once more!

Questions:

Only underline the blue box once you have tried to work out an answer yourself and want to check it. When you are ready, run bwa

Write a command to generate an alignment in SAM format from paired-end reads

bwa mem hg38.fa.gz HG02621.mini.1.fq HG02621.mini.2.fq > HG02621.sam

Variant calling and preprocessing with Samtools and Picard

In the previous section, we generated a SAM file, which is an alignment (to HG38) of our reads. We now want to compress our file to save up hard disk space and prepare it for variant calling. For this, we use Samtools .

Questions:

For this afternoon session, answers will not be provided, although hints can be given. The best way to find out if your answer is correct is to run samtools/picard and see what happens. I will gladly provide answers by e-mail after the end of the course if needed: miossec@well.ox.ac.uk

Which reference sequence should we use in this step? Would any version do? Why?
Using Samtools most recent documentation (or by typing samtools in the command line), write a command to sort the reads in your SAM file. Additionally, you must:
- Compress the file to binary (i.e. BAM) to use less space
- specify output file name
- sort by read name
If you attempt to run the samtools index command on the previous file, you will get an error. Why? What do you need to change?
Make the needed change above and run samtools index
Now using picard (See documentation), mark all duplicates so that they can be ignored in the next steps picard
Returning to samtools and generate a VCF file

You now have a VCF file that you can run through wANNOVAR for further annotation and filtering. Using the default parameters, submit your VCF file for annotation.

The result should be a .csv type file which you can open in Excel or equivalent or in your terminal if you're comfortable with awk, grep and/or sed.
You may now explore how different options leads to shorter manageable lists from which we can derive hypotheses. Use the remaining time today (if any) to explore some of the databases used by wANNOVAR and touched upon briefly this afternoon.

Introduction to Next-Generation Sequencing: from sequencing to variant calling

Downloading the Human Genome Reference(and preparing it for fast read alignment this afternoon)

Instructions:

...and for this, we need to run BWA

Question:

FASTA/FASTQ format

Question:

First step: Quality Control

Instructions

Question:

Sequence alignment with BWA

Now we are ready to run BWA once more!

Questions:

Variant calling and preprocessing with Samtools and Picard

Questions:

This concludes the practical part of this course. Good luck with any of your future NGS analyses!

Downloading the Human Genome Reference
(and preparing it for fast read alignment this afternoon)