
Practical outline

Overview

In this tutorial we will demonstrate a basic pipeline for analysing paired-end short-read genomic sequencing data. We will start with raw reads in FASTQ files, inspect quality control metrics, align the reads to a reference genome, and then use the alignments to look for genetic variation.

The data

You should now have a folder called sequence_data_analysis containing a number of data files. Now is a good point to explore what's in there. The folder contains:

  • Sequence reads from a human genomic DNA sample (in human/).

  • The human reference genome file (in human/GRCh37.fa). This is in FASTA format.

  • A similar set of sequence reads and a reference genome for a malaria sample (in the malaria/ folder and reference/Pf3D7_v3.fa). You don't need to look at this now, but you can come back and analyse that data later if you like.
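One way to explore the folder is to list every file with its size. The following is a minimal sketch in Python, assuming you run it from the directory that contains sequence_data_analysis; the helper name `list_data_files` is made up for illustration and is not part of the practical.

```python
from pathlib import Path


def list_data_files(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Assumption: the data folder sits in the current working directory.
    folder = Path("sequence_data_analysis")
    if folder.is_dir():
        for name, size in list_data_files(folder):
            print(f"{size:>15,d}  {name}")
    else:
        print(f"{folder} not found; run this from the directory that contains it")
```

The FASTQ and FASTA files are likely to be the largest entries, so sorting out which file is which at this stage saves confusion later.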

Note. In case you get stuck, I have placed a set of 'results' files online for the steps in the practical.

During the practical we'll bring one or more of these datasets to an analysis-ready state. The tutorial uses the human data in:

human/NA12878-read1.fastq.gz
human/NA12878-read2.fastq.gz

These files contain genome sequence reads from a sample called NA12878 (also known as HG001), generated on the Illumina NovaSeq 6000 platform.
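To get a feel for what these files contain, it can help to print the first few records. The sketch below assumes the usual four-line FASTQ layout (header, sequence, '+' separator, qualities) with no wrapped sequence lines, which is typical of Illumina output but is not checked here in full. The function name `read_fastq` is made up for illustration.

```python
import gzip
import os


def read_fastq(path, limit=None):
    """Yield (name, sequence, qualities) from a gzipped four-line-per-record FASTQ file."""
    with gzip.open(path, "rt") as handle:
        count = 0
        while limit is None or count < limit:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a blank line)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qualities = handle.readline().rstrip("\n")
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {header!r}")
            yield header[1:], sequence, qualities
            count += 1


if __name__ == "__main__":
    # Paths as given in the tutorial, relative to sequence_data_analysis/.
    for path in ("human/NA12878-read1.fastq.gz", "human/NA12878-read2.fastq.gz"):
        if not os.path.exists(path):
            print(f"{path} not found; run this from inside sequence_data_analysis/")
            continue
        print(path)
        for name, sequence, qualities in read_fastq(path, limit=2):
            print(f"  {name}")
            print(f"  {sequence}")
            print(f"  {qualities}")
```

Because the data are paired-end, the first record in the read1 file and the first record in the read2 file should come from the same DNA fragment, so their read names should correspond.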

Note

"NA12878" is the codename for an individual who was originally recruited to the 1000 Genomes Project. At that time an immortalised cell line was made so that the sample could be studied in multiple projects.

NA12878 is now often used as a reference sample, for example to test sequencing methods. It is one of the samples studied in the Genome in a Bottle project.