Getting set up

In this tutorial, we will use a set of command-line tools to detect genetic variation in aligned next-generation sequencing data:

  • bcftools, a popular tool for variant calling, filtering and summarising; part of the SAMtools suite introduced in the previous session
  • GATK, a comprehensive toolkit for variant discovery and genotyping in next-generation sequencing data; considered the gold standard for germline short variant calling

For the purpose of today's tutorial we have pre-installed this software onto the JupyterHub instance.

We will also use a web-based tool for variant annotation:

  • wANNOVAR, a web interface for ANNOVAR that provides easy access to all its pre-built databases without the need to download or manage the underlying data (over 100 GB)

Logging in to JupyterHub

For this practical we'll use the MSc JupyterHub site again, so you should already have a username and password. Go to https://mscjupyter.bmrc.ox.ac.uk now and log in using the username and password you set in previous sessions.

Note

Your JupyterHub instance is private to you. You can log out and log back in at any time and your session should still be in place.

The two key pieces of software we'll use here are bcftools and gatk. It's possible to install these on your laptop as well, but we'll use JupyterHub for now for simplicity.
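Before going further, it can be worth confirming that the tools are visible from your Terminal. The following is a minimal sketch: check_tools is a helper name made up for this illustration (it is not part of bcftools or GATK), and it only tests whether each command can be found on your PATH.

```shell
# check_tools is a hypothetical helper used only for this sketch.
# It reports whether each named command can be found on the PATH
# and returns a non-zero status if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools bcftools gatk samtools || echo "Some tools were not found on your PATH."
```

If the tools are found, running bcftools --version and gatk --version should print the installed versions.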

Getting the data

This tutorial is divided into two sections. You should already have the data you need for the first section from session CM4.4. For the second section, we will use a new set of aligned next-generation sequencing files.

Getting the new data

To get these new files, open a Terminal on the JupyterHub instance and download the archive called trio_bams.tgz. Let's use wget this time:

wget https://www.chg.ox.ac.uk/~miossec/courses/Ludwig/trio_bams.tgz

This will probably take a minute or two to download. Once the download is finished, extract its contents:

tar -xzf trio_bams.tgz

You should now have a folder called trio_bams. The tarball is no longer needed, so we can delete it. We won't use this data straight away, but you can check the contents of the folder with ls.

rm trio_bams.tgz
ls trio_bams
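If you would like a slightly more informative listing than ls, the sketch below prints each BAM file in a folder and says whether an index sits next to it. It rests on assumptions not stated in this tutorial: that the BAM files sit directly inside trio_bams, and that any index is named either sample.bam.bai or sample.bai. The helper name list_bams is made up for this illustration.

```shell
# list_bams is a hypothetical helper used only for this sketch.
# It lists each BAM in a folder and reports whether an index is alongside it.
# Assumes indexes are named either sample.bam.bai or sample.bai.
list_bams() {
    dir=$1
    for bam in "$dir"/*.bam; do
        # If the glob matched nothing, the literal pattern is left behind.
        [ -e "$bam" ] || { echo "no BAM files in $dir"; return 1; }
        if [ -e "$bam.bai" ] || [ -e "${bam%.bam}.bai" ]; then
            echo "$bam (indexed)"
        else
            echo "$bam (no index found)"
        fi
    done
}

list_bams trio_bams || true
```

This only looks at file names. It does not check that the BAM files themselves are intact.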

Returning to last session's data

Let’s begin by returning to the data from the previous session. Navigate to the relevant directory:

cd sequence_data_analysis/human

Note

Press the Tab key to autocomplete directory and file names.

You should see your aligned and sorted BAM file and its index from last session:

NA12878.bam
NA12878.bam.bai

You’ll also find the reference genome file GRCh38_chr19.fa and its associated index files:

GRCh38_chr19.fa
GRCh38_chr19.fa.amb
GRCh38_chr19.fa.ann
GRCh38_chr19.fa.bwt
GRCh38_chr19.fa.fai
GRCh38_chr19.fa.pac
GRCh38_chr19.fa.sa
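Rather than scanning the listing by eye, you could check for the files needed in this session with a short loop. This is a minimal sketch: missing_inputs is a name invented for this example, and it checks only the BAM, its index, the reference FASTA and the .fai index. The remaining index files listed above are not checked.

```shell
# missing_inputs is a hypothetical helper used only for this sketch.
# It prints the name of each expected input that is absent from the
# current directory, and prints nothing when all of them are present.
missing_inputs() {
    for f in NA12878.bam NA12878.bam.bai \
             GRCh38_chr19.fa GRCh38_chr19.fa.fai; do
        [ -e "$f" ] || echo "$f"
    done
}

missing_inputs
```

No output means all four files were found in the current directory.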

Don't worry if you can only find the reference genome file: we only need the .fai index for calling variants and this can be quickly generated using SAMtools:

# Only run this step if your .fai is missing
samtools faidx GRCh38_chr19.fa
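The "only if missing" condition can also be written into the command itself, so that it is safe to run whether or not the index exists. The following is a sketch rather than a required step: ensure_fai is a hypothetical helper name, and it treats an empty .fai file as missing.

```shell
# ensure_fai is a hypothetical helper used only for this sketch.
# It runs samtools faidx only when the .fai index is absent or empty.
ensure_fai() {
    fasta=$1
    if [ ! -e "$fasta" ]; then
        echo "$fasta not found"
        return 1
    fi
    if [ -s "$fasta.fai" ]; then
        echo "index already present: $fasta.fai"
    else
        samtools faidx "$fasta"
    fi
}

ensure_fai GRCh38_chr19.fa || echo "index step did not complete"
```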

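If you no longer have any of the above files, you can retrieve a copy using wget. The command below recreates the sequence_data_analysis folder inside your current directory, so run it from the directory where you want that folder to live: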

wget -r -np -nH --cut-dirs=3 -R "index.html*" https://www.chg.ox.ac.uk/~miossec/courses/Ludwig/sequence_data_analysis/

Now you're ready to start.