Getting set up
In this tutorial, we will use a set of command-line tools to detect genetic variation in aligned next-generation sequencing data:
- bcftools, a popular tool for variant calling, filtering and summarising; part of the SAMtools suite introduced in the previous session
- GATK, a comprehensive toolkit for variant discovery and genotyping in next-generation sequencing data; considered the gold standard for germline short variant calling
For the purpose of today's tutorial we have pre-installed this software onto the JupyterHub instance.
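If you want to confirm that the tools are visible from your Terminal, a minimal sketch is shown below. The function name check_tools is our own hypothetical helper, not part of bcftools or GATK; it relies only on the shell built-in command -v.

```shell
# check_tools: report whether each named program can be found on the PATH.
# (check_tools is a hypothetical helper name, not part of any toolkit.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status is 0 only if every tool was found.
    return "$missing"
}
```

For example, check_tools bcftools samtools gatk would list each tool, assuming the GATK launcher is installed under the name gatk (as it is in GATK4).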
We will also use a web-based tool for variant annotation:
- wANNOVAR, a web interface for ANNOVAR that provides easy access to all its pre-built databases without the need to download or manage the large datasets (over 100 GB)
Logging in to JupyterHub
If you are logging in to JupyterHub for the first time, we will have allocated you a username (usually in the format 'surname_initial'), which the instructor can confirm for you if needed. Enter this username on the login page at https://mscjupyter.bmrc.ox.ac.uk and set a password of your choice (please choose a reasonably secure one that is not the same as your University password). Make sure to remember it, as you will use the same password to access the site in future.
If you have previously logged in, go to https://mscjupyter.bmrc.ox.ac.uk and log in using the username and password you set in previous sessions.
Your JupyterHub instance is private to you. You can log out and log back in at any time and your session should still be in place.
Getting the data
This tutorial is divided into two sections. You should already have the data you need for the first section from the previous session. For the second section, we will use a new set of aligned next-generation sequencing files.
Getting the new data
To get these new files, open a Terminal on the JupyterHub instance and download the file called trio_bams.tgz from the following folder.
Let's use wget this time to download our files:
wget https://www.chg.ox.ac.uk/~miossec/courses/Ludwig/trio_bams.tgz
This will probably take a minute or two to download. Once the download is finished, extract its contents:
tar -xzf trio_bams.tgz
You should now have a folder called trio_bams. Since its contents have been extracted, we can delete the tarball. We won't use this data straight away, but you can check the contents of the folder with ls.
rm trio_bams.tgz
ls trio_bams
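As an alternative to running tar and rm separately, the two steps can be combined so that the tarball is deleted only when extraction succeeds. The sketch below is one way to do this; extract_and_remove is a hypothetical helper name of our own, and it uses only standard tar and rm options.

```shell
# extract_and_remove: unpack a gzipped tarball, deleting it only on success.
# (extract_and_remove is a hypothetical helper name.)
extract_and_remove() {
    archive="$1"
    # tar -tzf lists the contents without extracting; it fails on a
    # truncated or corrupt download.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "archive looks incomplete or corrupt: $archive" >&2
        return 1
    fi
    # && ensures rm runs only if extraction exited without error.
    tar -xzf "$archive" && rm "$archive"
}
```

With this defined, extract_and_remove trio_bams.tgz would replace the separate tar and rm commands, and an interrupted download would leave the tarball in place for inspection.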
Returning to last session's data
Let’s begin by returning to the data from the previous session. Navigate to the relevant directory:
cd sequence_data_analysis/human
Press the Tab key to autocomplete directory and file names.
If you list the directory contents with ls, you should see your aligned and sorted BAM file and its index from last session:
NA12878.bam
NA12878.bam.bai
You’ll also find the reference genome file GRCh38_chr19.fa and its associated index files:
GRCh38_chr19.fa
GRCh38_chr19.fa.amb
GRCh38_chr19.fa.ann
GRCh38_chr19.fa.bwt
GRCh38_chr19.fa.fai
GRCh38_chr19.fa.pac
GRCh38_chr19.fa.sa
Don't worry if you can only find the reference genome file: we only need the .fai index for calling variants and this can be quickly generated using SAMtools:
# Only needed if .fai is missing
samtools faidx GRCh38_chr19.fa
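The "only if missing" logic can be sketched as a small function that builds the index when it is absent and does nothing otherwise. ensure_fai is a hypothetical name of our own; the only external command it calls is samtools faidx, as used above, which writes the index next to the FASTA file with a .fai suffix.

```shell
# ensure_fai: create the FASTA index only if it does not already exist.
# (ensure_fai is a hypothetical helper name.)
ensure_fai() {
    fasta="$1"
    # -s is true when the file exists and is not empty.
    if [ -s "${fasta}.fai" ]; then
        echo "index already present: ${fasta}.fai"
    else
        echo "building index for ${fasta}"
        samtools faidx "$fasta"
    fi
}
```

Running ensure_fai GRCh38_chr19.fa is then safe to repeat: a second call finds the existing index and skips the samtools step.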
Now you're ready to start.