Getting set up
In this tutorial, we will use a set of command-line tools to detect genetic variation in aligned next-generation sequencing data:
- bcftools, a popular tool for variant calling, filtering and summarising; part of the SAMtools suite introduced in the previous session
- GATK, a comprehensive toolkit for variant discovery and genotyping in next-generation sequencing data; considered the gold standard for germline short variant calling
For the purpose of today's tutorial we have pre-installed this software onto the JupyterHub instance.
We will also use a web-based tool for variant annotation:
- wANNOVAR, a web interface for ANNOVAR that provides easy access to all its pre-built databases without the need to download or manage the large underlying datasets (over 100 GB)
Logging in to JupyterHub
For this practical we'll use the MSc JupyterHub site again; you should already have a username and password. Go to https://mscjupyter.bmrc.ox.ac.uk now and log in using the username and password you set in previous sessions.
Your JupyterHub instance is private to you. You can log out and log back in at any time and your session should still be in place.
The two key pieces of software we'll use here are bcftools and gatk. It's possible to install these on your laptop as well, but we'll use JupyterHub for now for simplicity.
Getting the data
This tutorial is divided into two sections. You should already have the data you need for the first section from the session CM4.4. For the second section, we will use a new set of aligned next-generation sequencing files.
Getting the new data
To get these new files, open a Terminal on the JupyterHub instance and download the file called trio_bams.tgz. Let's use wget this time:
wget https://www.chg.ox.ac.uk/~miossec/courses/Ludwig/trio_bams.tgz
This will probably take a minute or two to download. Once the download is finished, extract its contents:
tar -xzf trio_bams.tgz
You should now have a folder called trio_bams. Since its contents have been extracted, we can delete the tarball. We won't use this data straight away, but you can check the contents of the folder with ls.
rm trio_bams.tgz
ls trio_bams
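If you'd rather not delete the tarball before knowing the extraction worked, the two steps can be combined. The following is a minimal sketch, not part of the course material: extract_and_clean is a hypothetical helper name, and it assumes the archive sits in the current directory.

```shell
# extract_and_clean is a hypothetical helper (not a course-provided command).
# It extracts a gzipped tarball into the current directory and deletes the
# archive only if tar reports success.
extract_and_clean() {
    archive="$1"
    if [ ! -f "$archive" ]; then
        echo "Archive not found: $archive" >&2
        return 1
    fi
    if tar -xzf "$archive"; then
        rm "$archive"
    else
        echo "Extraction failed; keeping $archive" >&2
        return 1
    fi
}

# Example use for this tutorial (run from the directory holding the tarball):
#   extract_and_clean trio_bams.tgz && ls trio_bams
```

The point of the design is simply that a failed or interrupted download leaves the tarball in place, so you can inspect it or download it again.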
Returning to last session's data
Let’s begin by returning to the data from the previous session. Navigate to the relevant directory:
cd sequence_data_analysis/human
Press the Tab key to autocomplete directory and file names.
You should see your aligned and sorted BAM file and its index from last session:
NA12878.bam
NA12878.bam.bai
You’ll also find the reference genome file GRCh38_chr19.fa and its associated index files:
GRCh38_chr19.fa
GRCh38_chr19.fa.amb
GRCh38_chr19.fa.ann
GRCh38_chr19.fa.bwt
GRCh38_chr19.fa.fai
GRCh38_chr19.fa.pac
GRCh38_chr19.fa.sa
Don't worry if you can only find the reference genome file: we only need the .fai index for calling variants and this can be quickly generated using SAMtools:
# Run only if .fai is missing
samtools faidx GRCh38_chr19.fa
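The "only if missing" condition can also be written into the command itself. This is a small sketch under the assumption that samtools is on your PATH (as it is on the JupyterHub instance); ensure_fai is a hypothetical helper name introduced here for illustration.

```shell
# ensure_fai is a hypothetical helper (not part of SAMtools).
# It builds the .fai index for a FASTA file only when the index
# does not already exist next to it.
ensure_fai() {
    ref="$1"
    if [ -f "${ref}.fai" ]; then
        echo "Index already present: ${ref}.fai"
    else
        samtools faidx "$ref"
    fi
}

# Example use for this tutorial:
#   ensure_fai GRCh38_chr19.fa
```

Running it twice is harmless: the second call just reports that the index is already there.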
If any of the above files are missing, don't worry: you can retrieve a fresh copy using wget:
wget -r -np -nH --cut-dirs=3 -R "index.html*" https://www.chg.ox.ac.uk/~miossec/courses/Ludwig/sequence_data_analysis/
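To confirm that everything listed above is in place, a short check along the following lines may help. It is only a sketch: report_missing is a hypothetical helper name, and the file names in the example are the ones from this tutorial.

```shell
# report_missing is a hypothetical helper (not a course-provided command).
# Usage: report_missing DIR FILE [FILE ...]
# It prints one line per expected file that is absent from DIR and
# returns non-zero if anything is missing.
report_missing() {
    dir="$1"
    shift
    missing=0
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example use from inside sequence_data_analysis/human:
#   report_missing . NA12878.bam NA12878.bam.bai GRCh38_chr19.fa GRCh38_chr19.fa.fai
```

No output means all the listed files were found.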
Now you're ready to start.