Step 2: GATK HaplotypeCaller
The data and scenario
We now move on to the second section of our tutorial, where we'll work with a family trio conveniently labeled Father, Mother and Daughter. For simplicity, only Chromosome 20 is included for each sample.
We will explore a scenario in which both the Mother and Daughter are affected by a complex congenital heart defect following an autosomal dominant pattern of inheritance. Our goal is to call variants in all three individuals and attempt to identify a potential disease-causing variant, if one exists.
Let’s return to your starting directory by running:
cd
Our files of interest are in the folder trio_bams. Unlike our previous example, these samples were aligned to the full hg19/b37 reference genome.
Could we use a different reference genome for alignment and variant calling steps? Why or why not?
It is common practice for widely used resources like genome references to be shared across research groups. For this tutorial, we will use a pre-existing reference available in the JupyterHub environment:
ls /home/common/NGS_hg19_ref
Among the files listed, you will notice hs37d5.dict. This is a sequence dictionary required specifically by GATK, similar in purpose to the .fai index used by SAMtools.
The following command is used to create a GATK-compatible sequence dictionary:
gatk CreateSequenceDictionary -R hs37d5.fa -O hs37d5.dict
As this is a one-time setup step and the dictionary already exists, there is no need to run this command.
GATK short germline variant detection step-by-step
Our first step with GATK involves running HaplotypeCaller on each of our samples individually. While it is technically possible to run them jointly (as is also the case with BCFtools), most large-scale projects favor this two-step approach, so we will follow suit.
1. GATK HaplotypeCaller run on each sample.
A typical HaplotypeCaller command looks like this:
gatk --java-options "-Xmx2G" HaplotypeCaller -R path/to/reference.fa -I path/to/input.bam -O output.g.vcf.gz
Here, -R, -I, and -O are used to specify the reference genome, input BAM file, and output VCF file, respectively. The -ERC GVCF flag tells GATK to produce a Genomic VCF (GVCF), which includes records for every genomic position. The -L 20 argument restricts processing to Chromosome 20.
To run HaplotypeCaller on each sample, we can use a simple FOR loop. The concept here is not too dissimilar to what we saw in our Introduction to Python sessions, with $file a variable taking on each of the BAM files as a value in turn:
for file in `ls trio_bams/*.bam` ; do gatk --java-options "-Xmx2G" HaplotypeCaller -R /home/common/NGS_hg19_ref/hs37d5.fa -I $file -O ${file%%.bam}.g.vcf.gz -L 20 -ERC GVCF ; done
To generate output filenames consistent with the input, we strip the .bam extension using parameter expansion: ${file%%.bam} and append .g.vcf.gz. We use %% to match the suffix. This is pattern-based suffix removal. We can do the same with prefixes using ##.
Each HaplotypeCaller run may take 6–7 minutes. This is a good time to take a short break.
Once complete, you can inspect one of the resulting GVCF files:
zless daughter_chr20.g.vcf.gz
AAs with earlier mpileup outputs, you’ll see entries for positions that match the reference. However, GVCFs summarize non-variant regions using blocks. Each block includes an END= tag that marks the endpoint of the reference block.
20 1 . N <NON_REF> . . END=60965 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
20 60966 . A <NON_REF> . . END=60992 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,30
20 60993 . G <NON_REF> . . END=61016 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,49
2. Combine GVCFs for our samples
We can now merge the GVCF files from the three samples:
gatk --java-options "-Xmx2g" CombineGVCFs -R /home/common/NGS_hg19_ref/hs37d5.fa --variant father_chr20.g.vcf.gz --variant mother_chr20.g.vcf.gz --variant daughter_chr20.g.vcf.gz -O all_samples_merged.g.vcf.gz
3. Genotype again across all samples
Next, we perform joint genotyping to generate final genotype calls:
gatk --java-options "-Xmx2g" GenotypeGVCFs -R /home/common/NGS_hg19_ref/hs37d5.fa --variant all_samples_merged.g.vcf.gz -O final_genotyped.g.vcf.gz
These final steps are relatively quick since most computation was done during the initial HaplotypeCaller runs. While the individual and combined GVCF files are large, the final VCF output contains only variant sites and is much smaller.
To view your multi-sample VCF file:
zless final_genotyped.g.vcf
In the resulting VCF, you may notice some genotypes appear as ./.. This indicates that no reads were present at these loci in the given sample, and therefore no genotype could be confidently called.
the absence of evidence (for a given variant) is not the same as evidence of absence.
With genotypes available for each sample at every variant site, we are now ready to begin filtering for variants that fit our hypothesised dominant inheritance pattern. For this, it would be useful to annotate our data.
In the next section, we’ll use the ANNOVAR web interface to perform variant annotation.