Step 3: Variant Annotation and Filtering challenge.
Annotating variants
Once variants have been called, the next crucial step is to annotate and filter them to prioritize those most likely to be disease-causing. A wide array of tools and databases is available to help us interpret these variants.
Genic Context and Functional Effect
To understand the genic location of a variant and its effect on transcripts and protein structure, the following resources and tools are widely used:
Gene annotation databases:
- RefSeq Gene
- UCSC Known Genes
- Ensembl Gene
Variant effect predictors:
- Ensembl VEP (Variant Effect Predictor)
- SnpEff
These tools help determine whether a variant is:
- Synonymous (no change in amino acid; usually benign)
- Missense (amino acid change; potentially damaging)
- Nonsense or frameshift (premature stop codon or disrupted reading frame; often pathogenic)
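To make the three categories above concrete, here is a minimal sketch of how a single-codon change could be classified using the standard genetic code. It is an illustration only, not how VEP or SnpEff work internally; the function names `classify_substitution` and `is_frameshift` are hypothetical, and the extra `stop_lost` label is added here only so that every input gets an answer.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change (hypothetical helper, for illustration)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid
    if alt_aa == "*":
        return "nonsense"     # new premature stop codon
    if ref_aa == "*":
        return "stop_lost"    # not in the list above; included for completeness
    return "missense"         # different amino acid


def is_frameshift(ref_allele, alt_allele):
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return (len(ref_allele) - len(alt_allele)) % 3 != 0


print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
print(classify_substitution("GAA", "GTA"))  # Glu -> Val
print(classify_substitution("TGG", "TGA"))  # Trp -> stop
print(is_frameshift("A", "AT"))             # 1-bp insertion
print(is_frameshift("A", "ATGC"))           # 3-bp insertion, frame preserved
```

Real annotators also need transcript coordinates, strand and splice-site information, which this sketch deliberately ignores.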
For missense variants, functional prediction tools evaluate how likely the change is to affect protein structure or function:
…and perhaps soon, AlphaFold-inspired models?
Population Frequency Databases
To filter out common variants unlikely to cause rare diseases, we consult population-level variant databases:
- dbSNP: A broad collection of known single nucleotide polymorphisms.
- 1000 Genomes Project: Includes ~2,500 individuals from diverse populations.
- gnomAD (Genome Aggregation Database): Contains data from over 800,000 individuals and is more robust for assessing variant frequency.
Variants that are too common are typically filtered out. A good working threshold is an allele frequency < 0.001 (0.1%) for rare, potentially pathogenic variants.
The frequency threshold depends on the sample size of the database.
In 1000 Genomes: 0.001 ≈ 2–3 individuals → could still be a sequencing error.
In gnomAD: 0.001 ≈ ~800 individuals → more statistically reliable.
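The arithmetic behind these two figures can be sketched as follows. The first number reproduces the simple product used above (frequency times number of individuals). The second is an added assumption: because each person carries two copies of an autosomal site, the expected number of allele copies is twice that product. The function name `expected_counts` is hypothetical, and the cohort sizes are the approximate ones quoted in the text.

```python
def expected_counts(allele_frequency, n_individuals):
    """Return (frequency x individuals, expected allele copies in a diploid cohort)."""
    individuals_equivalent = allele_frequency * n_individuals
    allele_copies = allele_frequency * 2 * n_individuals  # two alleles per person
    return individuals_equivalent, allele_copies


# Approximate cohort sizes taken from the text above.
for name, n in [("1000 Genomes", 2_500), ("gnomAD", 800_000)]:
    individuals, copies = expected_counts(0.001, n)
    print(f"{name}: ~{individuals:g} individuals' worth, ~{copies:g} allele copies")
```

Either way, the conclusion is the same: a handful of observations in a small cohort is hard to tell apart from sequencing error, while hundreds of observations in a large cohort are not.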
Gathering all the databases in one place
With so many tools and databases, annotation can feel overwhelming. Fortunately, tools like:
…consolidate these resources and allow for automated annotation of VCF files.
In this tutorial, we’ll use wANNOVAR, the web-based version of ANNOVAR.
Before uploading to wANNOVAR, decompress the final file:
gunzip final_genotyped.g.vcf.gz
Then download the resulting final_genotyped.g.vcf and upload it to wANNOVAR.
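Before uploading, it can be worth confirming the order of the sample columns in the VCF, since the genotype columns are interpreted by position later on. The sketch below reads the `#CHROM` header line and returns the sample names. The helper name `sample_names` and the sample labels in the demo are assumptions for illustration; your file may use different names.

```python
import io


def sample_names(vcf_handle):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_handle:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first 9 columns are fixed (CHROM ... FORMAT); samples follow.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


# Demo on a tiny in-memory VCF; the sample labels are hypothetical.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFather\tMother\tDaughter\n"
    "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t0/1\n"
)
print(sample_names(demo_vcf))

# On the real file, something like this should work:
# with open("final_genotyped.g.vcf") as handle:
#     print(sample_names(handle))
```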
Fill in the fields on the submit page as follows:
- Enter your institutional email and any identifier.
- Result duration: 1 day
- Reference genome: hg19
- Input format: VCF
- Gene Definition: RefSeq Gene
- Individual analysis: All annotations (very important: otherwise only the first genotype is considered and the variant is missed!)
- Disease Model: Rare dominant Mendelian disease.
While you're waiting for your resulting .csv, take a look at the documentation.
The challenge: Variant interpretation
Recall that Mother and Daughter are affected by a rare autosomal dominant congenital heart defect. Your goal is to identify the disease-causing variant — and possibly, the associated gene and condition.
- The variant should be present in Mother and Daughter, but absent in Father. The columns from the VCF are carried over into your CSV file as the very last columns (note: the first genotype column corresponds to Father, followed by Mother and Daughter).
- For a variant to be disease-causative, it will likely need to change the structure of the corresponding protein, i.e. a new amino acid is produced as a result of the variant.
- The disease affecting the family is expected to be very rare. Apply filters so that you keep only variants with a population frequency < 1% or that are not present in population databases at all (this is an important distinction in this context: a variant with no recorded frequency must not be discarded by a numeric filter).
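The three criteria above can be combined into a single filter, sketched below with Python's standard `csv` module. Everything about the table layout is an assumption for illustration: the column names (`Gene`, `ExonicFunc`, `PopFreq`, `Father`, `Mother`, `Daughter`), the consequence labels and the demo rows are hypothetical, so check the header of your own wANNOVAR output and adapt them.

```python
import csv
import io

# Consequence labels assumed to indicate a protein-changing variant.
PROTEIN_CHANGING = {
    "nonsynonymous SNV", "stopgain",
    "frameshift insertion", "frameshift deletion",
}


def carries_alt(genotype_field):
    """True if a VCF-style genotype such as '0/1:28' has a non-reference allele."""
    gt = genotype_field.split(":")[0].replace("|", "/")
    # Missing calls ('.') are treated as not carrying the variant.
    return any(allele not in ("0", ".") for allele in gt.split("/"))


def is_rare(freq_field, threshold=0.01):
    """True if the frequency is below the threshold OR absent from the database."""
    if freq_field.strip() in ("", ".", "NA"):
        return True  # never observed: keep it, do not filter it out
    return float(freq_field) < threshold


def shortlist(rows):
    """Keep variants seen in Mother and Daughter, not Father, rare and protein-changing."""
    keep = []
    for row in rows:
        segregates = (
            not carries_alt(row["Father"])
            and carries_alt(row["Mother"])
            and carries_alt(row["Daughter"])
        )
        if segregates and row["ExonicFunc"] in PROTEIN_CHANGING and is_rare(row["PopFreq"]):
            keep.append(row)
    return keep


# Hypothetical demo table standing in for the annotated CSV.
demo_csv = io.StringIO(
    "Gene,ExonicFunc,PopFreq,Father,Mother,Daughter\n"
    "GENE_A,nonsynonymous SNV,.,0/0:30,0/1:28,0/1:31\n"
    "GENE_B,synonymous SNV,0.0002,0/0:30,0/1:28,0/1:31\n"
    "GENE_C,nonsynonymous SNV,0.25,0/0:30,0/1:28,0/1:31\n"
    "GENE_D,nonsynonymous SNV,0.0005,0/1:30,0/1:28,0/1:31\n"
)
rows = list(csv.DictReader(demo_csv))
print([row["Gene"] for row in shortlist(rows)])
```

Treating a missing frequency as "rare" is the key design choice here: converting the column to numbers and filtering with `< 0.01` alone would silently drop exactly the never-observed variants that are the strongest candidates.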
Once you have a shortlist, look up associated genes using resources like: