Step 3: Variant Annotation and Filtering Challenge

Annotating variants

Once variants have been called, the next crucial step is to annotate and filter them to prioritize those most likely to be disease-causing. A wide array of tools and databases is available to help us interpret these variants.

Genic Context and Functional Effect

To understand the genic location of a variant and its effect on transcripts and protein structure, the following resources and tools are widely used:

Gene annotation databases:

  • RefSeq Gene
  • UCSC Known Genes
  • Ensembl Gene

Variant effect predictors:

  • Ensembl VEP (Variant Effect Predictor)
  • SnpEff

These tools help determine whether a variant is:

  • Synonymous (no change in amino acid; usually benign)
  • Missense (amino acid change; potentially damaging)
  • Nonsense or frameshift (premature stop codon or disrupted reading frame; often pathogenic)
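
To make these categories concrete, here is a minimal Python sketch that classifies a substitution within a single codon using the standard genetic code. It is an illustration only: the helper name `classify_codon_change` is hypothetical, and real annotators such as VEP or SnpEff work on whole transcripts and report many more consequence types (frameshifts, splice-site changes, and so on).

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a substitution within one codon (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"     # premature stop codon introduced
    if ref_aa == "*":
        return "stop-lost"    # original stop codon replaced by an amino acid
    return "missense"         # a different amino acid is encoded


print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
print(classify_codon_change("GAA", "GTA"))  # Glu -> Val
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

Frameshifts are not covered here because they change the codon boundaries themselves rather than a single codon.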

For missense variants, functional prediction tools evaluate how likely the change is to affect protein structure or function:

…and perhaps soon, AlphaFold-inspired models?

Population Frequency Databases

To filter out common variants unlikely to cause rare diseases, we consult population-level variant databases:

Variants that are too common are typically filtered out. A good working threshold is an allele frequency < 0.001 (0.1%) for rare, potentially pathogenic variants.

Important caveat:

How much a frequency threshold can be trusted depends on the sample size of the database.

In 1000 Genomes (about 2,500 individuals): 0.001 ≈ 2–3 individuals → could still be a sequencing error.

In gnomAD (about 800,000 individuals in v4): 0.001 ≈ 800 individuals → more statistically reliable.
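
The arithmetic behind this caveat can be sketched in a few lines of Python. The cohort sizes below are assumptions (approximate figures for 1000 Genomes phase 3 and gnomAD v4; check each database's release notes), and the function names are hypothetical. The second function also shows the count in allele copies, since each individual contributes two alleles at an autosomal site.

```python
# Approximate cohort sizes (assumptions; check each database's release notes).
COHORT_SIZES = {
    "1000 Genomes (phase 3)": 2_504,
    "gnomAD (v4)": 807_162,
}


def individuals_at_frequency(frequency: float, n_individuals: int) -> float:
    """Number of individuals that this fraction of the cohort represents."""
    return frequency * n_individuals


def allele_copies_at_frequency(frequency: float, n_individuals: int) -> float:
    """Expected allele copies at this frequency (two alleles per person)."""
    return frequency * 2 * n_individuals


for name, size in COHORT_SIZES.items():
    people = individuals_at_frequency(0.001, size)
    copies = allele_copies_at_frequency(0.001, size)
    print(f"{name}: ~{people:.1f} individuals, ~{copies:.0f} allele copies")
```

With only a handful of observations behind it, a frequency estimate from a small cohort is easily shifted by a single miscalled genotype, which is why the same threshold is more trustworthy in a larger database.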

Gathering all the databases in one place

With so many tools and databases, annotation can feel overwhelming. Fortunately, tools like:

…consolidate these resources and allow for automated annotation of VCF files.

In this tutorial, we’ll use wANNOVAR, the web-based version of ANNOVAR.

Before uploading to wANNOVAR, decompress the final file:

gunzip final_genotyped.g.vcf.gz

Then download the resulting final_genotyped.g.vcf and upload it to wANNOVAR.

Fill in the fields on the submit page as follows:

  • Enter institutional email and any identifier.
  • Result duration: 1 day
  • Reference genome: hg19
  • Input format: VCF
  • Gene Definition: RefSeq Gene
  • Individual analysis: All annotations (very important: otherwise wANNOVAR only looks at the first genotype column and the variant is missed!)
  • Disease Model: Rare dominant Mendelian disease

While you wait for the resulting .csv file, take a look at the documentation.

The challenge: Variant interpretation

Recall that Mother and Daughter are affected by a rare autosomal dominant congenital heart defect. Your goal is to identify the disease-causing variant — and possibly, the associated gene and condition.

Some hints
  • The variant should be present in Mother and Daughter, but absent in Father. The columns of the VCF are kept as the very last columns of your .csv file (note: the first genotype column corresponds to Father, followed by Mother and Daughter).
  • For a variant to be disease-causing, it will likely need to change the structure of the corresponding protein, i.e. the variant results in a different amino acid.
  • The disease affecting the family is expected to be very rare. Apply filters so that only variants with a frequency < 1%, or not present in population databases at all, remain (this is an important distinction in this context: an absent variant has no frequency value, so a purely numeric filter may discard it).
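
The three hints above can be combined into one filter. The following is a minimal Python sketch, not the tutorial's official solution. Every column name in it (`ExonicFunc.refGene`, `gnomAD_exome_ALL`, `father`, `mother`, `daughter`) and the effect labels such as `synonymous SNV` are assumptions for illustration, so compare them against the header and values of your own .csv file before reusing anything. The small table embedded in the code is made up.

```python
import csv
import io

# All column names and labels below are assumptions for illustration;
# replace them with the actual headers found in your .csv file.
FREQ_COLUMN = "gnomAD_exome_ALL"
EFFECT_COLUMN = "ExonicFunc.refGene"
NOT_PROTEIN_CHANGING = {"", ".", "unknown", "synonymous SNV"}


def carries_alt(sample_field: str) -> bool:
    """True if the genotype (first ':'-separated item) has a non-reference allele."""
    genotype = sample_field.split(":")[0]
    alleles = genotype.replace("|", "/").split("/")
    # Note: a missing genotype ('./.') counts as "not carrying" here.
    return any(allele not in ("0", ".") for allele in alleles)


def is_rare(value: str, threshold: float = 0.01) -> bool:
    """Keep variants below the threshold OR absent from the database."""
    if value.strip() in ("", "."):
        return True  # absent from the database: no frequency, so keep it
    return float(value) < threshold


def is_candidate(row: dict) -> bool:
    """Apply the three hints to one annotated variant."""
    segregates = (
        not carries_alt(row["father"])
        and carries_alt(row["mother"])
        and carries_alt(row["daughter"])
    )
    changes_protein = row[EFFECT_COLUMN] not in NOT_PROTEIN_CHANGING
    return segregates and changes_protein and is_rare(row[FREQ_COLUMN])


# Tiny made-up table standing in for the annotated output.
EXAMPLE_CSV = """Gene.refGene,ExonicFunc.refGene,gnomAD_exome_ALL,father,mother,daughter
GENE_A,nonsynonymous SNV,.,0/0:30,0/1:28,0/1:31
GENE_B,synonymous SNV,.,0/0:30,0/1:28,0/1:31
GENE_C,nonsynonymous SNV,0.25,0/0:30,0/1:28,0/1:31
GENE_D,stopgain,0.0001,0/1:30,0/1:28,0/1:31
"""

rows = csv.DictReader(io.StringIO(EXAMPLE_CSV))
candidates = [row["Gene.refGene"] for row in rows if is_candidate(row)]
print(candidates)
```

The key design choice is in `is_rare`: a missing frequency is treated as "rare" rather than being dropped, which is the distinction the third hint points at. The same filters can of course be applied by hand in a spreadsheet.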

Once you have a shortlist, look up associated genes using resources like: