Step 3: Variant Annotation
Annotating variants
Once variants have been called, the next crucial step is to annotate them to make them interpretable. Knowing that a variant is present across several samples at a given position on chromosome 5 is not as useful as knowing that it falls within the first exon of the FLT4 gene and that, in the main gene transcript, it introduces an early stop codon. Suddenly, we know we have a nonsense variant that is potentially damaging. Conversely, many of the variants we have identified will be synonymous, making no change to the resulting amino acid chain (and therefore to the resulting protein).
Having this genic context helps us prioritize those variants most likely to be disease-causing. A wide array of tools and databases is available to help us interpret these variants.
1. Genic Context and Functional Effect
To place variants in their genic context and understand the immediate effect each variant would have on the resulting gene transcripts, we can use the following resources and tools, all based on the same reference genome:
Gene annotation databases:
- RefSeq Gene
- UCSC Known Genes
- Ensembl Gene
Variant effect predictors:
- Ensembl VEP (Variant Effect Predictor)
- SnpEff
These latter two tools help us determine whether a variant is:
- Synonymous (no change in amino acid; usually benign)
- Missense (amino acid change; potentially damaging)
- Nonsense or frameshift (premature stop codon or disrupted reading frame; often pathogenic)
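To make the first three categories concrete, here is a minimal sketch that classifies a single-codon substitution using the standard genetic code. It is only an illustration of the idea, not how VEP or SnpEff work internally: those tools also handle transcripts, strand, splice sites, indels, and frameshifts, none of which are covered here. The function name `classify_codon_change` and the `stop_lost` label are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop) as before
    if alt_aa == "*":
        return "nonsense"     # new premature stop codon
    if ref_aa == "*":
        return "stop_lost"    # original stop codon replaced by an amino acid
    return "missense"         # different amino acid


print(classify_codon_change("CTG", "CTA"))  # Leu -> Leu
print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

Frameshifts cannot be expressed as a codon-to-codon substitution, which is one reason dedicated effect predictors are needed in practice.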
For missense variants, functional prediction tools evaluate how likely the change is to affect protein structure or function:
…and perhaps soon, AlphaFold-inspired models?
2. Population Frequency Databases
To filter out common variants unlikely to cause rare diseases, we can consult population-level variant databases:
- dbSNP: A broad collection of known single nucleotide polymorphisms.
- 1000 Genomes Project: Includes ~2,500 individuals from diverse populations.
- gnomAD (Genome Aggregation Database): Contains data from over 800,000 individuals and is more robust for assessing variant frequency.
Variants that are too common are typically filtered out. A good working threshold is an allele frequency < 0.001 (0.1%) for rare, potentially pathogenic variants.
The lowest reliable frequency threshold is limited by the sample size of the database.
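In 1000 Genomes: 0.001 ≈ 5 allele copies (~2,500 individuals × 2 chromosomes), carried by only a handful of individuals → could still be a sequencing error.
In gnomAD: 0.001 ≈ 1,600 allele copies (~800,000 individuals × 2 chromosomes) → more statistically reliable.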
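The arithmetic behind this comparison can be sketched as follows. The sketch counts allele copies rather than individuals, assuming diploid autosomal sites (two alleles per individual) and that every individual in the database was successfully genotyped at the site, which is an idealization. The cohort sizes are the approximate figures quoted above, and the function names are made up for this example.

```python
def expected_allele_copies(allele_frequency: float, n_individuals: int, ploidy: int = 2) -> float:
    """Expected number of allele copies observed at a given frequency."""
    return allele_frequency * n_individuals * ploidy


def smallest_observable_frequency(n_individuals: int, ploidy: int = 2) -> float:
    """Allele frequency that corresponds to a single observed copy."""
    return 1 / (n_individuals * ploidy)


# Approximate cohort sizes taken from the text above.
for name, n_individuals in [("1000 Genomes", 2_500), ("gnomAD", 800_000)]:
    copies = expected_allele_copies(0.001, n_individuals)
    floor = smallest_observable_frequency(n_individuals)
    print(f"{name}: about {copies:.0f} copies at AF 0.001; a single copy is AF {floor:.1e}")
```

The second function shows why database size sets a floor: a frequency below one copy in the whole cohort simply cannot be observed.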
Gathering all the databases in one place
With so many tools and databases, annotation can feel overwhelming. Fortunately, tools like:
…consolidate these resources and allow for automated annotation of VCF files.
In this tutorial, we’ll use wANNOVAR, the web-based version of ANNOVAR.
Before uploading to wANNOVAR, decompress the final file:
gunzip final_genotyped.g.vcf.gz
Then download the resulting final_genotyped.g.vcf and upload it to wANNOVAR.
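Before uploading, it can be worth checking that the file is intact and contains the samples you expect. The following is a minimal sketch, not part of the wANNOVAR workflow itself: it reads a plain or gzip-compressed VCF and reports the sample names from the #CHROM header line and the number of variant records. The helper name `summarize_vcf` is hypothetical.

```python
import gzip
from pathlib import Path


def summarize_vcf(path):
    """Return (sample_names, n_records) for a plain or gzip-compressed VCF."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    samples, n_records = [], 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Nine fixed columns (CHROM ... FORMAT) precede the sample names.
                samples = line.rstrip("\n").split("\t")[9:]
            elif line.strip():
                n_records += 1  # one data line per variant record
    return samples, n_records


# Only runs if the tutorial's file is present in the working directory.
vcf_path = Path("final_genotyped.g.vcf")
if vcf_path.exists():
    sample_names, n_records = summarize_vcf(vcf_path)
    print(f"{len(sample_names)} samples, {n_records} variant records")
```

If more than one sample is listed, the "Individual analysis" setting on the submit page matters.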
Fill in the fields on the submit page as follows:
- Enter your institutional email and any identifier.
- Result duration: 1 day
- Reference genome: hg19
- Input format: VCF
- Gene Definition: RefSeq Gene
- Individual analysis: All annotations (very important: otherwise only the first genotype is examined and the variant may be missed!)
- Disease Model: Rare dominant Mendelian disease
While you're waiting for your resulting .csv, take a look at the documentation.
Once you have your .csv, it's time to filter down your list to the most likely candidates!
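As a starting point, the two criteria discussed in this step (a rare allele frequency and a non-synonymous effect) can be applied programmatically. The sketch below runs on a small in-memory table rather than a real result file. The column names `Gene`, `ExonicFunc`, and `PopFreq`, the placeholder genes `GENE_A` to `GENE_C`, and the treatment of "." as "not found in the database" are all assumptions for illustration, so check them against the header of your own .csv.

```python
import csv
import io

# Toy stand-in for an annotated .csv; column names and rows are hypothetical.
EXAMPLE_CSV = """Gene,ExonicFunc,PopFreq
FLT4,stopgain,.
GENE_A,synonymous SNV,0.25
GENE_B,nonsynonymous SNV,0.0004
GENE_C,nonsynonymous SNV,0.03
"""

MAX_ALLELE_FREQUENCY = 0.001          # working threshold from this tutorial
BENIGN_EFFECTS = {"synonymous SNV"}   # effects we choose to discard


def is_candidate(row):
    """Keep rare variants whose predicted effect is not synonymous."""
    freq_text = row["PopFreq"].strip()
    # Assumption: "." or an empty cell means the variant is absent from the
    # population database, so it is treated as rare.
    frequency = 0.0 if freq_text in {"", "."} else float(freq_text)
    return frequency < MAX_ALLELE_FREQUENCY and row["ExonicFunc"] not in BENIGN_EFFECTS


def filter_candidates(csv_text):
    """Return the rows of an annotated CSV that pass both filters."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_candidate(row)]


for row in filter_candidates(EXAMPLE_CSV):
    print(row["Gene"], row["ExonicFunc"], row["PopFreq"])
```

Treating missing frequencies as rare is a deliberate choice here: a variant never seen in a large population database is exactly the kind of candidate a rare-disease analysis should keep.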