
Step 4: Filtering and Interpretation

The CSV file you now have contains the same list of variants that you obtained by running GATK HaplotypeCaller. The only difference is the additional annotation. Crucially, the annotation is going to help us filter the variants down to a manageable list of candidates!

Start by loading the data into R. You'll also need to load dplyr.

library(dplyr)

annot.data <- read.csv("path/to/query.output.exome_summary.csv")

Then interrogate the resulting tabular data:

##What columns are we working with?
colnames(annot.data)

##What form do values in various columns take?
head(annot.data)

##Looking at all the options in columns with categorical information
unique(annot.data$ExonicFunc.refGeneWithVer)

If you are using RStudio, you might also find View(annot.data) quite useful.
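
unique() tells you which categories exist, but not how many variants fall into each. A minimal sketch of counting them with dplyr's count(), using a small made-up data frame in place of annot.data (the values are invented for illustration only):

```r
library(dplyr)

# Toy stand-in for annot.data; these values are invented for illustration.
toy <- data.frame(
  ExonicFunc.refGeneWithVer = c("synonymous_SNV", "nonsynonymous_SNV",
                                "nonsynonymous_SNV", "stopgain", "."),
  stringsAsFactors = FALSE
)

# One row per category, with the number of variants in column n,
# most frequent category first
counts <- toy %>%
  count(ExonicFunc.refGeneWithVer, sort = TRUE)

counts
```

Running the same count() call on annot.data gives a quick sense of how much each filter on that column will remove.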

The challenge: Variant interpretation

Recall that Mother and Daughter are affected by a rare autosomal dominant congenital heart defect. Your goal is to identify the disease-causing variant — and possibly, the associated gene and condition.

You should already be familiar with how to filter data in R, but here is a reminder of what that can look like:

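f.data <- annot.data %>%
  filter(
    # drop synonymous SNVs
    ExonicFunc.refGeneWithVer != "synonymous_SNV",
    # father's genotype (first genotype column) is homozygous reference
    grepl("^0/0", Otherinfo13),
    # absent from gnomAD (".") or rare; convert to numeric so that values
    # like "8e-06" are compared as numbers, not as text
    gnomad211_exome_AF == "." |
      suppressWarnings(as.numeric(gnomad211_exome_AF)) < 0.001,
    # SIFT predicts deleterious
    SIFT_pred == "D"
  )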

In this example, we've eliminated synonymous SNVs, kept only variants that are absent in the father, and kept only variants that are very rare (MAF < 0.001). Note that "." signals that a variant is absent from a database, possibly a by-product of how rare the variant is. We also keep only variants that SIFT predicts to be deleterious ("D").
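
One thing to watch when filtering on frequency: because "." is text, R will typically read a frequency column that contains it as character rather than numeric. Comparing text against a number is then done alphabetically, so a value written in scientific notation such as "8.2e-06" would not count as smaller than 0.001. A minimal sketch of converting the column to numeric first, using a small made-up data frame (the variant names and values are invented for illustration):

```r
library(dplyr)

# Toy stand-in for annot.data; these values are invented for illustration.
toy <- data.frame(
  variant = c("v1", "v2", "v3", "v4"),
  gnomad211_exome_AF = c("0.25", "8.2e-06", ".", "0.0004"),
  stringsAsFactors = FALSE
)

rare <- toy %>%
  # "." cannot be converted and becomes NA; that warning is expected,
  # so it is suppressed here
  mutate(af = suppressWarnings(as.numeric(gnomad211_exome_AF))) %>%
  # keep variants absent from the database (NA) or rarer than 0.1%
  filter(is.na(af) | af < 0.001)

rare$variant
```

Here the common variant v1 is dropped, while the two rare variants and the one absent from the database are kept.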

This is just an example. More filters will have to be applied in order to get to the variant(s) of interest.

Some hints
  • The variant should be present in Mother and Daughter, but absent in Father. The columns present in the VCF are still present in your CSV file as the very last columns (note: the first genotype column corresponds to Father, followed by Mother and Daughter).
  • For a variant to be disease-causative, it will likely need to change the structure of the corresponding protein, i.e. a new amino acid is produced as a result of the variant.
  • The disease affecting the family is expected to be very rare indeed. Apply filters so that only variants with a frequency <0.1%, or variants not present in population databases at all, remain (this is an important distinction in this context: how is an absence recorded in our annotated data?).
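
The first hint can be sketched as a genotype filter. This is only an illustration under assumptions: the example above uses Otherinfo13 for Father, so the sketch assumes Mother and Daughter are in Otherinfo14 and Otherinfo15 (check colnames(annot.data) for your own file). The helper carries_alt and the toy genotype strings are invented for this example.

```r
library(dplyr)

# Toy stand-in for annot.data; genotype strings are invented for illustration.
# Column names Otherinfo14 / Otherinfo15 are an assumption about the file layout.
toy <- data.frame(
  variant     = c("v1", "v2", "v3", "v4"),
  Otherinfo13 = c("0/0:30,0:30",  "0/1:15,14:29", "0/0:28,0:28",  "0/0:25,0:25"),  # Father
  Otherinfo14 = c("0/1:12,11:23", "0/1:14,13:27", "0/1:10,12:22", "0/0:27,0:27"),  # Mother
  Otherinfo15 = c("0/1:13,12:25", "0/1:16,15:31", "0/0:26,0:26",  "0/1:11,13:24"), # Daughter
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the genotype field starts with a call
# that includes the alternate allele (heterozygous or homozygous alt)
carries_alt <- function(gt) grepl("^(0/1|1/1)", gt)

shared <- toy %>%
  filter(
    grepl("^0/0", Otherinfo13),  # Father is homozygous reference
    carries_alt(Otherinfo14),    # Mother carries the variant
    carries_alt(Otherinfo15)     # Daughter carries the variant
  )

shared$variant
```

Only v1 survives: v2 is also present in Father, v3 is not passed on to Daughter, and v4 is absent in Mother.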

Did you find a good candidate? Remember we're looking for a congenital heart defect associated gene. Perhaps the variant is in a known gene? Is there clinical data in our annotation as well?

Need a little more help?

Once you have a shortlist, look up associated genes using resources like: