
Linking exons to variants

Finally, let's take a closer look at the variant annotation file. This file summarises genetic variation and is often produced as a result of GWAS or population analyses - see, for example, the 1000 Genomes Project.

Getting the data

Let's start by downloading the data...

curl -O http://ftp.ensembl.org/pub/release-107/variation/vcf/homo_sapiens/homo_sapiens-chr19.vcf.gz

...and decompress it:

gunzip -k homo_sapiens-chr19.vcf.gz

(In case of problems the data can also be found in this folder.)

The data is stored as a text file, 'homo_sapiens-chr19.vcf'. This file, downloaded from Ensembl, includes variants from multiple source databases. Let's first inspect the file using head.

head -n 30 'homo_sapiens-chr19.vcf'
##fileformat=VCFv4.1
##fileDate=20220607
##source=ensembl;version=107;url=https://e107.ensembl.org/homo_sapiens
##reference=ftp://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/
##INFO=<ID=ClinVar_202201,Number=0,Type=Flag,Description="Variants of clinical significance imported from ClinVar">
##INFO=<ID=dbSNP_154,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
##INFO=<ID=HGMD-PUBLIC_20204,Number=0,Type=Flag,Description="Variants from HGMD-PUBLIC dataset December 2020">
##INFO=<ID=COSMIC_95,Number=0,Type=Flag,Description="Somatic mutations found in human cancers from the COSMIC catalogue">
(etc.)

A bit like the GFF files, the file contains lots of 'metadata' at the top before the real data starts. It includes descriptions of some variables that appear later in the file, such as the dbSNP_154 variable, which indicates variants sourced from the dbSNP database of genetic variants.

Let's use grep to temporarily remove the lengthy header to see the remaining content of the file:

grep -v '^##' 'homo_sapiens-chr19.vcf' | head -n 20

This should print something like:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
19 60062 rs1555674440 G C . . dbSNP_154;TSA=SNV
19 60165 rs1415141782 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60173 rs1371922052 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60184 rs1391618909 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60223 rs1187548881 A G . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
19 60251 rs1310995734 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60281 rs1380383890 A G . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60319 rs1244952011 C T . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
19 60326 rs1292465550 A G . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
(etc.)
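The same split between metadata and data can be sketched in base R. This is only an illustration on four lines copied from the output above (re-joined with tabs, which is an assumption about the raw file layout), not a replacement for the grep command:

```r
# Four lines copied from the file: two metadata lines, the column header,
# and one data line. Tabs are assumed as the field separator.
vcf_text <- c(
  "##fileformat=VCFv4.1",
  "##fileDate=20220607",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "19\t60062\trs1555674440\tG\tC\t.\t.\tdbSNP_154;TSA=SNV"
)

# Metadata lines start with '##'; the column header starts with a single '#'
is_meta <- startsWith(vcf_text, "##")

# Keep everything that is not metadata (column header + data lines)
body <- vcf_text[!is_meta]

cat("metadata lines:", sum(is_meta), "\n")
cat(body, sep = "\n")
```

Matching only at the start of the line keeps any data line that happens to contain '##' somewhere inside it.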

The VCF file, much like the GFF file, includes some obligatory columns, as well as a list of attributes in the last column. The required columns are:

Column    Description
CHROM     Name of the chromosome or sequence
POS       Position of the described variant
ID        Name of the variant from a database of known variants. dbSNP entries all start with 'rs'
REF       Reference genome base(s)
ALT       Alternate base(s) observed at this position
QUAL      Quality score associated with the variant
FILTER    'PASS' if the variant passed all quality filters, otherwise a list of the filters it failed
INFO      A list of flags and key=value pairs similar to that in the GFF file

A dot ('.') in any column indicates a missing value.
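As a minimal sketch of how these columns map onto a table, the snippet below reads two data lines (copied from the output above and re-joined with tabs) using base R. The names vcf_lines, vcf_columns and example_df are made up for this illustration, and na.strings = "." is one possible way to treat the dot as a missing value:

```r
# Two data lines from the file, written with tabs between fields
vcf_lines <- c(
  "19\t60062\trs1555674440\tG\tC\t.\t.\tdbSNP_154;TSA=SNV",
  "19\t60165\trs1415141782\tG\tA\t.\t.\tdbSNP_154;TSA=SNV;E_Freq;E_gnomAD"
)

# The eight required VCF columns, in order
vcf_columns <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

# na.strings = "." converts the VCF missing-value marker into NA
example_df <- read.delim(
  text = vcf_lines,
  header = FALSE,
  col.names = vcf_columns,
  na.strings = ".",
  stringsAsFactors = FALSE
)

print(example_df)

# QUAL and FILTER are missing for both variants
print(is.na(example_df$QUAL))
```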

Loading the VCF file in R

We will first use a base R function to read the VCF file excluding the header, and we'll give the columns meaningful names according to the table above:

df <- read.delim(
  'homo_sapiens-chr19.vcf',
  comment.char = '#',
  header = FALSE,
  col.names = c( 'CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO' )
)

head(df)

You should see:

  CHROM   POS           ID REF ALT QUAL FILTER
1 19 60062 rs1555674440 G C . .
2 19 60165 rs1415141782 G A . .
3 19 60173 rs1371922052 G A . .
4 19 60184 rs1391618909 G A . .
5 19 60223 rs1187548881 A G . .
6 19 60251 rs1310995734 G A . .
INFO
1 dbSNP_154;TSA=SNV
2 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
3 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
4 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
5 dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
6 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
(etc.)

We have now run into a problem similar to the one we had with the GFF file: the data in the INFO column is not parsed. We will use the package VariantAnnotation to read in the VCF again and extract the data in INFO in a usable way. First load the library:

library( VariantAnnotation )

and now use it to re-load the data:


annot <- readVcf( 'homo_sapiens-chr19.vcf' )

head(annot)

Helpfully this has parsed the info for us:

class: CollapsedVCF 
dim: 6 0
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 32 columns: ClinVar_202201, dbSNP_154, HGMD-PUBLIC_20204, C...
info(header(vcf)):
Number Type Description
ClinVar_202201 0 Flag Variants of clinical significa...
dbSNP_154 0 Flag Variants (including SNPs and i...
HGMD-PUBLIC_20204 0 Flag Variants from HGMD-PUBLIC data...
COSMIC_95 0 Flag Somatic mutations found in hum...

(...)

MA 1 String Minor Allele
MAF 1 Float Minor Allele Frequency
MAC 1 Integer Minor Alelele Count
AA 1 String Ancestral Allele
geno(vcf):
List of length 0:
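To get a feel for what this parsing involves, here is a rough base R sketch that turns a few INFO strings (copied from the output earlier) into separate columns. It is not how VariantAnnotation works internally; has_flag() and get_value() are hypothetical helpers written only for this illustration:

```r
# Three INFO strings copied from the first rows of the file
info_strings <- c(
  "dbSNP_154;TSA=SNV",
  "dbSNP_154;TSA=SNV;E_Freq;E_gnomAD",
  "dbSNP_154;TSA=SNV;E_Freq;E_TOPMed"
)

# Split each INFO string into its ';'-separated entries
entries <- strsplit(info_strings, ";", fixed = TRUE)

# Hypothetical helper: TRUE where a bare flag (no '=') is present
has_flag <- function(flag) {
  vapply(entries, function(e) flag %in% e, logical(1))
}

# Hypothetical helper: value of a key=value entry, or NA if the key is absent
get_value <- function(key) {
  prefix <- paste0(key, "=")
  vapply(entries, function(e) {
    hit <- e[startsWith(e, prefix)]
    if (length(hit) == 0) NA_character_ else substring(hit[1], nchar(prefix) + 1)
  }, character(1))
}

parsed <- data.frame(
  dbSNP_154 = has_flag("dbSNP_154"),
  E_gnomAD  = has_flag("E_gnomAD"),
  TSA       = get_value("TSA"),
  stringsAsFactors = FALSE
)

print(parsed)
```

Flags become TRUE/FALSE columns and key=value entries become ordinary value columns, which is the same shape as the info table shown above.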

To get ready for analysis, let's combine the two portions of the VCF file (the data in df and the info columns parsed above) and delete the original INFO column now:

vcf <- cbind( df, annot@info )
vcf$INFO <- NULL
rm(df, annot)

FUT2 revisited

Let's examine the variants present in the gene FUT2. To do that, we save the start and end locations of the gene, which we extracted earlier from the GFF file, then use them to subset our large VCF file. Next, let's tabulate the data to see how many variants in FUT2 have a known clinical association, based on data from the ClinVar database.

FUT2_start <- FUTs_df[FUTs_df$Name == 'FUT2', 'start']
FUT2_end <- FUTs_df[FUTs_df$Name == 'FUT2', 'end']

FUTs_vcf <- subset(vcf, POS >= FUT2_start & POS <= FUT2_end)

FUTs_vcf <- subset(FUTs_vcf, dbSNP_154 == TRUE & E_Phenotype_or_Disease == TRUE)

table(FUTs_vcf$CLIN_association)
FALSE  TRUE 
30 1

Of the selected variants, only one has a clinical association. Let's check which one that is:

FUTs_vcf[ FUTs_vcf$CLIN_association == TRUE, 'ID' ]
[1] "rs601338"

Read about the variant and what it causes and prepare a short summary of your findings. The methods section of this paper is a good starting place.

Advanced example: linking variants to exons

Indexing the data

Next, we will use more advanced tools to combine FASTA, GFF, and VCF information and predict the effect of the SNP rs601338 on translation. In this practical, we are using a small fragment of a single human genome, but in common bioinformatic applications we would use tens or hundreds of sequences, with files getting really big and more difficult to manipulate. One solution to that problem is to use a more powerful machine, but another is to use file formats that allow efficient computation. Tabix is a type of index that gives fast access to the rows of a large, compressed file by their genomic coordinates. We will index our VCF file now using Rsamtools:

bgzip( 'homo_sapiens-chr19.vcf', 'homo_sapiens-chr19.vcf.bgz', overwrite = TRUE )
[1] "homo_sapiens-chr19.vcf.bgz"
tabix <- indexTabix( 'homo_sapiens-chr19.vcf.bgz', format = 'vcf' )

This produces a new file, 'homo_sapiens-chr19.vcf.bgz.tbi' that allows programs to easily access individual rows.

We will also index the genome file.

indexFa( 'Homo_sapiens.GRCh38.dna.chromosome.19.fa' )

This produces a new file, 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.fai' that helps programs access the sequences. Let's load it again:

genome <- FaFile(file = 'Homo_sapiens.GRCh38.dna.chromosome.19.fa' )

Finally, we will prepare our GFF file for use by creating a 'TxDb' object - a transcript database - which is a data structure that makes it easy to search the gene transcripts:

txdb <- makeTxDbFromGFF( 'Homo_sapiens.GRCh38.107.chromosome.19.gff3' )

This will print out some information such as:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

Defining regions of interest

Let's define some interesting ranges of the genome which correspond to the FUT genes on chromosome 19. IRanges and GRanges objects are an efficient means of accessing specified regions from indexed genomic files.

regions <- GRanges(
  seqnames = "19",
  ranges = IRanges(
    start = FUTs_df$start,
    end = FUTs_df$end,
    names = FUTs_df$Name
  )
)

print(regions)

You should see something like:

GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
FUT6 19 5830408-5839722 *
FUT3 19 5842888-5851471 *
FUT5 19 5865826-5870540 *
FUT2 19 48695971-48705951 *
FUT1 19 48748011-48755390 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

We will now use the VCF tabix index and the FUT ranges defined above to load only the section of the VCF data corresponding to these genes.

vcf_short <- readVcf(tabix, genome = "GRCh38.107", param = regions )
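Conceptually, restricting the VCF to these ranges is an overlap query: keep each variant whose position falls inside one of the regions. The base R sketch below illustrates the idea on three positions taken from the outputs in this practical. It does not use the tabix index, and fut_regions and find_region() are names invented for this example:

```r
# Gene coordinates copied from the GRanges output above
fut_regions <- data.frame(
  Name  = c("FUT6", "FUT3", "FUT5", "FUT2", "FUT1"),
  start = c(5830408, 5842888, 5865826, 48695971, 48748011),
  end   = c(5839722, 5851471, 5870540, 48705951, 48755390),
  stringsAsFactors = FALSE
)

# Three variant positions on chromosome 19, named by variant ID
positions <- c(rs1555674440 = 60062, rs1304301383 = 5831378, rs601338 = 48703417)

# Hypothetical helper: name of the region containing 'pos', or NA if none does
find_region <- function(pos) {
  hit <- fut_regions$Name[pos >= fut_regions$start & pos <= fut_regions$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}

gene_hit <- vapply(positions, find_region, character(1))
print(gene_hit)
```

The indexed approach gives the same kind of answer without scanning every row, which is what makes it practical for large files.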

Bringing it together

Finally, let's bring together all the files we saw today to predict the effect of the variants on the protein sequences, which are produced from the DNA via mRNA. We will use the package VariantAnnotation, which will read the portion of the FASTA file of interest, find the reading frames and annotations of the FUT genes using the GFF3 file, and perform the nucleotide changes specified in the VCF file. It will then translate the DNA sequences and report the effect on the protein: whether the variant introduces a STOP codon (a nonsense mutation), an amino acid substitution (a missense mutation), or no change at all (a silent mutation).

coding <- predictCoding(vcf_short, subject = txdb, seqSource = genome)
coding

You should see something like this:

GRanges object with 9603 ranges and 17 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <CharacterList> <numeric> <character>
rs1182309127 19 5831378-5831394 - | FUT6 TCAGGCAGGTGAAGCTT rs1182309127 T,TCAGGCAGGTGAAGCTTCAG.. NA .
rs1182309127 19 5831378-5831394 - | FUT6 TCAGGCAGGTGAAGCTT rs1182309127 T,TCAGGCAGGTGAAGCTTCAG.. NA .
rs1304301383 19 5831378 - | FUT6 T rs1304301383 C NA .
rs1158033586 19 5831382 - | FUT6 G rs1158033586 A NA .
rs779049413 19 5831383 - | FUT6 C rs779049413 T NA .
... ... ... ... . ... ... ... ... ... ...
rs1269695902 19 48751277 - | FUT1 C rs1269695902 T NA .
rs1380079562 19 48751279 - | FUT1 C rs1380079562 A NA .
CM1617022 19 48751281 - | FUT1 T CM1617022 NA .
rs200607617 19 48751281 - | FUT1 T rs200607617 C,G NA .
rs200607617 19 48751281 - | FUT1 T rs200607617 C,G NA .
(etc.)
-------
seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths

Let's subset the results object to see the consequence of the mutation at SNP rs601338.

coding["rs601338", ]
GRanges object with 1 range and 17 metadata columns:
seqnames ranges strand | paramRangeID REF
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
rs601338 19 48703417 + | FUT2 G
ALT QUAL FILTER varAllele CDSLOC
<CharacterList> <numeric> <character> <DNAStringSet> <IRanges>
rs601338 A NA . A 461
PROTEINLOC QUERYID TXID CDSID
<IntegerList> <integer> <character> <IntegerList>
rs601338 154 8508 5594 20226,20225,20224
GENEID CONSEQUENCE REFCODON VARCODON
<character> <factor> <DNAStringSet> <DNAStringSet>
rs601338 ENSG00000176920 nonsense TGG TAG
REFAA VARAA
<AAStringSet> <AAStringSet>
rs601338 W *
-------
seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths

Aha! This tells us that:

  • the mutation is in amino acid number 154 (PROTEINLOC)

  • the mutation is a nonsense mutation (CONSEQUENCE) (i.e. it gives rise to a stop codon). The non-mutated amino acid is tryptophan (symbol W, encoded by TGG) and the one resulting from the mutation is a stop codon (*, encoded by TAG).
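The numbers in this output can be checked by hand. The short sketch below, using only base R and the values reported above (CDSLOC 461, REFCODON TGG, varAllele A), works out which codon and which base within it are affected. The lookup table is deliberately tiny and covers only the two codons involved:

```r
cds_pos   <- 461    # CDSLOC: position of the variant within the coding sequence
ref_codon <- "TGG"  # REFCODON
var_base  <- "A"    # varAllele

# Codons are consecutive triplets, so CDS position 461 lies in codon ceiling(461 / 3)
codon_number <- ceiling(cds_pos / 3)

# Position of the changed base within that codon (1, 2 or 3)
pos_in_codon <- (cds_pos - 1) %% 3 + 1

# Apply the substitution to the reference codon
var_codon <- ref_codon
substr(var_codon, pos_in_codon, pos_in_codon) <- var_base

# Minimal codon lookup: only the two codons needed for this example
codon_table <- c(TGG = "W", TAG = "*")

cat("codon number:   ", codon_number, "\n")
cat("base in codon:  ", pos_in_codon, "\n")
cat("codon change:   ", ref_codon, "->", var_codon, "\n")
cat("residue change: ", codon_table[[ref_codon]], "->", codon_table[[var_codon]], "\n")
```

The results agree with the PROTEINLOC, VARCODON and VARAA columns: the second base of codon 154 changes, turning TGG (W) into TAG (stop).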

So it looks like secretor status is determined by the ability or inability to produce full-length FUT2 protein. Without FUT2, it is impossible to fucosylate the required carbohydrates, and thus the blood group antigens are not present in secretions. This makes sense in the context of it being a recessive trait, as a single working copy of FUT2 must be sufficient to fucosylate all the required glycans.