Required packages

Use the following code to install the required packages, if they’re not present on your system. If you are using one of the computers in the teaching room, the packages have already been installed for you.

install.packages("rtracklayer")

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")
BiocManager::install("Gviz")
BiocManager::install("VariantAnnotation")
BiocManager::install("GenomicFeatures")

With every new session, you need to tell R which packages to load from the storage. This is similar to Python’s import function. Use the code below to import the required packages rtracklayer, Gviz, GenomicRanges, VariantAnnotation, and GenomicFeatures.

library(rtracklayer)
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
library(Gviz)
## Warning: package 'Gviz' was built under R version 4.0.4
options(ucscChromosomeNames=TRUE)

library(GenomicRanges)

library(VariantAnnotation)
library(GenomicFeatures)
## Warning: package 'GenomicFeatures' was built under R version 4.0.4

Introduction

This tutorial will help you take a closer look at three common file formats used in bioinformatics research: FASTA read lists, GFF3 genome annotation files, and VCF variant annotation files. You will explore command line tools that take you from FASTA files to VCF files in later sessions, but for now, we will inspect pre-computed files using bash commands and R functions you are already familiar with, gradually introducing more advanced functionalities.

Two large consortia exist maintaining genomic sequences: Ensembl, maintained by the European bioinformatics institute, and UCSC, maintained by University of California Santa Cruz and funded by NIH, along a few other initiatives, such as ENCODE and RefSeq. These are largely interchangeable but use a differnt indexing strategy and own annotation pipelines. We will explore files relating to chromosome 19 sequence and annotation using the files provided by Ensembl consortium.

You can download the files by acessing Ensembl website (above), or using the code below which will download them for you.

download.file(url = 'http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz', destfile = 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz')

download.file(url = 'http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz', destfile = 'Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz')

download.file(url = 'http://ftp.ensembl.org/pub/release-107/variation/gvf/homo_sapiens/homo_sapiens-chr9.gvf.gz', destfile = 'homo_sapiens-chr9.gvf.gz')
gunzip -k 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz'
gunzip -k 'Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz'
gunzip -k 'homo_sapiens-chr9.gvf.gz'

Genome sequence files

The data is stored as a text file ‘Homo_sapiens.GRCh38.dna.chromosome.19.fa’. GRCh38 refers to the name of the assembly - this is the latest published sequence. R Studio is a multilingual environment. Each code chunk you execute can be written in a different language. To tell R Studio which language you want to use, swap the default ‘{r}’ at the beginning of the code chunk for ‘{sh}’ to run the commands as if you were running them from the command prompt/terminal, or ‘{python}’ to run them in Python. Refer to R Studio documentation to find out more. Let’s use the bash commands head and tail to inspect the first and last 10 lines of the FASTA file.

echo 'head'
head -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
echo 'tail'
tail -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## head
## >19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## tail
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNN

We have the header line, starting with the ‘>’ character, indicating the sequence name, and the sequence itself should be below. But do you notice something strange? All the bases in the sequence are Ns rather than the expected A, C, G, and T. Why? N indicates an ambiguous base, genetic information is protected by the means of long telomeres at either end of the chromosome. Let’s see how many lines our file contains using the command wc.

wc -l 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
##   976962 Homo_sapiens.GRCh38.dna.chromosome.19.fa

OK, we have about 980,000 lines. Let’s select some lines at random and use awk to check whether we can see the expected sequences consisting of A, G, C, and T.

awk 'FNR>=1000 && FNR<=1020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## GATCACAGAGGCTGGGCTGCTCCCCACCCTCTGCACACCTCCTGCTTCTAACAGCAGAGC
## TGCCAGGCCAGGCCCTCAGGCAAGGGCTCTGAAGTCAGGGTCACCTACTTGCCAGGGCCG
## ATCTTGGTGCCATCCAGGGGGCCTCTACAAGGATAATCTGACCTGCAGGGTCGAGGAGTT
## GACGGTGCTGAGTTCCCTGCACTCTCAGTAGGGACAGGCCCTATGCTGCCACCTGTACAT
## GCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGTATTTACACATGCACACATGG
## CTACTGATGGGGCAAGCACTTCACAACCCCTCATGATCACGTGCAGCAGACAATGTGGCC
## TCTGCAGAGGGGGAACGGAGACCGGAGGCTGAGACTGGCAAGGCTGGACCTGAGTGTCGT
## CACCTAAATTCAGACGGGGAACTGCCCCTGCACATAGTGAACGGCTCACTGAGCAAACCC
## CGAGTCCCGACCACCGCCTCAGTGTGGTCTAGCTCCTCACCTGCTTCCATCCTCCCTGGT
## GCGGGGTGGGCCCAGTGATATCAGCTGCCTGCTGTTCCCCAGATGTGCCAAGTGCATTCT
## TGTGTGCTTGCATCTCATGGAACGCCATTTCCCCAGACATCCCTGTGGCTGGCTCCTGAT
## GCCCGAGGCCCAAGTGTCTGATGCTTTAAGGCACATCACCCCACTCATGCTTTTCCATGT
## TCTTTGGCCGCAGCAAGGCCGCTCTCACTGCAAAGTTAACTCTGATGCGTGTGTAACACG
## ACATCCTCCTCCCAGTCGCCCCTGTAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCA
## CAGGGCCATACTCCACGTGCAGAGCAGCCTCAGCACTCACCGGGCACGAGCGAGCCTGTG
## TGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGA
## GGGTGTGGGATGGTGGAGGGGTTTGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAA
## GGGAGGGGGAGGATGTGGGATGGTGGAGGGGCTGCAGACTCTGGGCTAGGGAAAGCTGGG
## ATGTCTCTAAAGGTTGGAATGAATGGCCTAGAATCCGACCCAATAAGCCAAAGCCACTTC
awk 'FNR>=970000 && FNR<=970020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## TAGAAGTATTAACTTATTTTGAGGGCTTAAAAAGGCTAGAAGTACTGATGTCCTTTTCCT
## GAGTCCTGAAGTCATTCTAGCCATCAACCTCTGGAGAAATGCTGCTGGGGCCATTTTACC
## ATGGGACCAGAAATACAAGTCCCTGACATGGGCTTGGCTGAGAAGAAGCAAGTGGGGTGC
## AAACTATGTGTGCTTTCATGTTGCAAAGAAGCTGTGTTGAATCAACAAATATTACTTGAG
## CACTTGCCAGGATTCCAGGTACTGTTCCAGGGCTGGATCACAGTGATGAGTGGGGCAGGT
## CTTACACCTGCTTTCATTGGGCCTACATCCATTCCATGGCTGATTGGATCAACAAAAATG
## AAGTTGCCCATTCAGTACCTCAATTGCACTGGGCACCTTTCAACTGCTCATTGGCCCTGC
## AGCTAATGCTACTATGTTGGACAGTACAGATCTAGAACATCTCATTTATAGAAAGTTCCA
## TTGGACAGCACTGCTCTAGATAGGGTGATAGACATGATAAATTAATGAATCAGATATTTG
## GGATGGTAATGAGTGTTGGAAAGAGAAGGAAACAATGGGATACATTGGTGGGGGACAAGC
## GAGGTACGGGTGGGCAGCAGTAGTTCAGAACTAAGTAGGAAGATGGTTCTGAGGAGGTGA
## CGTTTGAGCTCAGACCTGAATGGGAAAGGCGAACCAGGTGAAGATGTTGGGGAAGGTGAA
## CCAGGCGAAGATGTTGGGGAAGACTATACCAGGCAGGAGGGTGGTGCACATGTGGATCTC
## TGGATGAAAGGCCATATTCTAGGAAAACTAAGATGGTGGCTGTGGGAGACCCCTTTAGCG
## TCAAGGACATGAAATAACAGGGTTTGGAACCCGGCTCCTTTGTTGCCCACTATGTTAGGC
## CGTTCTTAAATTGCTGTAAATACCTGAGACTGGGTTATTTATGAAGATAAGAGATTTAAT
## TGGCTCACAGTTCTGCAGGCTGTACAAGGGTGGCACTGGCATCTGGTTGGCTTCTGGGGA
## GGCCTCAGGGAGCTTTTTACTCATGGTGGAAGGCAGGGAGCAGGCACATCATGGGAAAAG
## CAGCAGCCAGAGAGAGGAAGAAGGTGCCGCATACTGTTGTTTGTTTGTTTTGTTTTTTTT
## TTTGGAGACACGGTCTCACTCTCACCTAGGCGGTAGTACAGTGGTGCGATCATGACCCAC
## TGCAGCCTCGACTTCCTGAGTTCAAGCGATTCTCCCACCTCAGCTTCCTGAGTAGCCAGG

Aha! All looks good. Let’s now load the sequence into R’s memory and take a closer look at it. We will skip the header and manipulate the original object so that individual lines are joined together and then split by character.

fasta <- scan('Homo_sapiens.GRCh38.dna.chromosome.19.fa',
              what = 'character', #The type of file to be read
              skip = 1) #Skip the header

fasta <- paste0(fasta) #Add the individual lines together

fasta <- strsplit(fasta, split = '') #Split the long string into characters
fasta <- unlist(fasta) #Convert the object from a list back to a string

fasta[1:10]
##  [1] "N" "N" "N" "N" "N" "N" "N" "N" "N" "N"

Let’s see how many of each type of nucleotide is present in the sequence.

nucleotides <- table(fasta)

nucleotides
## fasta
##        A        C        G        N        T 
## 15142293 13954580 14061132   176858 15282753

Let’s create a barplot to visualise the data.

barplot(nucleotides)

OK, so there are about 200,000 N nucleotides, which form the minority of all the other bases. It would be interesting to know the telomere lengths. Let’s try to write a while-loop to measure the length of the first telomere.

i <- 1 #Initialise the iterator value
n <- 0 #Initialise N counter

while (fasta[i] == 'N') {
  i <- i + 1 #Add 1 each time this condition is true
  n <- n + 1
}

n
## [1] 60000

It looks like the telomere is 60kb long. The actual telomeres are composed of TTAGGG repeats and the number of ambiguous ‘N’ bases at chromosome ends is somewhat artificial, but it helps with bioinformatics applications, such as annotation and alignment. Let’s print the fragment 2 bases upstream and donstream of the last N to make sure our calculation is correct.

fasta[ (n-2) : (n+2) ]
## [1] "N" "N" "N" "G" "A"

Bingo, the base 6000 is the last N character.

Let’s now take a look at the other, coding bases, and check if AT and GC are in roughtly 50:50 proportion. We will convert the table object to a data frame in order to manipulate it. We will also transpose the table using t() function, so that the rows become columns - this will help in analysis.

nucleotides <- data.frame(nucleotides, row.names = rownames(nucleotides)) #Convert to a data frame object
colnames(nucleotides) <- c('nucleotide', 'count') #Add meaningful column names
nucleotides$nucleotide <- NULL #Delete unused column
nucleotides <- as.data.frame(t(nucleotides))  #Transpose and retain data frame structure (which gets simplified by default if there is only one column)

GC <- (nucleotides$G + nucleotides$C) / (nucleotides$A + nucleotides$T + nucleotides$G + nucleotides$C)
AT <- 1 - GC

GC
## [1] 0.4793865
AT
## [1] 0.5206135

Indeed, the ratio is approximately 50:50. Let’s add some code to present our data in a nicer way and make a barlot to show the relative percentages.

print( paste0('GC %: ', round(GC, digits = 2)))
## [1] "GC %: 0.48"
print( paste0('AT %: ', round(AT, digits = 2)))
## [1] "AT %: 0.52"
barplot(c(GC, AT), names.arg = c('GC %', 'AT %'))

The percentage of GC pairs is referred to as GC content and is an important QC diagnostic of sequencing runs. GC content differs between parts of genome as well as between the organisms. Regions with epigenetic control exhibit presence of CpG islands, GC-rich regions with methylated cytosine bases.

Genome annotation files

Now that we had a look at the bases in chromosome 19, let’s try to get a deeper understanding of the genomic elements they encode. Genome annotation files are separate from sequence files and can be stored in GFF3 or GTF format. The data we will use is stored as a text file ‘Homo_sapiens.GRCh38.107.chromosome.19.gff3’. GRCh38 refers to the name of the assembly - this is the latest ‘major release’, with 107 referring to the version of minor updates ‘minor release’. Let’s again use the bash command head to inspect the first 20 lines of the GFF file.

head -n 20 'Homo_sapiens.GRCh38.107.chromosome.19.gff3'
## ##gff-version 3
## ##sequence-region   19 1 58617616
## #!genome-build Genome Reference Consortium GRCh38.p13
## #!genome-version GRCh38
## #!genome-date 2013-12
## #!genome-build-accession GCA_000001405.28
## #!genebuild-last-updated 2022-04
## 19   GRCh38  chromosome  1   58617616    .   .   .   ID=chromosome:19;Alias=CM000681.2,chr19,NC_000019.10
## ###
## 19   havana  pseudogene  60951   71626   .   -   .   ID=gene:ENSG00000282458;Name=WASH5P;biotype=transcribed_processed_pseudogene;description=WASP family homolog 5%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:33884];gene_id=ENSG00000282458;logic_name=havana_homo_sapiens;version=1
## 19   havana  lnc_RNA 60951   70976   .   -   .   ID=transcript:ENST00000632506;Parent=gene:ENSG00000282458;Name=WASH5P-206;biotype=processed_transcript;tag=basic;transcript_id=ENST00000632506;transcript_support_level=1;version=1
## 19   havana  exon    60951   61894   .   -   .   Parent=transcript:ENST00000632506;Name=ENSE00003783010;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783010;rank=3;version=1
## 19   havana  exon    66346   66499   .   -   .   Parent=transcript:ENST00000632506;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 19   havana  exon    70928   70976   .   -   .   Parent=transcript:ENST00000632506;Name=ENSE00003781173;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781173;rank=1;version=1
## 19   havana  lnc_RNA 62113   66524   .   -   .   ID=transcript:ENST00000633719;Parent=gene:ENSG00000282458;Name=WASH5P-208;biotype=retained_intron;transcript_id=ENST00000633719;transcript_support_level=NA;version=1
## 19   havana  exon    62113   66524   .   -   .   Parent=transcript:ENST00000633719;Name=ENSE00003783013;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783013;rank=1;version=1
## 19   havana  lnc_RNA 63821   70951   .   -   .   ID=transcript:ENST00000633703;Parent=gene:ENSG00000282458;Name=WASH5P-207;biotype=processed_transcript;transcript_id=ENST00000633703;transcript_support_level=5;version=1
## 19   havana  exon    63821   64213   .   -   .   Parent=transcript:ENST00000633703;Name=ENSE00003781018;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781018;rank=3;version=1
## 19   havana  exon    66346   66499   .   -   .   Parent=transcript:ENST00000633703;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 19   havana  exon    70928   70951   .   -   .   Parent=transcript:ENST00000633703;Name=ENSE00003782721;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003782721;rank=1;version=1

As you can see, there are a few header lines that start with a ‘#’ specifying some details of the file and the genome build. Next, there is some tab-delimited text split into 9 columns. These are (based on Ensembl website):

Coulmn Description
Sequence name (seqname) Name of the chromosome or scaffold. Watch out for differences between annotation consortia. Chromosome 1 could be denoted as ‘chr1’ or just ‘1’.
Source (source) Annotation source. Includes ‘ensembl’ - automatic annotation program and ‘havana’ - manual annotation by HAVANA group.
Feature type (feature or type) Region type. Can be coding sequence (CDS), exon (exon), 3’ UTR (three_prime_UTR), 5’ UTR (five_prime_UTR), and more, depending on the annotation source.
Start (start) Genomic coordinate where the annotated sequence starts.
End (end) Genomic coordinate where the annotated sequence ends.
Score (score) A numeric value normally referring to the confidence score of the annotated sequence.
Strand (strand) Direction of the sequence in double-stranded DNA. Can be either ‘+’ (forward/sense strand) or ‘-’ (reverse/antisense strand).
ORF (frame) Refers to the start of the open reading frame (ORF). Can be ‘0’, ‘1’, or ‘2’ - ‘0’ indicates that the first base of the feature is the first base of a codon.
Attributes (attribute) A list of name-value pairs separated by semicolons, in the format ‘name=value’. Attributes differ in type and number based on the feature type and the annotation program.

A dot in any column ‘.’ indicates no value.

Let’s try base R function read.delim(), which reads any delimited file (tab, comma, etc.), to read the GFF into memory and examine it. As the GFF does not include column names, we will specify header = F and tell R to ignore all the lines that start with ‘#’ using the argument comment.char - these happen to be default parameters of function read.delim(), but it is a good practice to be explicit about the function arguments in your code. We will give the columns meaningful names.

gff <- read.delim('Homo_sapiens.GRCh38.107.chromosome.19.gff3',
                  comment.char = '#',
                  header = FALSE,
                  col.names = c('seqname', 'source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'attribute'))

head(gff)
##   seqname source       type start      end score strand frame
## 1      19 GRCh38 chromosome     1 58617616     .      .     .
## 2      19 havana pseudogene 60951    71626     .      -     .
## 3      19 havana    lnc_RNA 60951    70976     .      -     .
## 4      19 havana       exon 60951    61894     .      -     .
## 5      19 havana       exon 66346    66499     .      -     .
## 6      19 havana       exon 70928    70976     .      -     .
##                                                                                                                                                                                                                             attribute
## 1                                                                                                                                                                                ID=chromosome:19;Alias=CM000681.2,chr19,NC_000019.10
## 2 ID=gene:ENSG00000282458;Name=WASH5P;biotype=transcribed_processed_pseudogene;description=WASP family homolog 5%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:33884];gene_id=ENSG00000282458;logic_name=havana_homo_sapiens;version=1
## 3                                                 ID=transcript:ENST00000632506;Parent=gene:ENSG00000282458;Name=WASH5P-206;biotype=processed_transcript;tag=basic;transcript_id=ENST00000632506;transcript_support_level=1;version=1
## 4                                                                                Parent=transcript:ENST00000632506;Name=ENSE00003783010;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783010;rank=3;version=1
## 5                                                                                Parent=transcript:ENST00000632506;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 6                                                                                Parent=transcript:ENST00000632506;Name=ENSE00003781173;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781173;rank=1;version=1

As you can see, the first 8 columns are readily readable, but the column ‘attribute’ poses a challenge. We need the information contained in it to understand which gene the sequences relate to (most genes will have multiple rows associated with them)! We could write a function that splits the ‘attribute’ sting based on the ‘;’ character, then split the resulting strings based on ‘=’ character and separate them into columns. If you are a more advanced R user (including regex) and feel like a challenge, you can try coding this up - Google is your friend! For now, we will use an existing package rtracklayer to read the GFF into R’s memory and convert it to a familiar data frame format.

gff <- readGFF('Homo_sapiens.GRCh38.107.chromosome.19.gff3')
gff <- as.data.frame(gff)
head(gff)
##   seqid source       type start      end score strand phase
## 1    19 GRCh38 chromosome     1 58617616    NA      *    NA
## 2    19 havana pseudogene 60951    71626    NA      -    NA
## 3    19 havana    lnc_RNA 60951    70976    NA      -    NA
## 4    19 havana       exon 60951    61894    NA      -    NA
## 5    19 havana       exon 66346    66499    NA      -    NA
## 6    19 havana       exon 70928    70976    NA      -    NA
##                           ID        Alias            Name
## 1              chromosome:19 CM000681....            <NA>
## 2       gene:ENSG00000282458                       WASH5P
## 3 transcript:ENST00000632506                   WASH5P-206
## 4                       <NA>              ENSE00003783010
## 5                       <NA>              ENSE00003783498
## 6                       <NA>              ENSE00003781173
##                            biotype
## 1                             <NA>
## 2 transcribed_processed_pseudogene
## 3             processed_transcript
## 4                             <NA>
## 5                             <NA>
## 6                             <NA>
##                                                             description
## 1                                                                  <NA>
## 2 WASP family homolog 5, pseudogene [Source:HGNC Symbol;Acc:HGNC:33884]
## 3                                                                  <NA>
## 4                                                                  <NA>
## 5                                                                  <NA>
## 6                                                                  <NA>
##           gene_id          logic_name version       Parent   tag
## 1            <NA>                <NA>    <NA>               <NA>
## 2 ENSG00000282458 havana_homo_sapiens       1               <NA>
## 3            <NA>                <NA>       1 gene:ENS.... basic
## 4            <NA>                <NA>       1 transcri....  <NA>
## 5            <NA>                <NA>       1 transcri....  <NA>
## 6            <NA>                <NA>       1 transcri....  <NA>
##     transcript_id transcript_support_level constitutive ensembl_end_phase
## 1            <NA>                     <NA>         <NA>              <NA>
## 2            <NA>                     <NA>         <NA>              <NA>
## 3 ENST00000632506                        1         <NA>              <NA>
## 4            <NA>                     <NA>            0                -1
## 5            <NA>                     <NA>            0                -1
## 6            <NA>                     <NA>            0                -1
##   ensembl_phase         exon_id rank external_name ccdsid protein_id
## 1          <NA>            <NA> <NA>          <NA>   <NA>       <NA>
## 2          <NA>            <NA> <NA>          <NA>   <NA>       <NA>
## 3          <NA>            <NA> <NA>          <NA>   <NA>       <NA>
## 4            -1 ENSE00003783010    3          <NA>   <NA>       <NA>
## 5            -1 ENSE00003783498    2          <NA>   <NA>       <NA>
## 6            -1 ENSE00003781173    1          <NA>   <NA>       <NA>

OK, this is much better! Every attribute now sits in its own column, and there is an ‘NA’ value if the information for that row is missing. Let’s save the length of chromosome 19 (first row of the data frame) which will become useful later and subset the GFF file so that it contains automatic annotations generated by Ensembl merged with manual annotations by HAVANA group. Read more about those on the Ensembl website.

Recall that subsetting in R is possible using square brackets [ ] or the function subset().

chr19 <- gff[1, ]
gff <- subset(gff, source == 'ensembl_havana')

To calculate the number of genes, we will use the column ‘type’ and select ‘gene’. The gff$type == ‘gene’ will return a boolean (TRUE/FALSE) value for every row in the dataframe. To count the number of ‘TRUE’ values, we will use the property that TRUE is equal to 1 and FALSE is equal to 0. Therefore, sum of TRUE occurences is the number of rows that correspond to gene annotation in our GFF file

head(gff$type == 'gene') #This returns a long boolean
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
sum(gff$type == 'gene') #This counts the number of 'TRUE' occurences
## [1] 1343

Let/s take a closer look at the genes on chromosome 19. To do this, we will subset the original data frame to only include gene annotations. We can see that the number of rows in this new data frame is equal to the number we calculated before.

genes <- gff[gff$type == 'gene', ]  #Subset the data frame

nrow(genes) #Number of genes equal to what we calculated before
## [1] 1343

It would be interesting to see what proportion of genes lie on each strand of the DNA, denoted as + (sense) and - (antisense) in the gff file. We can produce count summaries using the fuction table().

table(genes$strand)
## 
##   -   + 
## 665 678

We can see that the proportion of genes on either strand is approximately equal.

What about genome coverage? How much of the genome is actually used to encode information about genes? Let’s calculate the proportion of DNA occupied by coding sequences in relation to the chromosome length.

cds <- gff[gff$type == 'CDS', ]
cds$length <- cds$end - cds$start

sum(cds$length) / (chr19$end - chr19$start)
## [1] 0.04460002

Can you carry out a similar calculation to see how much of the genome is covered by exons?

regions <- gff[gff$type%in% c('CDS', 'three_prime_UTR', 'five_prime_UTR'), ]
regions$length <- regions$end - regions$start

region_lengths <- data.frame()      #Create a data frame object to store the results
for (type in c('CDS', 'three_prime_UTR', 'five_prime_UTR')){
  region_lengths[type, 'length'] <- sum(regions[regions$type == type, 'length'])    #Calculate sum of lengths for each region
}

barplot(region_lengths$length, names.arg = rownames(region_lengths))

Can you hypothesise why the 3’ UTR sequences are, on average, longer than 5’?

We can see that, as suspected, coding sequences are the longest fragments of coding DNA. But we also saw that the coding DNA forms only a small fraction of the genome. To visualise this, let’s now add the length of the entire chromosome 19 for scale!

region_lengths['chromosome', 'length'] <- chr19$end - chr19$start

barplot(region_lengths$length, names.arg = rownames(region_lengths))

Let’s look at some interesting genes on chromosome 19. We will look at a class of enzymes called fucosyltarnsferases, which are involved in chemical modification of glycans displayed on the cell surface. There are 13 of them in the entire genome, let’s see how many on chromosome 13 - their symbols all start with ‘FUT’.

FUTs <- grep(pattern = 'FUT', gff$Name, value = TRUE) #Find all the gene names that contain 'FUT'
FUTs
##  [1] "FUT6"     "FUT6-202" "FUT6-201" "FUT3"     "FUT3-201" "FUT3-202"
##  [7] "FUT5"     "FUT2"     "FUT2-202" "FUT1"     "FUT1-206"

This is interewsting - there is some gene symbols as well as gene symbols with a dash and a number. Let’s take a closer look at their annotations.

subset(gff, Name %in% FUTs)
##        seqid         source type    start      end score strand phase
## 23031     19 ensembl_havana gene  5830408  5839722    NA      -    NA
## 23032     19 ensembl_havana mRNA  5830408  5839702    NA      -    NA
## 23061     19 ensembl_havana mRNA  5830774  5839717    NA      -    NA
## 23117     19 ensembl_havana gene  5842888  5851471    NA      -    NA
## 23118     19 ensembl_havana mRNA  5842888  5851449    NA      -    NA
## 23136     19 ensembl_havana mRNA  5842890  5851449    NA      -    NA
## 23198     19 ensembl_havana gene  5865826  5870540    NA      -    NA
## 147878    19 ensembl_havana gene 48695971 48705951    NA      +    NA
## 147886    19 ensembl_havana mRNA 48695971 48705951    NA      +    NA
## 148150    19 ensembl_havana gene 48748011 48755390    NA      -    NA
## 148151    19 ensembl_havana mRNA 48748011 48752641    NA      -    NA
##                                ID Alias     Name        biotype
## 23031        gene:ENSG00000156413           FUT6 protein_coding
## 23032  transcript:ENST00000318336       FUT6-202 protein_coding
## 23061  transcript:ENST00000286955       FUT6-201 protein_coding
## 23117        gene:ENSG00000171124           FUT3 protein_coding
## 23118  transcript:ENST00000303225       FUT3-201 protein_coding
## 23136  transcript:ENST00000458379       FUT3-202 protein_coding
## 23198        gene:ENSG00000130383           FUT5 protein_coding
## 147878       gene:ENSG00000176920           FUT2 protein_coding
## 147886 transcript:ENST00000425340       FUT2-202 protein_coding
## 148150       gene:ENSG00000174951           FUT1 protein_coding
## 148151 transcript:ENST00000645652       FUT1-206 protein_coding
##                                                                        description
## 23031                      fucosyltransferase 6 [Source:HGNC Symbol;Acc:HGNC:4017]
## 23032                                                                         <NA>
## 23061                                                                         <NA>
## 23117  fucosyltransferase 3 (Lewis blood group) [Source:HGNC Symbol;Acc:HGNC:4014]
## 23118                                                                         <NA>
## 23136                                                                         <NA>
## 23198                      fucosyltransferase 5 [Source:HGNC Symbol;Acc:HGNC:4016]
## 147878                     fucosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:4013]
## 147886                                                                        <NA>
## 148150     fucosyltransferase 1 (H blood group) [Source:HGNC Symbol;Acc:HGNC:4012]
## 148151                                                                        <NA>
##                gene_id                       logic_name version       Parent
## 23031  ENSG00000156413 ensembl_havana_gene_homo_sapiens      15             
## 23032             <NA>                             <NA>      10 gene:ENS....
## 23061             <NA>                             <NA>       5 gene:ENS....
## 23117  ENSG00000171124 ensembl_havana_gene_homo_sapiens      14             
## 23118             <NA>                             <NA>      12 gene:ENS....
## 23136             <NA>                             <NA>       7 gene:ENS....
## 23198  ENSG00000130383 ensembl_havana_gene_homo_sapiens       7             
## 147878 ENSG00000176920 ensembl_havana_gene_homo_sapiens      13             
## 147886            <NA>                             <NA>       3 gene:ENS....
## 148150 ENSG00000174951 ensembl_havana_gene_homo_sapiens      12             
## 148151            <NA>                             <NA>       2 gene:ENS....
##          tag   transcript_id            transcript_support_level constitutive
## 23031   <NA>            <NA>                                <NA>         <NA>
## 23032  basic ENST00000318336  2 (assigned to previous version 8)         <NA>
## 23061  basic ENST00000286955                                   1         <NA>
## 23117   <NA>            <NA>                                <NA>         <NA>
## 23118  basic ENST00000303225 1 (assigned to previous version 10)         <NA>
## 23136  basic ENST00000458379  1 (assigned to previous version 6)         <NA>
## 23198   <NA>            <NA>                                <NA>         <NA>
## 147878  <NA>            <NA>                                <NA>         <NA>
## 147886 basic ENST00000425340  1 (assigned to previous version 2)         <NA>
## 148150  <NA>            <NA>                                <NA>         <NA>
## 148151 basic ENST00000645652                                <NA>         <NA>
##        ensembl_end_phase ensembl_phase exon_id rank external_name      ccdsid
## 23031               <NA>          <NA>    <NA> <NA>          <NA>        <NA>
## 23032               <NA>          <NA>    <NA> <NA>          <NA> CCDS12152.1
## 23061               <NA>          <NA>    <NA> <NA>          <NA> CCDS12152.1
## 23117               <NA>          <NA>    <NA> <NA>          <NA>        <NA>
## 23118               <NA>          <NA>    <NA> <NA>          <NA> CCDS12153.1
## 23136               <NA>          <NA>    <NA> <NA>          <NA> CCDS12153.1
## 23198               <NA>          <NA>    <NA> <NA>          <NA>        <NA>
## 147878              <NA>          <NA>    <NA> <NA>          <NA>        <NA>
## 147886              <NA>          <NA>    <NA> <NA>          <NA> CCDS33069.1
## 148150              <NA>          <NA>    <NA> <NA>          <NA>        <NA>
## 148151              <NA>          <NA>    <NA> <NA>          <NA> CCDS12733.1
##        protein_id
## 23031        <NA>
## 23032        <NA>
## 23061        <NA>
## 23117        <NA>
## 23118        <NA>
## 23136        <NA>
## 23198        <NA>
## 147878       <NA>
## 147886       <NA>
## 148150       <NA>
## 148151       <NA>

Bingo! Some annotations are for the genes and the others correspond to mRNA transcripts (those with the dashes). We are only interested in the genes, so let’s modify our search to reflect that.

FUTs <- grep('\\-', FUTs, value = TRUE, invert = TRUE) #\\ is an escape character, which tells grep to look for an actual dash, not a dash symbol with a special meaning.
                                                    #Invert tells grep to exclude genes containing a dash.

FUTs_df <- subset(gff, Name %in% FUTs)
FUTs_df
##        seqid         source type    start      end score strand phase
## 23031     19 ensembl_havana gene  5830408  5839722    NA      -    NA
## 23117     19 ensembl_havana gene  5842888  5851471    NA      -    NA
## 23198     19 ensembl_havana gene  5865826  5870540    NA      -    NA
## 147878    19 ensembl_havana gene 48695971 48705951    NA      +    NA
## 148150    19 ensembl_havana gene 48748011 48755390    NA      -    NA
##                          ID Alias Name        biotype
## 23031  gene:ENSG00000156413       FUT6 protein_coding
## 23117  gene:ENSG00000171124       FUT3 protein_coding
## 23198  gene:ENSG00000130383       FUT5 protein_coding
## 147878 gene:ENSG00000176920       FUT2 protein_coding
## 148150 gene:ENSG00000174951       FUT1 protein_coding
##                                                                        description
## 23031                      fucosyltransferase 6 [Source:HGNC Symbol;Acc:HGNC:4017]
## 23117  fucosyltransferase 3 (Lewis blood group) [Source:HGNC Symbol;Acc:HGNC:4014]
## 23198                      fucosyltransferase 5 [Source:HGNC Symbol;Acc:HGNC:4016]
## 147878                     fucosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:4013]
## 148150     fucosyltransferase 1 (H blood group) [Source:HGNC Symbol;Acc:HGNC:4012]
##                gene_id                       logic_name version Parent  tag
## 23031  ENSG00000156413 ensembl_havana_gene_homo_sapiens      15        <NA>
## 23117  ENSG00000171124 ensembl_havana_gene_homo_sapiens      14        <NA>
## 23198  ENSG00000130383 ensembl_havana_gene_homo_sapiens       7        <NA>
## 147878 ENSG00000176920 ensembl_havana_gene_homo_sapiens      13        <NA>
## 148150 ENSG00000174951 ensembl_havana_gene_homo_sapiens      12        <NA>
##        transcript_id transcript_support_level constitutive ensembl_end_phase
## 23031           <NA>                     <NA>         <NA>              <NA>
## 23117           <NA>                     <NA>         <NA>              <NA>
## 23198           <NA>                     <NA>         <NA>              <NA>
## 147878          <NA>                     <NA>         <NA>              <NA>
## 148150          <NA>                     <NA>         <NA>              <NA>
##        ensembl_phase exon_id rank external_name ccdsid protein_id
## 23031           <NA>    <NA> <NA>          <NA>   <NA>       <NA>
## 23117           <NA>    <NA> <NA>          <NA>   <NA>       <NA>
## 23198           <NA>    <NA> <NA>          <NA>   <NA>       <NA>
## 147878          <NA>    <NA> <NA>          <NA>   <NA>       <NA>
## 148150          <NA>    <NA> <NA>          <NA>   <NA>       <NA>

Let’s visualise the relative placement of all the FUT genes using a genome browser. We will use GenomicRanges object, which is a compact way of storing sequence information. We will then use Gviz package to create ‘tracks’ (in our case, the reference track will be empty, but it can contain information about the genome).

FUTs_gr <- with(FUTs_df, GRanges(seqid, IRanges(start, end), strand, id = Name))
 
chr_track <- IdeogramTrack(genome = 'hg38', chromosome = 'chr19')   #This creates a chromosome 'ideogram', or a graphic, with cytological bands, based on corresponding data from UCSC (a repository like Ensembl)
ref_track <- GenomeAxisTrack()    #This creates an empty track where our genes will be displayed
data_track <- AnnotationTrack(FUTs_gr, name = "Genes", showFeatureId = TRUE)    #This creates the annotations to display

When inspecting the data frame FUTs_df, above, you can see that there are two gene clusters of fucosyltransferase genes, one in the 5.8Mb (megabase, 1e6 bases) and one in the 48.7Mb region. We will plot these two regions separately as they are far apart.

#Cluster 1
plotTracks(c(chr_track, ref_track, data_track),
           from = 5.82e6, to = 5.88e6, sizes = c(1,1,2))

#Cluster 2
plotTracks(c(chr_track, ref_track, data_track),
           from = 48.68e6, to = 48.76e6, sizes = c(1,1,2))

Feel free to change the from and to values to zoom in and out of the loci. You can also produce some code that will automatically establish the optimal region to show using functions min(), max(), and the arguments extend.right/extend.left of the function plotTracks. The arrows indicate transcript directionality (remember ‘+’ and ‘-’ strands from our GFF? they correspond to right- and left-pointing arrow, respectively).

Variannt annotation file

Finally, let’s take a closer look at the variant annotation file. This file summarises genetic variation, and is often produced as a result of GWAS studies or population analysis - see 1000 Genomes Project. The data is stored as a text file ‘homo_sapiens-chr19.vcf’ downloaded from Ensembl includes variants from multiple source databases. Let’s inspect the file using head.

head -n 30 'homo_sapiens-chr19.vcf'
## ##fileformat=VCFv4.1
## ##fileDate=20220607
## ##source=ensembl;version=107;url=https://e107.ensembl.org/homo_sapiens
## ##reference=ftp://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/
## ##INFO=<ID=ClinVar_202201,Number=0,Type=Flag,Description="Variants of clinical significance imported from ClinVar">
## ##INFO=<ID=dbSNP_154,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
## ##INFO=<ID=HGMD-PUBLIC_20204,Number=0,Type=Flag,Description="Variants from HGMD-PUBLIC dataset December 2020">
## ##INFO=<ID=COSMIC_95,Number=0,Type=Flag,Description="Somatic mutations found in human cancers from the COSMIC catalogue">
## ##INFO=<ID=TSA,Number=1,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
## ##INFO=<ID=E_Cited,Number=0,Type=Flag,Description="Cited.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Multiple_observations,Number=0,Type=Flag,Description="Multiple_observations.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_TOPMed,Number=0,Type=Flag,Description="TOPMed.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Hapmap,Number=0,Type=Flag,Description="HapMap.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Phenotype_or_Disease,Number=0,Type=Flag,Description="Phenotype_or_Disease.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="ESP.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_gnomAD,Number=0,Type=Flag,Description="gnomAD.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_ExAC,Number=0,Type=Flag,Description="ExAC.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=CLIN_risk_factor,Number=0,Type=Flag,Description="risk factor.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_protective,Number=0,Type=Flag,Description="protective.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_confers_sensitivity,Number=0,Type=Flag,Description="confers sensitivity.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_other,Number=0,Type=Flag,Description="other.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_drug_response,Number=0,Type=Flag,Description="drug response.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_uncertain_significance,Number=0,Type=Flag,Description="uncertain significance.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_benign,Number=0,Type=Flag,Description="benign.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_likely_pathogenic,Number=0,Type=Flag,Description="likely pathogenic.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_pathogenic,Number=0,Type=Flag,Description="pathogenic.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_likely_benign,Number=0,Type=Flag,Description="likely benign.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_histocompatibility,Number=0,Type=Flag,Description="histocompatibility.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">

So far we can only see the header file, which includes decriptions of some variables, such as ID=dbSNP_154 and that it corresponds to variants sourced from dbSNP. Let’s temporarily remove the lengthy header to see the remaining content of the file.

grep -v '##' 'homo_sapiens-chr19.vcf' | head -n 20
## #CHROM   POS ID  REF ALT QUAL    FILTER  INFO
## 19   60062   rs1555674440    G   C   .   .   dbSNP_154;TSA=SNV
## 19   60165   rs1415141782    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60173   rs1371922052    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60184   rs1391618909    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60223   rs1187548881    A   G   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19   60251   rs1310995734    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60281   rs1380383890    A   G   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60319   rs1244952011    C   T   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60326   rs1292465550    A   G   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19   60351   rs1323466133    C   T   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19   60354   rs1202298947    T   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19   60360   rs111660247 C   G   .   .   dbSNP_154;TSA=SNV
## 19   60376   rs1263131587    C   T   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19   60377   rs1249701311    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19   60384   rs1480309998    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19   60398   rs1599941348    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq
## 19   60410   rs1178059410    C   T   .   .   dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19   60414   rs1293885345    G   C   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19   60419   rs1214163233    G   A   .   .   dbSNP_154;TSA=SNV;E_Freq;E_TOPMed

The VCF file, much like GFF file, includes some obligatory columns, as well as a list of attributes in the last column. The required columns are:

Column Description
CHROM Name of the chromosome or sequence
POS Position of the described variant
ID Name of the variant from a database of known variants. dbSNP entries all start with ‘rs’
REF Reference genome base(s)
ALT Actual base(s) detected
QUAL Quality score associated with the variant
FILTER Either ‘PASS’ or ‘FAIL’ value - if the variant passed quality filtering
INFO A list of key=value pairs similar to that in GFF file

A dot in any column ‘.’ indicates no value.

We will first use base R function to read the VCF file excluding the header, and we’ll give the columns meaningful names according to the table above

df <- read.delim('homo_sapiens-chr19.vcf',
                  comment.char = '#',
                  header = FALSE,
                  col.names = c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'))

head(df)
##   CHROM   POS           ID REF ALT QUAL FILTER
## 1    19 60062 rs1555674440   G   C    .      .
## 2    19 60165 rs1415141782   G   A    .      .
## 3    19 60173 rs1371922052   G   A    .      .
## 4    19 60184 rs1391618909   G   A    .      .
## 5    19 60223 rs1187548881   A   G    .      .
## 6    19 60251 rs1310995734   G   A    .      .
##                                INFO
## 1                 dbSNP_154;TSA=SNV
## 2 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 3 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 4 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 5 dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 6 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD

As we stumble upon a similar problem to that we had with GFF file - non-parsed data in INFO column - we will use the package VariationAnnotation to read in the VCF again and extract the data in INFO in a usable way.

annot <- readVcf('homo_sapiens-chr19.vcf')

head(annot)
## class: CollapsedVCF 
## dim: 6 0 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 32 columns: ClinVar_202201, dbSNP_154, HGMD-PUBLIC_20204, C...
## info(header(vcf)):
##                                Number Type    Description                      
##    ClinVar_202201              0      Flag    Variants of clinical significa...
##    dbSNP_154                   0      Flag    Variants (including SNPs and i...
##    HGMD-PUBLIC_20204           0      Flag    Variants from HGMD-PUBLIC data...
##    COSMIC_95                   0      Flag    Somatic mutations found in hum...
##    TSA                         1      String  Type of sequence alteration. C...
##    E_Cited                     0      Flag    Cited.https://www.ensembl.org/...
##    E_Multiple_observations     0      Flag    Multiple_observations.https://...
##    E_Freq                      0      Flag    Frequency.https://www.ensembl....
##    E_TOPMed                    0      Flag    TOPMed.https://www.ensembl.org...
##    E_Hapmap                    0      Flag    HapMap.https://www.ensembl.org...
##    E_Phenotype_or_Disease      0      Flag    Phenotype_or_Disease.https://w...
##    E_ESP                       0      Flag    ESP.https://www.ensembl.org/in...
##    E_gnomAD                    0      Flag    gnomAD.https://www.ensembl.org...
##    E_1000G                     0      Flag    1000Genomes.https://www.ensemb...
##    E_ExAC                      0      Flag    ExAC.https://www.ensembl.org/i...
##    CLIN_risk_factor            0      Flag    risk factor.https://www.ensemb...
##    CLIN_protective             0      Flag    protective.https://www.ensembl...
##    CLIN_confers_sensitivity    0      Flag    confers sensitivity.https://ww...
##    CLIN_other                  0      Flag    other.https://www.ensembl.org/...
##    CLIN_drug_response          0      Flag    drug response.https://www.ense...
##    CLIN_uncertain_significance 0      Flag    uncertain significance.https:/...
##    CLIN_benign                 0      Flag    benign.https://www.ensembl.org...
##    CLIN_likely_pathogenic      0      Flag    likely pathogenic.https://www....
##    CLIN_pathogenic             0      Flag    pathogenic.https://www.ensembl...
##    CLIN_likely_benign          0      Flag    likely benign.https://www.ense...
##    CLIN_histocompatibility     0      Flag    histocompatibility.https://www...
##    CLIN_not_provided           0      Flag    not provided.https://www.ensem...
##    CLIN_association            0      Flag    association.https://www.ensemb...
##    MA                          1      String  Minor Allele                     
##    MAF                         1      Float   Minor Allele Frequency           
##    MAC                         1      Integer Minor Alelele Count              
##    AA                          1      String  Ancestral Allele                 
## geno(vcf):
##   List of length 0:

Let’s combine the two portions of the VCF file and delete the original INFO column.

vcf <- cbind(df, annot@info)
vcf$INFO <- NULL
rm(df, annot)

Let’s examine the variants present in the gene FUT2. To do that, we save the start and end location of the gene we extracted earlier from the GFF file, then use that data to subset our large VCF file. Next, let’s tabulate the data to see how many variants in FUT2 have a known clinical association, based on data from ClinVar database.

FUT2_start <- FUTs_df[FUTs_df$Name == 'FUT2', 'start']
FUT2_end <- FUTs_df[FUTs_df$Name == 'FUT2', 'end']

FUTs_vcf <- subset(vcf, POS >= FUT2_start & POS <= FUT2_end)

FUTs_vcf <- subset(FUTs_vcf, dbSNP_154 == TRUE & E_Phenotype_or_Disease == TRUE)

table(FUTs_vcf$CLIN_association)
## 
## FALSE  TRUE 
##    30     1

Of the selected variants, only one has phenotype association. Let’s check which one that is:

FUTs_vcf[FUTs_vcf$CLIN_association == TRUE, 'ID']
## [1] "rs601338"

Read about the variant and what it causes and prepare a short summary of your findings. Methods section of this paper is a good starting place.

Next, we will use more advanced tools to combine FASTA, GFF, and VCF information and predict the effect of the SNP rs601338 on translation. In this practical, we are using a small fragment of a single human genome, but in common bioinformatic applications, we would use 10s or 100s of sequences, with files getting really big and more difficult to manipulate. One solution to that problem is to use a more powerful machine, but another is to use file formats that allow efficient computation. Tabix is a type of index file, which creates fast, easy-to-access coordinates system of large files. We will index our VCF file now using Rsamtools.

bgzip('homo_sapiens-chr19.vcf', 'homo_sapiens-chr19.vcf.bgz', overwrite = TRUE)
## [1] "homo_sapiens-chr19.vcf.bgz"
tabix <- indexTabix('homo_sapiens-chr19.vcf.bgz', format = 'vcf')

We will also index the genome file.

indexFa('Homo_sapiens.GRCh38.dna.chromosome.19.fa')
## [1] "Homo_sapiens.GRCh38.dna.chromosome.19.fa.fai"
genome <- FaFile(file = 'Homo_sapiens.GRCh38.dna.chromosome.19.fa')

Finally, we will prepare our GFF file for use by creating a TxDb file, which is an interface object to a database of features stored on your computer.

txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.107.chromosome.19.gff3')
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK

We also need an efficient way to work with regions of the genome. As with the genome browser visualisation above, we will define some interesting ranges of the genome which correspond to the FUT genes on chromosome 19. IRanges and Granges objects are efficient means of acessing specified regions from indexed genomic files.

regions <- GRanges(seqnames="19", ranges = IRanges(
                                  start = FUTs_df$start,
                                  end = FUTs_df$end,
                                  names = FUTs_df$Name))
regions
## GRanges object with 5 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##   FUT6       19   5830408-5839722      *
##   FUT3       19   5842888-5851471      *
##   FUT5       19   5865826-5870540      *
##   FUT2       19 48695971-48705951      *
##   FUT1       19 48748011-48755390      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We will now use the VCF tabix map and the FUT ranges defined above to load a section of VCF data corresponding to these genes only.

vcf_short <- readVcf(tabix, genome = "GRCh38.107", param = regions)

Finally, let’s bring all the files we saw today together to predict the effect of variants we saw on the protein sequences which are produced from the DNA via mRNA. We will use the package VariationAnnotation, which will source the portion of FASTA file of interest, find the reading frames and annotations of FUT genes using the GFF3 file, and performe nucleotide changes specified in the VCF file. It will then translate the DNA sequences and specify the effect to mRNA: wheather it’s a STOP codon (a nonsense mutation), a substitution (a missense mutation), or no change at all (a silent mutation).

coding <- predictCoding(vcf_short, subject = txdb, seqSource = genome)
coding
## GRanges object with 9603 ranges and 17 metadata columns:
##                seqnames          ranges strand | paramRangeID               REF
##                   <Rle>       <IRanges>  <Rle> |     <factor>    <DNAStringSet>
##   rs1182309127       19 5831378-5831394      - |         FUT6 TCAGGCAGGTGAAGCTT
##   rs1182309127       19 5831378-5831394      - |         FUT6 TCAGGCAGGTGAAGCTT
##   rs1304301383       19         5831378      - |         FUT6                 T
##   rs1158033586       19         5831382      - |         FUT6                 G
##    rs779049413       19         5831383      - |         FUT6                 C
##            ...      ...             ...    ... .          ...               ...
##   rs1269695902       19        48751277      - |         FUT1                 C
##   rs1380079562       19        48751279      - |         FUT1                 C
##      CM1617022       19        48751281      - |         FUT1                 T
##    rs200607617       19        48751281      - |         FUT1                 T
##    rs200607617       19        48751281      - |         FUT1                 T
##                                     ALT      QUAL      FILTER
##                         <CharacterList> <numeric> <character>
##   rs1182309127 T,TCAGGCAGGTGAAGCTTCAG..        NA           .
##   rs1182309127 T,TCAGGCAGGTGAAGCTTCAG..        NA           .
##   rs1304301383                        C        NA           .
##   rs1158033586                        A        NA           .
##    rs779049413                        T        NA           .
##            ...                      ...       ...         ...
##   rs1269695902                        T        NA           .
##   rs1380079562                        A        NA           .
##      CM1617022                                 NA           .
##    rs200607617                      C,G        NA           .
##    rs200607617                      C,G        NA           .
##                              varAllele    CDSLOC    PROTEINLOC   QUERYID
##                         <DNAStringSet> <IRanges> <IntegerList> <integer>
##   rs1182309127                       A 1079-1095       360,365       288
##   rs1182309127 AAGCTTCACC...ACCTGCCTGA 1079-1095       360,365       288
##   rs1304301383                       G      1095           365       289
##   rs1158033586                       T      1091           364       290
##    rs779049413                       A      1090           364       291
##            ...                     ...       ...           ...       ...
##   rs1269695902                       A         5             2     10291
##   rs1380079562                       T         3             1     10292
##      CM1617022                                 1             1     10293
##    rs200607617                       G         1             1     10294
##    rs200607617                       C         1             1     10294
##                       TXID         CDSID          GENEID    CONSEQUENCE
##                <character> <IntegerList>     <character>       <factor>
##   rs1182309127        8057         27954 ENSG00000156413  frameshift   
##   rs1182309127        8057         27954 ENSG00000156413  frameshift   
##   rs1304301383        8057         27954 ENSG00000156413  nonsynonymous
##   rs1158033586        8057         27954 ENSG00000156413  nonsynonymous
##    rs779049413        8057         27954 ENSG00000156413  nonsynonymous
##            ...         ...           ...             ...            ...
##   rs1269695902       12777         44420 ENSG00000174951 nonsense      
##   rs1380079562       12777         44420 ENSG00000174951 nonsynonymous 
##      CM1617022       12777         44420 ENSG00000174951 not translated
##    rs200607617       12777         44420 ENSG00000174951 nonsynonymous 
##    rs200607617       12777         44420 ENSG00000174951 synonymous    
##                          REFCODON                VARCODON         REFAA
##                    <DNAStringSet>          <DNAStringSet> <AAStringSet>
##   rs1182309127 GAAGCTTCACCTGCCTGA                      GA              
##   rs1182309127 GAAGCTTCACCTGCCTGA GAAGCTTCAC...ACCTGCCTGA              
##   rs1304301383                TGA                     TGG             *
##   rs1158033586                GCC                     GTC             A
##    rs779049413                GCC                     ACC             A
##            ...                ...                     ...           ...
##   rs1269695902                TGG                     TAG             W
##   rs1380079562                ATG                     ATT             M
##      CM1617022                ATG                      TG              
##    rs200607617                ATG                     GTG             M
##    rs200607617                ATG                     CTG             M
##                        VARAA
##                <AAStringSet>
##   rs1182309127              
##   rs1182309127              
##   rs1304301383             W
##   rs1158033586             V
##    rs779049413             T
##            ...           ...
##   rs1269695902             *
##   rs1380079562             I
##      CM1617022              
##    rs200607617             V
##    rs200607617             M
##   -------
##   seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths

Let’s subset the results object to see what the consequence is of mutations in SNP rs601338.

coding["rs601338", ]
## GRanges object with 1 range and 17 metadata columns:
##            seqnames    ranges strand | paramRangeID            REF
##               <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   rs601338       19  48703417      + |         FUT2              G
##                        ALT      QUAL      FILTER      varAllele    CDSLOC
##            <CharacterList> <numeric> <character> <DNAStringSet> <IRanges>
##   rs601338               A        NA           .              A       461
##               PROTEINLOC   QUERYID        TXID             CDSID
##            <IntegerList> <integer> <character>     <IntegerList>
##   rs601338           154      8508        5594 20226,20225,20224
##                     GENEID CONSEQUENCE       REFCODON       VARCODON
##                <character>    <factor> <DNAStringSet> <DNAStringSet>
##   rs601338 ENSG00000176920    nonsense            TGG            TAG
##                    REFAA         VARAA
##            <AAStringSet> <AAStringSet>
##   rs601338             W             *
##   -------
##   seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths

Aha! The mutation is in amino acid number 154 (PROTEINLOC), and the field COSEQUENCE tells us this is a nonsense mutation, with the usual amino acid REFAA being tryptophan (W) and the one resultiong from a mutation being * (STOP codon). So, it looks like secretor status is determined by the ability or inability to produce full-length FUT2 transcript. Without FUT2, it is impossible to fucosylate the required carbohydrates, and thus blood groups are not present. This makes sense in the context of it being a recessive trait, as a single copy of FUT2 must be sufficient to fucosylate all the required surface glycans.