Use the following code to install the required packages, if they’re not present on your system. If you are using one of the computers in the teaching room, the packages have already been installed for you.
install.packages("rtracklayer")
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
BiocManager::install("Gviz")
BiocManager::install("VariantAnnotation")
BiocManager::install("GenomicFeatures")
With every new session, you need to tell R which packages to load
from the storage. This is similar to Python’s import
function. Use the code below to import the required packages
rtracklayer
, Gviz
, GenomicRanges
,
VariantAnnotation
, and GenomicFeatures
.
library(rtracklayer)
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
library(Gviz)
## Warning: package 'Gviz' was built under R version 4.0.4
options(ucscChromosomeNames=TRUE)
library(GenomicRanges)
library(VariantAnnotation)
library(GenomicFeatures)
## Warning: package 'GenomicFeatures' was built under R version 4.0.4
This tutorial will help you take a closer look at three common file formats used in bioinformatics research: FASTA read lists, GFF3 genome annotation files, and VCF variant annotation files. You will explore command line tools that take you from FASTA files to VCF files in later sessions, but for now, we will inspect pre-computed files using bash commands and R functions you are already familiar with, gradually introducing more advanced functionalities.
Two large consortia exist maintaining genomic sequences: Ensembl, maintained by the European bioinformatics institute, and UCSC, maintained by University of California Santa Cruz and funded by NIH, along a few other initiatives, such as ENCODE and RefSeq. These are largely interchangeable but use a differnt indexing strategy and own annotation pipelines. We will explore files relating to chromosome 19 sequence and annotation using the files provided by Ensembl consortium.
You can download the files by acessing Ensembl website (above), or using the code below which will download them for you.
download.file(url = 'http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz', destfile = 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz')
download.file(url = 'http://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz', destfile = 'Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz')
download.file(url = 'http://ftp.ensembl.org/pub/release-107/variation/gvf/homo_sapiens/homo_sapiens-chr9.gvf.gz', destfile = 'homo_sapiens-chr9.gvf.gz')
gunzip -k 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz'
gunzip -k 'Homo_sapiens.GRCh38.107.chromosome.19.gff3.gz'
gunzip -k 'homo_sapiens-chr9.gvf.gz'
The data is stored as a text file
‘Homo_sapiens.GRCh38.dna.chromosome.19.fa’. GRCh38 refers to the name of
the assembly - this is the latest published sequence. R Studio is a
multilingual environment. Each code chunk you execute can be written in
a different language. To tell R Studio which language you want to use,
swap the default ‘{r}’ at the beginning of the code chunk for ‘{sh}’ to
run the commands as if you were running them from the command
prompt/terminal, or ‘{python}’ to run them in Python. Refer to R Studio
documentation to find out more. Let’s use the bash commands
head
and tail
to inspect the first and last 10
lines of the FASTA file.
echo 'head'
head -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
echo 'tail'
tail -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## head
## >19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## tail
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNN
We have the header line, starting with the ‘>’ character,
indicating the sequence name, and the sequence itself should be below.
But do you notice something strange? All the bases in the sequence are
Ns rather than the expected A, C, G, and T. Why? N indicates an
ambiguous base, genetic information is protected by the means of long
telomeres at either end of the chromosome. Let’s see how many lines our
file contains using the command wc
.
wc -l 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## 976962 Homo_sapiens.GRCh38.dna.chromosome.19.fa
OK, we have about 980,000 lines. Let’s select some lines at random
and use awk
to check whether we can see the expected
sequences consisting of A, G, C, and T.
awk 'FNR>=1000 && FNR<=1020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## GATCACAGAGGCTGGGCTGCTCCCCACCCTCTGCACACCTCCTGCTTCTAACAGCAGAGC
## TGCCAGGCCAGGCCCTCAGGCAAGGGCTCTGAAGTCAGGGTCACCTACTTGCCAGGGCCG
## ATCTTGGTGCCATCCAGGGGGCCTCTACAAGGATAATCTGACCTGCAGGGTCGAGGAGTT
## GACGGTGCTGAGTTCCCTGCACTCTCAGTAGGGACAGGCCCTATGCTGCCACCTGTACAT
## GCTATCTGAAGGACAGCCTCCAGGGCACACAGAGGATGGTATTTACACATGCACACATGG
## CTACTGATGGGGCAAGCACTTCACAACCCCTCATGATCACGTGCAGCAGACAATGTGGCC
## TCTGCAGAGGGGGAACGGAGACCGGAGGCTGAGACTGGCAAGGCTGGACCTGAGTGTCGT
## CACCTAAATTCAGACGGGGAACTGCCCCTGCACATAGTGAACGGCTCACTGAGCAAACCC
## CGAGTCCCGACCACCGCCTCAGTGTGGTCTAGCTCCTCACCTGCTTCCATCCTCCCTGGT
## GCGGGGTGGGCCCAGTGATATCAGCTGCCTGCTGTTCCCCAGATGTGCCAAGTGCATTCT
## TGTGTGCTTGCATCTCATGGAACGCCATTTCCCCAGACATCCCTGTGGCTGGCTCCTGAT
## GCCCGAGGCCCAAGTGTCTGATGCTTTAAGGCACATCACCCCACTCATGCTTTTCCATGT
## TCTTTGGCCGCAGCAAGGCCGCTCTCACTGCAAAGTTAACTCTGATGCGTGTGTAACACG
## ACATCCTCCTCCCAGTCGCCCCTGTAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCA
## CAGGGCCATACTCCACGTGCAGAGCAGCCTCAGCACTCACCGGGCACGAGCGAGCCTGTG
## TGGTGCGCAGGGATGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAGGGCAGGAGGA
## GGGTGTGGGATGGTGGAGGGGTTTGAGAAGGCAGAGGCGCGACTGGGGTTCATGAGGAAA
## GGGAGGGGGAGGATGTGGGATGGTGGAGGGGCTGCAGACTCTGGGCTAGGGAAAGCTGGG
## ATGTCTCTAAAGGTTGGAATGAATGGCCTAGAATCCGACCCAATAAGCCAAAGCCACTTC
awk 'FNR>=970000 && FNR<=970020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
## TAGAAGTATTAACTTATTTTGAGGGCTTAAAAAGGCTAGAAGTACTGATGTCCTTTTCCT
## GAGTCCTGAAGTCATTCTAGCCATCAACCTCTGGAGAAATGCTGCTGGGGCCATTTTACC
## ATGGGACCAGAAATACAAGTCCCTGACATGGGCTTGGCTGAGAAGAAGCAAGTGGGGTGC
## AAACTATGTGTGCTTTCATGTTGCAAAGAAGCTGTGTTGAATCAACAAATATTACTTGAG
## CACTTGCCAGGATTCCAGGTACTGTTCCAGGGCTGGATCACAGTGATGAGTGGGGCAGGT
## CTTACACCTGCTTTCATTGGGCCTACATCCATTCCATGGCTGATTGGATCAACAAAAATG
## AAGTTGCCCATTCAGTACCTCAATTGCACTGGGCACCTTTCAACTGCTCATTGGCCCTGC
## AGCTAATGCTACTATGTTGGACAGTACAGATCTAGAACATCTCATTTATAGAAAGTTCCA
## TTGGACAGCACTGCTCTAGATAGGGTGATAGACATGATAAATTAATGAATCAGATATTTG
## GGATGGTAATGAGTGTTGGAAAGAGAAGGAAACAATGGGATACATTGGTGGGGGACAAGC
## GAGGTACGGGTGGGCAGCAGTAGTTCAGAACTAAGTAGGAAGATGGTTCTGAGGAGGTGA
## CGTTTGAGCTCAGACCTGAATGGGAAAGGCGAACCAGGTGAAGATGTTGGGGAAGGTGAA
## CCAGGCGAAGATGTTGGGGAAGACTATACCAGGCAGGAGGGTGGTGCACATGTGGATCTC
## TGGATGAAAGGCCATATTCTAGGAAAACTAAGATGGTGGCTGTGGGAGACCCCTTTAGCG
## TCAAGGACATGAAATAACAGGGTTTGGAACCCGGCTCCTTTGTTGCCCACTATGTTAGGC
## CGTTCTTAAATTGCTGTAAATACCTGAGACTGGGTTATTTATGAAGATAAGAGATTTAAT
## TGGCTCACAGTTCTGCAGGCTGTACAAGGGTGGCACTGGCATCTGGTTGGCTTCTGGGGA
## GGCCTCAGGGAGCTTTTTACTCATGGTGGAAGGCAGGGAGCAGGCACATCATGGGAAAAG
## CAGCAGCCAGAGAGAGGAAGAAGGTGCCGCATACTGTTGTTTGTTTGTTTTGTTTTTTTT
## TTTGGAGACACGGTCTCACTCTCACCTAGGCGGTAGTACAGTGGTGCGATCATGACCCAC
## TGCAGCCTCGACTTCCTGAGTTCAAGCGATTCTCCCACCTCAGCTTCCTGAGTAGCCAGG
Aha! All looks good. Let’s now load the sequence into R’s memory and take a closer look at it. We will skip the header and manipulate the original object so that individual lines are joined together and then split by character.
fasta <- scan('Homo_sapiens.GRCh38.dna.chromosome.19.fa',
what = 'character', #The type of file to be read
skip = 1) #Skip the header
fasta <- paste0(fasta) #Add the individual lines together
fasta <- strsplit(fasta, split = '') #Split the long string into characters
fasta <- unlist(fasta) #Convert the object from a list back to a string
fasta[1:10]
## [1] "N" "N" "N" "N" "N" "N" "N" "N" "N" "N"
Let’s see how many of each type of nucleotide is present in the sequence.
nucleotides <- table(fasta)
nucleotides
## fasta
## A C G N T
## 15142293 13954580 14061132 176858 15282753
Let’s create a barplot to visualise the data.
barplot(nucleotides)
OK, so there are about 200,000 N nucleotides, which form the minority of all the other bases. It would be interesting to know the telomere lengths. Let’s try to write a while-loop to measure the length of the first telomere.
i <- 1 #Initialise the iterator value
n <- 0 #Initialise N counter
while (fasta[i] == 'N') {
i <- i + 1 #Add 1 each time this condition is true
n <- n + 1
}
n
## [1] 60000
It looks like the telomere is 60kb long. The actual telomeres are composed of TTAGGG repeats and the number of ambiguous ‘N’ bases at chromosome ends is somewhat artificial, but it helps with bioinformatics applications, such as annotation and alignment. Let’s print the fragment 2 bases upstream and donstream of the last N to make sure our calculation is correct.
fasta[ (n-2) : (n+2) ]
## [1] "N" "N" "N" "G" "A"
Bingo, the base 6000 is the last N character.
Let’s now take a look at the other, coding bases, and check if AT and
GC are in roughtly 50:50 proportion. We will convert the table object to
a data frame in order to manipulate it. We will also transpose the table
using t()
function, so that the rows become columns - this
will help in analysis.
nucleotides <- data.frame(nucleotides, row.names = rownames(nucleotides)) #Convert to a data frame object
colnames(nucleotides) <- c('nucleotide', 'count') #Add meaningful column names
nucleotides$nucleotide <- NULL #Delete unused column
nucleotides <- as.data.frame(t(nucleotides)) #Transpose and retain data frame structure (which gets simplified by default if there is only one column)
GC <- (nucleotides$G + nucleotides$C) / (nucleotides$A + nucleotides$T + nucleotides$G + nucleotides$C)
AT <- 1 - GC
GC
## [1] 0.4793865
AT
## [1] 0.5206135
Indeed, the ratio is approximately 50:50. Let’s add some code to present our data in a nicer way and make a barlot to show the relative percentages.
print( paste0('GC %: ', round(GC, digits = 2)))
## [1] "GC %: 0.48"
print( paste0('AT %: ', round(AT, digits = 2)))
## [1] "AT %: 0.52"
barplot(c(GC, AT), names.arg = c('GC %', 'AT %'))
The percentage of GC pairs is referred to as GC content and is an important QC diagnostic of sequencing runs. GC content differs between parts of genome as well as between the organisms. Regions with epigenetic control exhibit presence of CpG islands, GC-rich regions with methylated cytosine bases.
Now that we had a look at the bases in chromosome 19, let’s try to
get a deeper understanding of the genomic elements they encode. Genome
annotation files are separate from sequence files and can be stored in
GFF3 or GTF format. The data we will use is stored as a text file
‘Homo_sapiens.GRCh38.107.chromosome.19.gff3’. GRCh38 refers to the name
of the assembly - this is the latest ‘major release’, with 107 referring
to the version of minor updates ‘minor release’. Let’s again use the
bash command head
to inspect the first 20 lines of the GFF
file.
head -n 20 'Homo_sapiens.GRCh38.107.chromosome.19.gff3'
## ##gff-version 3
## ##sequence-region 19 1 58617616
## #!genome-build Genome Reference Consortium GRCh38.p13
## #!genome-version GRCh38
## #!genome-date 2013-12
## #!genome-build-accession GCA_000001405.28
## #!genebuild-last-updated 2022-04
## 19 GRCh38 chromosome 1 58617616 . . . ID=chromosome:19;Alias=CM000681.2,chr19,NC_000019.10
## ###
## 19 havana pseudogene 60951 71626 . - . ID=gene:ENSG00000282458;Name=WASH5P;biotype=transcribed_processed_pseudogene;description=WASP family homolog 5%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:33884];gene_id=ENSG00000282458;logic_name=havana_homo_sapiens;version=1
## 19 havana lnc_RNA 60951 70976 . - . ID=transcript:ENST00000632506;Parent=gene:ENSG00000282458;Name=WASH5P-206;biotype=processed_transcript;tag=basic;transcript_id=ENST00000632506;transcript_support_level=1;version=1
## 19 havana exon 60951 61894 . - . Parent=transcript:ENST00000632506;Name=ENSE00003783010;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783010;rank=3;version=1
## 19 havana exon 66346 66499 . - . Parent=transcript:ENST00000632506;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 19 havana exon 70928 70976 . - . Parent=transcript:ENST00000632506;Name=ENSE00003781173;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781173;rank=1;version=1
## 19 havana lnc_RNA 62113 66524 . - . ID=transcript:ENST00000633719;Parent=gene:ENSG00000282458;Name=WASH5P-208;biotype=retained_intron;transcript_id=ENST00000633719;transcript_support_level=NA;version=1
## 19 havana exon 62113 66524 . - . Parent=transcript:ENST00000633719;Name=ENSE00003783013;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783013;rank=1;version=1
## 19 havana lnc_RNA 63821 70951 . - . ID=transcript:ENST00000633703;Parent=gene:ENSG00000282458;Name=WASH5P-207;biotype=processed_transcript;transcript_id=ENST00000633703;transcript_support_level=5;version=1
## 19 havana exon 63821 64213 . - . Parent=transcript:ENST00000633703;Name=ENSE00003781018;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781018;rank=3;version=1
## 19 havana exon 66346 66499 . - . Parent=transcript:ENST00000633703;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 19 havana exon 70928 70951 . - . Parent=transcript:ENST00000633703;Name=ENSE00003782721;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003782721;rank=1;version=1
As you can see, there are a few header lines that start with a ‘#’ specifying some details of the file and the genome build. Next, there is some tab-delimited text split into 9 columns. These are (based on Ensembl website):
Coulmn | Description |
---|---|
Sequence name (seqname ) |
Name of the chromosome or scaffold. Watch out for differences between annotation consortia. Chromosome 1 could be denoted as ‘chr1’ or just ‘1’. |
Source (source ) |
Annotation source. Includes ‘ensembl’ - automatic annotation program and ‘havana’ - manual annotation by HAVANA group. |
Feature type (feature or type ) |
Region type. Can be coding sequence (CDS), exon (exon), 3’ UTR (three_prime_UTR), 5’ UTR (five_prime_UTR), and more, depending on the annotation source. |
Start (start ) |
Genomic coordinate where the annotated sequence starts. |
End (end ) |
Genomic coordinate where the annotated sequence ends. |
Score (score ) |
A numeric value normally referring to the confidence score of the annotated sequence. |
Strand (strand ) |
Direction of the sequence in double-stranded DNA. Can be either ‘+’ (forward/sense strand) or ‘-’ (reverse/antisense strand). |
ORF (frame ) |
Refers to the start of the open reading frame (ORF). Can be ‘0’, ‘1’, or ‘2’ - ‘0’ indicates that the first base of the feature is the first base of a codon. |
Attributes (attribute ) |
A list of name-value pairs separated by semicolons, in the format ‘name=value’. Attributes differ in type and number based on the feature type and the annotation program. |
A dot in any column ‘.’ indicates no value.
Let’s try base R function read.delim()
, which reads any
delimited file (tab, comma, etc.), to read the GFF into memory and
examine it. As the GFF does not include column names, we will specify
header = F
and tell R to ignore all the lines that start
with ‘#’ using the argument comment.char
- these happen to
be default parameters of function read.delim()
, but it is a
good practice to be explicit about the function arguments in your code.
We will give the columns meaningful names.
gff <- read.delim('Homo_sapiens.GRCh38.107.chromosome.19.gff3',
comment.char = '#',
header = FALSE,
col.names = c('seqname', 'source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'attribute'))
head(gff)
## seqname source type start end score strand frame
## 1 19 GRCh38 chromosome 1 58617616 . . .
## 2 19 havana pseudogene 60951 71626 . - .
## 3 19 havana lnc_RNA 60951 70976 . - .
## 4 19 havana exon 60951 61894 . - .
## 5 19 havana exon 66346 66499 . - .
## 6 19 havana exon 70928 70976 . - .
## attribute
## 1 ID=chromosome:19;Alias=CM000681.2,chr19,NC_000019.10
## 2 ID=gene:ENSG00000282458;Name=WASH5P;biotype=transcribed_processed_pseudogene;description=WASP family homolog 5%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:33884];gene_id=ENSG00000282458;logic_name=havana_homo_sapiens;version=1
## 3 ID=transcript:ENST00000632506;Parent=gene:ENSG00000282458;Name=WASH5P-206;biotype=processed_transcript;tag=basic;transcript_id=ENST00000632506;transcript_support_level=1;version=1
## 4 Parent=transcript:ENST00000632506;Name=ENSE00003783010;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783010;rank=3;version=1
## 5 Parent=transcript:ENST00000632506;Name=ENSE00003783498;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003783498;rank=2;version=1
## 6 Parent=transcript:ENST00000632506;Name=ENSE00003781173;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003781173;rank=1;version=1
As you can see, the first 8 columns are readily readable, but the
column ‘attribute’ poses a challenge. We need the information contained
in it to understand which gene the sequences relate to (most genes will
have multiple rows associated with them)! We could write a function that
splits the ‘attribute’ sting based on the ‘;’ character, then split the
resulting strings based on ‘=’ character and separate them into columns.
If you are a more advanced R user (including regex) and feel like a
challenge, you can try coding this up - Google is your friend! For now,
we will use an existing package rtracklayer
to read the GFF
into R’s memory and convert it to a familiar data frame format.
gff <- readGFF('Homo_sapiens.GRCh38.107.chromosome.19.gff3')
gff <- as.data.frame(gff)
head(gff)
## seqid source type start end score strand phase
## 1 19 GRCh38 chromosome 1 58617616 NA * NA
## 2 19 havana pseudogene 60951 71626 NA - NA
## 3 19 havana lnc_RNA 60951 70976 NA - NA
## 4 19 havana exon 60951 61894 NA - NA
## 5 19 havana exon 66346 66499 NA - NA
## 6 19 havana exon 70928 70976 NA - NA
## ID Alias Name
## 1 chromosome:19 CM000681.... <NA>
## 2 gene:ENSG00000282458 WASH5P
## 3 transcript:ENST00000632506 WASH5P-206
## 4 <NA> ENSE00003783010
## 5 <NA> ENSE00003783498
## 6 <NA> ENSE00003781173
## biotype
## 1 <NA>
## 2 transcribed_processed_pseudogene
## 3 processed_transcript
## 4 <NA>
## 5 <NA>
## 6 <NA>
## description
## 1 <NA>
## 2 WASP family homolog 5, pseudogene [Source:HGNC Symbol;Acc:HGNC:33884]
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## gene_id logic_name version Parent tag
## 1 <NA> <NA> <NA> <NA>
## 2 ENSG00000282458 havana_homo_sapiens 1 <NA>
## 3 <NA> <NA> 1 gene:ENS.... basic
## 4 <NA> <NA> 1 transcri.... <NA>
## 5 <NA> <NA> 1 transcri.... <NA>
## 6 <NA> <NA> 1 transcri.... <NA>
## transcript_id transcript_support_level constitutive ensembl_end_phase
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 ENST00000632506 1 <NA> <NA>
## 4 <NA> <NA> 0 -1
## 5 <NA> <NA> 0 -1
## 6 <NA> <NA> 0 -1
## ensembl_phase exon_id rank external_name ccdsid protein_id
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 -1 ENSE00003783010 3 <NA> <NA> <NA>
## 5 -1 ENSE00003783498 2 <NA> <NA> <NA>
## 6 -1 ENSE00003781173 1 <NA> <NA> <NA>
OK, this is much better! Every attribute now sits in its own column, and there is an ‘NA’ value if the information for that row is missing. Let’s save the length of chromosome 19 (first row of the data frame) which will become useful later and subset the GFF file so that it contains automatic annotations generated by Ensembl merged with manual annotations by HAVANA group. Read more about those on the Ensembl website.
Recall that subsetting in R is possible using square brackets
[ ]
or the function subset()
.
chr19 <- gff[1, ]
gff <- subset(gff, source == 'ensembl_havana')
To calculate the number of genes, we will use the column ‘type’ and select ‘gene’. The gff$type == ‘gene’ will return a boolean (TRUE/FALSE) value for every row in the dataframe. To count the number of ‘TRUE’ values, we will use the property that TRUE is equal to 1 and FALSE is equal to 0. Therefore, sum of TRUE occurences is the number of rows that correspond to gene annotation in our GFF file
head(gff$type == 'gene') #This returns a long boolean
## [1] TRUE TRUE FALSE FALSE FALSE FALSE
sum(gff$type == 'gene') #This counts the number of 'TRUE' occurences
## [1] 1343
Let/s take a closer look at the genes on chromosome 19. To do this, we will subset the original data frame to only include gene annotations. We can see that the number of rows in this new data frame is equal to the number we calculated before.
genes <- gff[gff$type == 'gene', ] #Subset the data frame
nrow(genes) #Number of genes equal to what we calculated before
## [1] 1343
It would be interesting to see what proportion of genes lie on each
strand of the DNA, denoted as + (sense) and - (antisense) in the gff
file. We can produce count summaries using the fuction
table()
.
table(genes$strand)
##
## - +
## 665 678
We can see that the proportion of genes on either strand is approximately equal.
What about genome coverage? How much of the genome is actually used to encode information about genes? Let’s calculate the proportion of DNA occupied by coding sequences in relation to the chromosome length.
cds <- gff[gff$type == 'CDS', ]
cds$length <- cds$end - cds$start
sum(cds$length) / (chr19$end - chr19$start)
## [1] 0.04460002
Can you carry out a similar calculation to see how much of the genome is covered by exons?
regions <- gff[gff$type%in% c('CDS', 'three_prime_UTR', 'five_prime_UTR'), ]
regions$length <- regions$end - regions$start
region_lengths <- data.frame() #Create a data frame object to store the results
for (type in c('CDS', 'three_prime_UTR', 'five_prime_UTR')){
region_lengths[type, 'length'] <- sum(regions[regions$type == type, 'length']) #Calculate sum of lengths for each region
}
barplot(region_lengths$length, names.arg = rownames(region_lengths))
Can you hypothesise why the 3’ UTR sequences are, on average, longer than 5’?
We can see that, as suspected, coding sequences are the longest fragments of coding DNA. But we also saw that the coding DNA forms only a small fraction of the genome. To visualise this, let’s now add the length of the entire chromosome 19 for scale!
region_lengths['chromosome', 'length'] <- chr19$end - chr19$start
barplot(region_lengths$length, names.arg = rownames(region_lengths))
Let’s look at some interesting genes on chromosome 19. We will look at a class of enzymes called fucosyltarnsferases, which are involved in chemical modification of glycans displayed on the cell surface. There are 13 of them in the entire genome, let’s see how many on chromosome 13 - their symbols all start with ‘FUT’.
FUTs <- grep(pattern = 'FUT', gff$Name, value = TRUE) #Find all the gene names that contain 'FUT'
FUTs
## [1] "FUT6" "FUT6-202" "FUT6-201" "FUT3" "FUT3-201" "FUT3-202"
## [7] "FUT5" "FUT2" "FUT2-202" "FUT1" "FUT1-206"
This is interewsting - there is some gene symbols as well as gene symbols with a dash and a number. Let’s take a closer look at their annotations.
subset(gff, Name %in% FUTs)
## seqid source type start end score strand phase
## 23031 19 ensembl_havana gene 5830408 5839722 NA - NA
## 23032 19 ensembl_havana mRNA 5830408 5839702 NA - NA
## 23061 19 ensembl_havana mRNA 5830774 5839717 NA - NA
## 23117 19 ensembl_havana gene 5842888 5851471 NA - NA
## 23118 19 ensembl_havana mRNA 5842888 5851449 NA - NA
## 23136 19 ensembl_havana mRNA 5842890 5851449 NA - NA
## 23198 19 ensembl_havana gene 5865826 5870540 NA - NA
## 147878 19 ensembl_havana gene 48695971 48705951 NA + NA
## 147886 19 ensembl_havana mRNA 48695971 48705951 NA + NA
## 148150 19 ensembl_havana gene 48748011 48755390 NA - NA
## 148151 19 ensembl_havana mRNA 48748011 48752641 NA - NA
## ID Alias Name biotype
## 23031 gene:ENSG00000156413 FUT6 protein_coding
## 23032 transcript:ENST00000318336 FUT6-202 protein_coding
## 23061 transcript:ENST00000286955 FUT6-201 protein_coding
## 23117 gene:ENSG00000171124 FUT3 protein_coding
## 23118 transcript:ENST00000303225 FUT3-201 protein_coding
## 23136 transcript:ENST00000458379 FUT3-202 protein_coding
## 23198 gene:ENSG00000130383 FUT5 protein_coding
## 147878 gene:ENSG00000176920 FUT2 protein_coding
## 147886 transcript:ENST00000425340 FUT2-202 protein_coding
## 148150 gene:ENSG00000174951 FUT1 protein_coding
## 148151 transcript:ENST00000645652 FUT1-206 protein_coding
## description
## 23031 fucosyltransferase 6 [Source:HGNC Symbol;Acc:HGNC:4017]
## 23032 <NA>
## 23061 <NA>
## 23117 fucosyltransferase 3 (Lewis blood group) [Source:HGNC Symbol;Acc:HGNC:4014]
## 23118 <NA>
## 23136 <NA>
## 23198 fucosyltransferase 5 [Source:HGNC Symbol;Acc:HGNC:4016]
## 147878 fucosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:4013]
## 147886 <NA>
## 148150 fucosyltransferase 1 (H blood group) [Source:HGNC Symbol;Acc:HGNC:4012]
## 148151 <NA>
## gene_id logic_name version Parent
## 23031 ENSG00000156413 ensembl_havana_gene_homo_sapiens 15
## 23032 <NA> <NA> 10 gene:ENS....
## 23061 <NA> <NA> 5 gene:ENS....
## 23117 ENSG00000171124 ensembl_havana_gene_homo_sapiens 14
## 23118 <NA> <NA> 12 gene:ENS....
## 23136 <NA> <NA> 7 gene:ENS....
## 23198 ENSG00000130383 ensembl_havana_gene_homo_sapiens 7
## 147878 ENSG00000176920 ensembl_havana_gene_homo_sapiens 13
## 147886 <NA> <NA> 3 gene:ENS....
## 148150 ENSG00000174951 ensembl_havana_gene_homo_sapiens 12
## 148151 <NA> <NA> 2 gene:ENS....
## tag transcript_id transcript_support_level constitutive
## 23031 <NA> <NA> <NA> <NA>
## 23032 basic ENST00000318336 2 (assigned to previous version 8) <NA>
## 23061 basic ENST00000286955 1 <NA>
## 23117 <NA> <NA> <NA> <NA>
## 23118 basic ENST00000303225 1 (assigned to previous version 10) <NA>
## 23136 basic ENST00000458379 1 (assigned to previous version 6) <NA>
## 23198 <NA> <NA> <NA> <NA>
## 147878 <NA> <NA> <NA> <NA>
## 147886 basic ENST00000425340 1 (assigned to previous version 2) <NA>
## 148150 <NA> <NA> <NA> <NA>
## 148151 basic ENST00000645652 <NA> <NA>
## ensembl_end_phase ensembl_phase exon_id rank external_name ccdsid
## 23031 <NA> <NA> <NA> <NA> <NA> <NA>
## 23032 <NA> <NA> <NA> <NA> <NA> CCDS12152.1
## 23061 <NA> <NA> <NA> <NA> <NA> CCDS12152.1
## 23117 <NA> <NA> <NA> <NA> <NA> <NA>
## 23118 <NA> <NA> <NA> <NA> <NA> CCDS12153.1
## 23136 <NA> <NA> <NA> <NA> <NA> CCDS12153.1
## 23198 <NA> <NA> <NA> <NA> <NA> <NA>
## 147878 <NA> <NA> <NA> <NA> <NA> <NA>
## 147886 <NA> <NA> <NA> <NA> <NA> CCDS33069.1
## 148150 <NA> <NA> <NA> <NA> <NA> <NA>
## 148151 <NA> <NA> <NA> <NA> <NA> CCDS12733.1
## protein_id
## 23031 <NA>
## 23032 <NA>
## 23061 <NA>
## 23117 <NA>
## 23118 <NA>
## 23136 <NA>
## 23198 <NA>
## 147878 <NA>
## 147886 <NA>
## 148150 <NA>
## 148151 <NA>
Bingo! Some annotations are for the genes and the others correspond to mRNA transcripts (those with the dashes). We are only interested in the genes, so let’s modify our search to reflect that.
FUTs <- grep('\\-', FUTs, value = TRUE, invert = TRUE) #\\ is an escape character, which tells grep to look for an actual dash, not a dash symbol with a special meaning.
#Invert tells grep to exclude genes containing a dash.
FUTs_df <- subset(gff, Name %in% FUTs)
FUTs_df
## seqid source type start end score strand phase
## 23031 19 ensembl_havana gene 5830408 5839722 NA - NA
## 23117 19 ensembl_havana gene 5842888 5851471 NA - NA
## 23198 19 ensembl_havana gene 5865826 5870540 NA - NA
## 147878 19 ensembl_havana gene 48695971 48705951 NA + NA
## 148150 19 ensembl_havana gene 48748011 48755390 NA - NA
## ID Alias Name biotype
## 23031 gene:ENSG00000156413 FUT6 protein_coding
## 23117 gene:ENSG00000171124 FUT3 protein_coding
## 23198 gene:ENSG00000130383 FUT5 protein_coding
## 147878 gene:ENSG00000176920 FUT2 protein_coding
## 148150 gene:ENSG00000174951 FUT1 protein_coding
## description
## 23031 fucosyltransferase 6 [Source:HGNC Symbol;Acc:HGNC:4017]
## 23117 fucosyltransferase 3 (Lewis blood group) [Source:HGNC Symbol;Acc:HGNC:4014]
## 23198 fucosyltransferase 5 [Source:HGNC Symbol;Acc:HGNC:4016]
## 147878 fucosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:4013]
## 148150 fucosyltransferase 1 (H blood group) [Source:HGNC Symbol;Acc:HGNC:4012]
## gene_id logic_name version Parent tag
## 23031 ENSG00000156413 ensembl_havana_gene_homo_sapiens 15 <NA>
## 23117 ENSG00000171124 ensembl_havana_gene_homo_sapiens 14 <NA>
## 23198 ENSG00000130383 ensembl_havana_gene_homo_sapiens 7 <NA>
## 147878 ENSG00000176920 ensembl_havana_gene_homo_sapiens 13 <NA>
## 148150 ENSG00000174951 ensembl_havana_gene_homo_sapiens 12 <NA>
## transcript_id transcript_support_level constitutive ensembl_end_phase
## 23031 <NA> <NA> <NA> <NA>
## 23117 <NA> <NA> <NA> <NA>
## 23198 <NA> <NA> <NA> <NA>
## 147878 <NA> <NA> <NA> <NA>
## 148150 <NA> <NA> <NA> <NA>
## ensembl_phase exon_id rank external_name ccdsid protein_id
## 23031 <NA> <NA> <NA> <NA> <NA> <NA>
## 23117 <NA> <NA> <NA> <NA> <NA> <NA>
## 23198 <NA> <NA> <NA> <NA> <NA> <NA>
## 147878 <NA> <NA> <NA> <NA> <NA> <NA>
## 148150 <NA> <NA> <NA> <NA> <NA> <NA>
Let’s visualise the relative placement of all the FUT genes using a genome browser. We will use GenomicRanges object, which is a compact way of storing sequence information. We will then use Gviz package to create ‘tracks’ (in our case, the reference track will be empty, but it can contain information about the genome).
FUTs_gr <- with(FUTs_df, GRanges(seqid, IRanges(start, end), strand, id = Name))
chr_track <- IdeogramTrack(genome = 'hg38', chromosome = 'chr19') #This creates a chromosome 'ideogram', or a graphic, with cytological bands, based on corresponding data from UCSC (a repository like Ensembl)
ref_track <- GenomeAxisTrack() #This creates an empty track where our genes will be displayed
data_track <- AnnotationTrack(FUTs_gr, name = "Genes", showFeatureId = TRUE) #This creates the annotations to display
When inspecting the data frame FUTs_df
, above, you can
see that there are two gene clusters of fucosyltransferase genes, one in
the 5.8Mb (megabase, 1e6 bases) and one in the 48.7Mb region. We will
plot these two regions separately as they are far apart.
#Cluster 1
plotTracks(c(chr_track, ref_track, data_track),
from = 5.82e6, to = 5.88e6, sizes = c(1,1,2))
#Cluster 2
plotTracks(c(chr_track, ref_track, data_track),
from = 48.68e6, to = 48.76e6, sizes = c(1,1,2))
Feel free to change the from
and to
values
to zoom in and out of the loci. You can also produce some code that will
automatically establish the optimal region to show using functions
min()
, max()
, and the arguments
extend.right
/extend.left
of the function
plotTracks
. The arrows indicate transcript directionality
(remember ‘+’ and ‘-’ strands from our GFF? they correspond to right-
and left-pointing arrow, respectively).
Finally, let’s take a closer look at the variant annotation file.
This file summarises genetic variation, and is often produced as a
result of GWAS studies or population analysis - see 1000 Genomes
Project. The data is stored as a text file ‘homo_sapiens-chr19.vcf’
downloaded from Ensembl includes variants from multiple source
databases. Let’s inspect the file using head
.
head -n 30 'homo_sapiens-chr19.vcf'
## ##fileformat=VCFv4.1
## ##fileDate=20220607
## ##source=ensembl;version=107;url=https://e107.ensembl.org/homo_sapiens
## ##reference=ftp://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/
## ##INFO=<ID=ClinVar_202201,Number=0,Type=Flag,Description="Variants of clinical significance imported from ClinVar">
## ##INFO=<ID=dbSNP_154,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP">
## ##INFO=<ID=HGMD-PUBLIC_20204,Number=0,Type=Flag,Description="Variants from HGMD-PUBLIC dataset December 2020">
## ##INFO=<ID=COSMIC_95,Number=0,Type=Flag,Description="Somatic mutations found in human cancers from the COSMIC catalogue">
## ##INFO=<ID=TSA,Number=1,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
## ##INFO=<ID=E_Cited,Number=0,Type=Flag,Description="Cited.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Multiple_observations,Number=0,Type=Flag,Description="Multiple_observations.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_TOPMed,Number=0,Type=Flag,Description="TOPMed.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Hapmap,Number=0,Type=Flag,Description="HapMap.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_Phenotype_or_Disease,Number=0,Type=Flag,Description="Phenotype_or_Disease.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="ESP.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_gnomAD,Number=0,Type=Flag,Description="gnomAD.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=E_ExAC,Number=0,Type=Flag,Description="ExAC.https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status">
## ##INFO=<ID=CLIN_risk_factor,Number=0,Type=Flag,Description="risk factor.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_protective,Number=0,Type=Flag,Description="protective.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_confers_sensitivity,Number=0,Type=Flag,Description="confers sensitivity.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_other,Number=0,Type=Flag,Description="other.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_drug_response,Number=0,Type=Flag,Description="drug response.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_uncertain_significance,Number=0,Type=Flag,Description="uncertain significance.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_benign,Number=0,Type=Flag,Description="benign.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_likely_pathogenic,Number=0,Type=Flag,Description="likely pathogenic.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_pathogenic,Number=0,Type=Flag,Description="pathogenic.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_likely_benign,Number=0,Type=Flag,Description="likely benign.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
## ##INFO=<ID=CLIN_histocompatibility,Number=0,Type=Flag,Description="histocompatibility.https://www.ensembl.org/info/genome/variation/phenotype/phenotype_annotation.html#clin_significance">
So far we can only see the header file, which includes decriptions of some variables, such as ID=dbSNP_154 and that it corresponds to variants sourced from dbSNP. Let’s temporarily remove the lengthy header to see the remaining content of the file.
grep -v '##' 'homo_sapiens-chr19.vcf' | head -n 20
## #CHROM POS ID REF ALT QUAL FILTER INFO
## 19 60062 rs1555674440 G C . . dbSNP_154;TSA=SNV
## 19 60165 rs1415141782 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60173 rs1371922052 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60184 rs1391618909 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60223 rs1187548881 A G . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19 60251 rs1310995734 G A . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60281 rs1380383890 A G . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60319 rs1244952011 C T . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60326 rs1292465550 A G . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19 60351 rs1323466133 C T . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19 60354 rs1202298947 T A . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19 60360 rs111660247 C G . . dbSNP_154;TSA=SNV
## 19 60376 rs1263131587 C T . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19 60377 rs1249701311 G A . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19 60384 rs1480309998 G A . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed;E_gnomAD
## 19 60398 rs1599941348 G A . . dbSNP_154;TSA=SNV;E_Freq
## 19 60410 rs1178059410 C T . . dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 19 60414 rs1293885345 G C . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 19 60419 rs1214163233 G A . . dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
The VCF file, much like GFF file, includes some obligatory columns, as well as a list of attributes in the last column. The required columns are:
Column | Description |
---|---|
CHROM |
Name of the chromosome or sequence |
POS |
Position of the described variant |
ID |
Name of the variant from a database of known variants. dbSNP entries all start with ‘rs’ |
REF |
Reference genome base(s) |
ALT |
Actual base(s) detected |
QUAL |
Quality score associated with the variant |
FILTER |
Either ‘PASS’ or ‘FAIL’ value - if the variant passed quality filtering |
INFO |
A list of key=value pairs similar to that in GFF file |
A dot in any column ‘.’ indicates no value.
We will first use base R function to read the VCF file excluding the header, and we’ll give the columns meaningful names according to the table above
df <- read.delim('homo_sapiens-chr19.vcf',
comment.char = '#',
header = FALSE,
col.names = c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'))
head(df)
## CHROM POS ID REF ALT QUAL FILTER
## 1 19 60062 rs1555674440 G C . .
## 2 19 60165 rs1415141782 G A . .
## 3 19 60173 rs1371922052 G A . .
## 4 19 60184 rs1391618909 G A . .
## 5 19 60223 rs1187548881 A G . .
## 6 19 60251 rs1310995734 G A . .
## INFO
## 1 dbSNP_154;TSA=SNV
## 2 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 3 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 4 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
## 5 dbSNP_154;TSA=SNV;E_Freq;E_TOPMed
## 6 dbSNP_154;TSA=SNV;E_Freq;E_gnomAD
As we stumble upon a similar problem to that we had with GFF file -
non-parsed data in INFO column - we will use the package
VariationAnnotation
to read in the VCF again and extract
the data in INFO in a usable way.
annot <- readVcf('homo_sapiens-chr19.vcf')
head(annot)
## class: CollapsedVCF
## dim: 6 0
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 32 columns: ClinVar_202201, dbSNP_154, HGMD-PUBLIC_20204, C...
## info(header(vcf)):
## Number Type Description
## ClinVar_202201 0 Flag Variants of clinical significa...
## dbSNP_154 0 Flag Variants (including SNPs and i...
## HGMD-PUBLIC_20204 0 Flag Variants from HGMD-PUBLIC data...
## COSMIC_95 0 Flag Somatic mutations found in hum...
## TSA 1 String Type of sequence alteration. C...
## E_Cited 0 Flag Cited.https://www.ensembl.org/...
## E_Multiple_observations 0 Flag Multiple_observations.https://...
## E_Freq 0 Flag Frequency.https://www.ensembl....
## E_TOPMed 0 Flag TOPMed.https://www.ensembl.org...
## E_Hapmap 0 Flag HapMap.https://www.ensembl.org...
## E_Phenotype_or_Disease 0 Flag Phenotype_or_Disease.https://w...
## E_ESP 0 Flag ESP.https://www.ensembl.org/in...
## E_gnomAD 0 Flag gnomAD.https://www.ensembl.org...
## E_1000G 0 Flag 1000Genomes.https://www.ensemb...
## E_ExAC 0 Flag ExAC.https://www.ensembl.org/i...
## CLIN_risk_factor 0 Flag risk factor.https://www.ensemb...
## CLIN_protective 0 Flag protective.https://www.ensembl...
## CLIN_confers_sensitivity 0 Flag confers sensitivity.https://ww...
## CLIN_other 0 Flag other.https://www.ensembl.org/...
## CLIN_drug_response 0 Flag drug response.https://www.ense...
## CLIN_uncertain_significance 0 Flag uncertain significance.https:/...
## CLIN_benign 0 Flag benign.https://www.ensembl.org...
## CLIN_likely_pathogenic 0 Flag likely pathogenic.https://www....
## CLIN_pathogenic 0 Flag pathogenic.https://www.ensembl...
## CLIN_likely_benign 0 Flag likely benign.https://www.ense...
## CLIN_histocompatibility 0 Flag histocompatibility.https://www...
## CLIN_not_provided 0 Flag not provided.https://www.ensem...
## CLIN_association 0 Flag association.https://www.ensemb...
## MA 1 String Minor Allele
## MAF 1 Float Minor Allele Frequency
## MAC 1 Integer Minor Alelele Count
## AA 1 String Ancestral Allele
## geno(vcf):
## List of length 0:
Let’s combine the two portions of the VCF file and delete the original INFO column.
vcf <- cbind(df, annot@info)
vcf$INFO <- NULL
rm(df, annot)
Let’s examine the variants present in the gene FUT2. To do that, we save the start and end location of the gene we extracted earlier from the GFF file, then use that data to subset our large VCF file. Next, let’s tabulate the data to see how many variants in FUT2 have a known clinical association, based on data from ClinVar database.
FUT2_start <- FUTs_df[FUTs_df$Name == 'FUT2', 'start']
FUT2_end <- FUTs_df[FUTs_df$Name == 'FUT2', 'end']
FUTs_vcf <- subset(vcf, POS >= FUT2_start & POS <= FUT2_end)
FUTs_vcf <- subset(FUTs_vcf, dbSNP_154 == TRUE & E_Phenotype_or_Disease == TRUE)
table(FUTs_vcf$CLIN_association)
##
## FALSE TRUE
## 30 1
Of the selected variants, only one has phenotype association. Let’s check which one that is:
FUTs_vcf[FUTs_vcf$CLIN_association == TRUE, 'ID']
## [1] "rs601338"
Read about the variant and what it causes and prepare a short summary of your findings. Methods section of this paper is a good starting place.
Next, we will use more advanced tools to combine FASTA, GFF, and VCF information and predict the effect of the SNP rs601338 on translation. In this practical, we are using a small fragment of a single human genome, but in common bioinformatic applications, we would use 10s or 100s of sequences, with files getting really big and more difficult to manipulate. One solution to that problem is to use a more powerful machine, but another is to use file formats that allow efficient computation. Tabix is a type of index file, which creates fast, easy-to-access coordinates system of large files. We will index our VCF file now using Rsamtools.
bgzip('homo_sapiens-chr19.vcf', 'homo_sapiens-chr19.vcf.bgz', overwrite = TRUE)
## [1] "homo_sapiens-chr19.vcf.bgz"
tabix <- indexTabix('homo_sapiens-chr19.vcf.bgz', format = 'vcf')
We will also index the genome file.
indexFa('Homo_sapiens.GRCh38.dna.chromosome.19.fa')
## [1] "Homo_sapiens.GRCh38.dna.chromosome.19.fa.fai"
genome <- FaFile(file = 'Homo_sapiens.GRCh38.dna.chromosome.19.fa')
Finally, we will prepare our GFF file for use by creating a TxDb file, which is an interface object to a database of features stored on your computer.
txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.107.chromosome.19.gff3')
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
We also need an efficient way to work with regions of the genome. As with the genome browser visualisation above, we will define some interesting ranges of the genome which correspond to the FUT genes on chromosome 19. IRanges and Granges objects are efficient means of acessing specified regions from indexed genomic files.
regions <- GRanges(seqnames="19", ranges = IRanges(
start = FUTs_df$start,
end = FUTs_df$end,
names = FUTs_df$Name))
regions
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## FUT6 19 5830408-5839722 *
## FUT3 19 5842888-5851471 *
## FUT5 19 5865826-5870540 *
## FUT2 19 48695971-48705951 *
## FUT1 19 48748011-48755390 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We will now use the VCF tabix map and the FUT ranges defined above to load a section of VCF data corresponding to these genes only.
vcf_short <- readVcf(tabix, genome = "GRCh38.107", param = regions)
Finally, let’s bring all the files we saw today together to predict the effect of variants we saw on the protein sequences which are produced from the DNA via mRNA. We will use the package VariationAnnotation, which will source the portion of FASTA file of interest, find the reading frames and annotations of FUT genes using the GFF3 file, and performe nucleotide changes specified in the VCF file. It will then translate the DNA sequences and specify the effect to mRNA: wheather it’s a STOP codon (a nonsense mutation), a substitution (a missense mutation), or no change at all (a silent mutation).
coding <- predictCoding(vcf_short, subject = txdb, seqSource = genome)
coding
## GRanges object with 9603 ranges and 17 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## rs1182309127 19 5831378-5831394 - | FUT6 TCAGGCAGGTGAAGCTT
## rs1182309127 19 5831378-5831394 - | FUT6 TCAGGCAGGTGAAGCTT
## rs1304301383 19 5831378 - | FUT6 T
## rs1158033586 19 5831382 - | FUT6 G
## rs779049413 19 5831383 - | FUT6 C
## ... ... ... ... . ... ...
## rs1269695902 19 48751277 - | FUT1 C
## rs1380079562 19 48751279 - | FUT1 C
## CM1617022 19 48751281 - | FUT1 T
## rs200607617 19 48751281 - | FUT1 T
## rs200607617 19 48751281 - | FUT1 T
## ALT QUAL FILTER
## <CharacterList> <numeric> <character>
## rs1182309127 T,TCAGGCAGGTGAAGCTTCAG.. NA .
## rs1182309127 T,TCAGGCAGGTGAAGCTTCAG.. NA .
## rs1304301383 C NA .
## rs1158033586 A NA .
## rs779049413 T NA .
## ... ... ... ...
## rs1269695902 T NA .
## rs1380079562 A NA .
## CM1617022 NA .
## rs200607617 C,G NA .
## rs200607617 C,G NA .
## varAllele CDSLOC PROTEINLOC QUERYID
## <DNAStringSet> <IRanges> <IntegerList> <integer>
## rs1182309127 A 1079-1095 360,365 288
## rs1182309127 AAGCTTCACC...ACCTGCCTGA 1079-1095 360,365 288
## rs1304301383 G 1095 365 289
## rs1158033586 T 1091 364 290
## rs779049413 A 1090 364 291
## ... ... ... ... ...
## rs1269695902 A 5 2 10291
## rs1380079562 T 3 1 10292
## CM1617022 1 1 10293
## rs200607617 G 1 1 10294
## rs200607617 C 1 1 10294
## TXID CDSID GENEID CONSEQUENCE
## <character> <IntegerList> <character> <factor>
## rs1182309127 8057 27954 ENSG00000156413 frameshift
## rs1182309127 8057 27954 ENSG00000156413 frameshift
## rs1304301383 8057 27954 ENSG00000156413 nonsynonymous
## rs1158033586 8057 27954 ENSG00000156413 nonsynonymous
## rs779049413 8057 27954 ENSG00000156413 nonsynonymous
## ... ... ... ... ...
## rs1269695902 12777 44420 ENSG00000174951 nonsense
## rs1380079562 12777 44420 ENSG00000174951 nonsynonymous
## CM1617022 12777 44420 ENSG00000174951 not translated
## rs200607617 12777 44420 ENSG00000174951 nonsynonymous
## rs200607617 12777 44420 ENSG00000174951 synonymous
## REFCODON VARCODON REFAA
## <DNAStringSet> <DNAStringSet> <AAStringSet>
## rs1182309127 GAAGCTTCACCTGCCTGA GA
## rs1182309127 GAAGCTTCACCTGCCTGA GAAGCTTCAC...ACCTGCCTGA
## rs1304301383 TGA TGG *
## rs1158033586 GCC GTC A
## rs779049413 GCC ACC A
## ... ... ... ...
## rs1269695902 TGG TAG W
## rs1380079562 ATG ATT M
## CM1617022 ATG TG
## rs200607617 ATG GTG M
## rs200607617 ATG CTG M
## VARAA
## <AAStringSet>
## rs1182309127
## rs1182309127
## rs1304301383 W
## rs1158033586 V
## rs779049413 T
## ... ...
## rs1269695902 *
## rs1380079562 I
## CM1617022
## rs200607617 V
## rs200607617 M
## -------
## seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths
Let’s subset the results object to see what the consequence is of mutations in SNP rs601338.
coding["rs601338", ]
## GRanges object with 1 range and 17 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## rs601338 19 48703417 + | FUT2 G
## ALT QUAL FILTER varAllele CDSLOC
## <CharacterList> <numeric> <character> <DNAStringSet> <IRanges>
## rs601338 A NA . A 461
## PROTEINLOC QUERYID TXID CDSID
## <IntegerList> <integer> <character> <IntegerList>
## rs601338 154 8508 5594 20226,20225,20224
## GENEID CONSEQUENCE REFCODON VARCODON
## <character> <factor> <DNAStringSet> <DNAStringSet>
## rs601338 ENSG00000176920 nonsense TGG TAG
## REFAA VARAA
## <AAStringSet> <AAStringSet>
## rs601338 W *
## -------
## seqinfo: 1 sequence from GRCh38.107 genome; no seqlengths
Aha! The mutation is in amino acid number 154 (PROTEINLOC), and the field COSEQUENCE tells us this is a nonsense mutation, with the usual amino acid REFAA being tryptophan (W) and the one resultiong from a mutation being * (STOP codon). So, it looks like secretor status is determined by the ability or inability to produce full-length FUT2 transcript. Without FUT2, it is impossible to fucosylate the required carbohydrates, and thus blood groups are not present. This makes sense in the context of it being a recessive trait, as a single copy of FUT2 must be sufficient to fucosylate all the required surface glycans.