Working with gene annotations in R
To go further, let's also load the gene annotation file we used earlier. For the purposes of this tutorial, we'll use a version of this file for genes on chromosome 19, that can be downloaed from Ensembl. Start by downloading this now using the command
curl -O https://www.well.ox.ac.uk/bioinformatics/training/msc_gm/2023/data/gencode.v44.annotation.chr19.gff3.gz
Use less
(or rather zless
, since it's still compressed) to take a quick look at the file structure.
Quick recap
We looked at gene annotations files already, but here is a quick recap. Gene annotation files contain information about the genes in a genome and their structure (i.e. their exons, transcripts, coding sequences etc.) They come in a somewhat standard file format known as GFF3 (General Feature Format v3). A similar but older format known as GTF (Gene Transfer Format) is also found sometimes, but we'll use the gff version. See these pages for more information about gff files from Ensembl and NCBI
The main content of the file is some tab-delimited text split into 9 columns, each row corresponding to an annotation entry. These rows start after several 'header' rows with general information about the file; header rows start with a # symbol and are just one column.
Column | Description |
---|---|
Sequence name (seqid ) | Name of the chromosome or scaffold. Watch out for differences between annotation consortia - e.g. chromosome 1 could be denoted as 'chr1' or just '1' |
Source (source ) | Annotation source. Includes 'ensembl' - automatic annotation program and 'havana' - manual annotation by HAVANA group |
Feature type (type ) | Region type. Examples include coding sequence (CDS), exon (exon), 3' UTR (three_prime_UTR), 5' UTR (five_prime_UTR) |
Start (start ) | Genomic coordinate where the annotated sequence starts |
End (end ) | Genomic coordinate where the annotated sequence ends |
Score (score ) | A numeric value normally referring to the confidence score of the annotated sequence |
Strand (strand ) | Direction of the sequence in double-stranded DNA. Can be either '+' (forward/sense strand) or '-' (reverse/antisense strand) |
ORF (phase ) | Refers to the start of the open reading frame (ORF). Can be 0, 1, or 2 - 0 indicates that the first base of the feature is the first base of a codon |
Attributes (attributes ) | A list of name-value pairs separated by semicolons, in the format 'name=value'. Attributes differ in type and number based on the feature type and the annotation program |
Reading the gene annotation file into R
To make loading a gff file easy, we've included a function called parse_gff3_to_dataframe()
in the mscgm
package.
Let's use it to load the data now:
gff = mscgm::parse_gff3_to_dataframe( "gencode.v44.annotation.chr19.gff3.gz" )
Let's take a look at it:
print(gff)
# or
View(gff)
As the above function names suggests, the result is a 'data frame' - a two-dimensional structure where the rows are records and each column is a different variable. More specifically, we are using the tidyverse flavour of a data frame, known as a tibble. We recommend using tibbles where possible as they are easiest to work with.
If you look at the above output you'll notice something interesting. The function has extracted the ID
and Parent
attributes, and placed them into the data fram for us!
This is important because, as you'll recall from the earlier session, the ID
and Parent
fields form the main way to
link between records (such as between genes and their transcripts).
Dataframe basics
Let's do some basic manipulation of that data now, similar to what we did in the command-line.
For example we can look at the first few rows:
head(gff)
or hte last few rows:
tail(gff)
We can pull out specific rows or ranges of rows - for example, here is row 10:
gff[10,]
while here are rows 100 to 120:
gff[100:120,]
Similarly we can pull out specific columns. Suppose for example we decide we're only interested in the type and
position of records - not the score
, strand
, source
or the attributes:
gff[100:120, c( 1:3, 5:7 )]
Although this style of subsetting columns works, it also has the downside of not being very easy to understand later. A better option would be to specify the columns by name:
gff[100:120, c( "ID", "Parent", "seqid", "type", "start", "end" )]
Now when you read the code back later, you can see exactly what you're getting.
The above returns a new dataframe with just the chosen columns. What if you just want the values in a given column? There are in fact two ways. First, if you know the column name, you can use the dollar notation, as in :
IDs = gff$ID
print( IDs )
Alternatively you can use the double-square-bracket notation:
IDs = gff[['ID']]
print( IDs )
As you can see, R has a few different ways to subset things using brackets and column names - you just have to get used to them. A brief recap is:
- to get a subset of values of a vector of values, use single square brackets:
IDs[5]
IDs[105:115]
- To get rectangular subsets of 2-dimensional objects, like data frames and matrices, use single square brackets with a comma in:
gff[100:110, 1:5]
- However, to get individual columns from a dataframe (or values from an R 'list') use double square brackets:
gff[['ID']]
- Finally you can also use dollar notation to get columns from a dataframe (or entries in a list) that have names:
gff$ID
This complexity is a bit annoying, but you get used to it after a while.
You can also get R to tell you basic information about the dataframe. For example, how many rows and columns are there? And what are the column names?
dim(gff)
colnames(gff)
Dataframe data types
In a dataframe, each column can be a variable with a different data type. For example, in our gff
dataframe, some of
the values like the ID
are strings (or 'characters' in R parlance), while some are numbers. These types are more or
less specified by the GFF3 specification.
Has R got these types right? You can see them in the above output, or you can get their types directly, for example:
typeof( gff[['ID']] )
typeof( gff[['start']] )
The types will be one of:
chr
orcharacter
(a string value)int
orinteger
(a whole number)dbl
ordouble
(a floating-point number)
Compare the types printed out above with the specification - do they have the right data types?