Working with gene annotations in R

To go further, let's also load the gene annotation file we used earlier. For the purposes of this tutorial, we'll use a version of this file for genes on chromosome 19, that can be downloaed from Ensembl. Start by downloading this now using the command

curl -O https://www.well.ox.ac.uk/bioinformatics/training/msc_gm/2023/data/gencode.v44.annotation.chr19.gff3.gz

Use less (or rather zless, since it's still compressed) to take a quick look at the file structure.

Quick recap

We looked at gene annotations files already, but here is a quick recap. Gene annotation files contain information about the genes in a genome and their structure (i.e. their exons, transcripts, coding sequences etc.) They come in a somewhat standard file format known as GFF3 (General Feature Format v3). A similar but older format known as GTF (Gene Transfer Format) is also found sometimes, but we'll use the gff version. See these pages for more information about gff files from Ensembl and NCBI

The main content of the file is some tab-delimited text split into 9 columns, each row corresponding to an annotation entry. These rows start after several 'header' rows with general information about the file; header rows start with a # symbol and are just one column.

Column	Description
Sequence name (`seqid`)	Name of the chromosome or scaffold. Watch out for differences between annotation consortia - e.g. chromosome 1 could be denoted as 'chr1' or just '1'
Source (`source`)	Annotation source. Includes 'ensembl' - automatic annotation program and 'havana' - manual annotation by HAVANA group
Feature type (`type`)	Region type. Examples include coding sequence (CDS), exon (exon), 3' UTR (three_prime_UTR), 5' UTR (five_prime_UTR)
Start (`start`)	Genomic coordinate where the annotated sequence starts
End (`end`)	Genomic coordinate where the annotated sequence ends
Score (`score`)	A numeric value normally referring to the confidence score of the annotated sequence
Strand (`strand`)	Direction of the sequence in double-stranded DNA. Can be either '+' (forward/sense strand) or '-' (reverse/antisense strand)
ORF (`phase`)	Refers to the start of the open reading frame (ORF). Can be 0, 1, or 2 - 0 indicates that the first base of the feature is the first base of a codon
Attributes (`attributes`)	A list of name-value pairs separated by semicolons, in the format 'name=value'. Attributes differ in type and number based on the feature type and the annotation program

Reading the gene annotation file into R

To make loading a gff file easy, we've included a function called parse_gff3_to_dataframe() in the mscgm package. Let's use it to load the data now:

gff = mscgm::parse_gff3_to_dataframe( "gencode.v44.annotation.chr19.gff3.gz" )

Let's take a look at it:

print(gff)
# or
View(gff)

As the above function names suggests, the result is a 'data frame' - a two-dimensional structure where the rows are records and each column is a different variable. More specifically, we are using the tidyverse flavour of a data frame, known as a tibble. We recommend using tibbles where possible as they are easiest to work with.

Note

If you look at the above output you'll notice something interesting. The function has extracted the ID and Parent attributes, and placed them into the data fram for us!

This is important because, as you'll recall from the earlier session, the ID and Parent fields form the main way to link between records (such as between genes and their transcripts).

Dataframe basics

Let's do some basic manipulation of that data now, similar to what we did in the command-line.

For example we can look at the first few rows:

head(gff)

or hte last few rows:

tail(gff)

We can pull out specific rows or ranges of rows - for example, here is row 10:

gff[10,]

while here are rows 100 to 120:

gff[100:120,]

Similarly we can pull out specific columns. Suppose for example we decide we're only interested in the type and position of records - not the score, strand, source or the attributes:

gff[100:120, c( 1:3, 5:7 )]

Note

Although this style of subsetting columns works, it also has the downside of not being very easy to understand later. A better option would be to specify the columns by name:

gff[100:120, c( "ID", "Parent", "seqid", "type", "start", "end" )]

Now when you read the code back later, you can see exactly what you're getting.

The above returns a new dataframe with just the chosen columns. What if you just want the values in a given column? There are in fact two ways. First, if you know the column name, you can use the dollar notation, as in :

IDs = gff$ID
print( IDs )

Alternatively you can use the double-square-bracket notation:

IDs = gff[['ID']]
print( IDs )

Warning

As you can see, R has a few different ways to subset things using brackets and column names - you just have to get used to them. A brief recap is:

to get a subset of values of a vector of values, use single square brackets:

IDs[5]
IDs[105:115]

To get rectangular subsets of 2-dimensional objects, like data frames and matrices, use single square brackets with a comma in:

gff[100:110, 1:5]

However, to get individual columns from a dataframe (or values from an R 'list') use double square brackets:

gff[['ID']]

Finally you can also use dollar notation to get columns from a dataframe (or entries in a list) that have names:

gff$ID

This complexity is a bit annoying, but you get used to it after a while.

You can also get R to tell you basic information about the dataframe. For example, how many rows and columns are there? And what are the column names?

dim(gff)
colnames(gff)

Dataframe data types

In a dataframe, each column can be a variable with a different data type. For example, in our gff dataframe, some of the values like the ID are strings (or 'characters' in R parlance), while some are numbers. These types are more or less specified by the GFF3 specification.

Has R got these types right? You can see them in the above output, or you can get their types directly, for example:

typeof( gff[['ID']] )
typeof( gff[['start']] )

The types will be one of:

chr or character (a string value)
int or integer (a whole number)
dbl or double (a floating-point number)

Question

Compare the types printed out above with the specification - do they have the right data types?

Working with gene annotations in R

Quick recap​

Reading the gene annotation file into R​

Dataframe basics​

Dataframe data types​

Quick recap

Reading the gene annotation file into R

Dataframe basics

Dataframe data types