Analysing gene annotations using the data verbs

Now that you know how to use data verbs, let's use them to do something useful. Here is a set of challenges that re-analyse the gene annotation data using data verbs.

Note

Before starting, make sure you have the mscgm R package installed: library( mscgm ) should run without errors. If it doesn't, you should be able to install the package like this:

install.packages(
    "https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2024/code/mscgm.tgz",
    type = "source",
    repos = NULL
)
library( mscgm )

Note: if that doesn't work for any reason, you can also just run the relevant script in this folder. For this tutorial we need the parse_gff3_to_dataframe() function.

Now start by loading up the GENCODE GFF gene annotation file:

gff = mscgm::parse_gff3_to_dataframe(
    "https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2025/data/gencode.v49.basic.annotation.gff3.gz",
    extra_attributes = c( "gene_name", "gene_type" )
)

Warning

This is quite a big file! It might take a minute or two to load.

Note

If the above URL isn't working for some reason, you can also get the file from the GENCODE site. Make sure to choose the 'basic annotation' in GFF3 format. Just choose 'copy link' and use that URL instead.

At the time of writing, the URL is:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.basic.annotation.gff3.gz

What's in the data?

We studied this GFF file in the command-line gene annotation tutorial and also the R tutorial. If you're unsure what you're looking at, go back and look at those now.
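As a quick reminder of the file's structure, here is a minimal sketch that counts records by type. It assumes the data verbs in question are the dplyr ones. It runs on a tiny made-up data frame (toy_gff, with invented values) rather than the real gff object. The column names follow the GFF3 column names plus the two extra attributes requested above, and they are an assumption about what parse_gff3_to_dataframe() returns.

```r
library( dplyr )

# Hypothetical stand-in for the real `gff` data frame: a few invented records.
# Column names are assumed to match the GFF3 columns plus the extra attributes.
toy_gff = tibble(
    seqid = c( "chr1", "chr1", "chr1", "chr1", "chr1" ),
    type = c( "gene", "transcript", "exon", "exon", "CDS" ),
    start = c( 100, 100, 100, 400, 150 ),
    end = c( 900, 900, 300, 900, 300 ),
    gene_name = rep( "GENE_A", 5 ),
    gene_type = rep( "protein_coding", 5 )
)

# Count how many records there are of each type, most common first
type_counts = (
    toy_gff
    %>% group_by( type )
    %>% summarise( n = n() )
    %>% arrange( desc( n ) )
)
print( type_counts )
```

If the column names match, the same pipeline should work on the real data by swapping toy_gff for gff.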

To make the data easier to work with, let's split it up before we use it.
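One plausible way to do the split is with one filter() call per record type. This is a sketch only: it assumes dplyr is being used, that the data frame has a type column as in GFF3, and that splitting by record type is what is wanted. The toy_gff data frame and its values are invented for illustration.

```r
library( dplyr )

# Hypothetical stand-in for the real `gff` data frame (invented values)
toy_gff = tibble(
    seqid = c( "chr1", "chr1", "chr1", "chr2" ),
    type = c( "gene", "transcript", "exon", "gene" ),
    start = c( 100, 100, 100, 500 ),
    end = c( 900, 900, 300, 800 ),
    gene_name = c( "GENE_A", "GENE_A", "GENE_A", "GENE_B" ),
    gene_type = c( "protein_coding", "protein_coding", "protein_coding", "lncRNA" )
)

# Split into one data frame per record type (assumes a `type` column)
genes = toy_gff %>% filter( type == "gene" )
transcripts = toy_gff %>% filter( type == "transcript" )
exons = toy_gff %>% filter( type == "exon" )

print( genes )
```

Keeping genes, transcripts and exons in separate data frames means later steps don't need to repeat the same filter.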
