Advanced data frame manipulation

Subsetting data by row index is OK, but for real work we'll want to pull out rows that match particular conditions - for example, we might want to find all the gene records. We will also want to be able to manipulate the data.

Filtering using base R

Here are two ways to do this. First, make sure you have tidyverse loaded:

library( tidyverse )

Note. This might print a few messages out including some 'conflicts' - that's ok, you can safely ignore them.

A basic way to subset with conditions in R is to put a condition inside the square brackets, like this:

gff[ gff$type == 'gene', ]

Conditions can also be combined together using & ('and') and | ('or'). For example, let's find genes that are at least 10kb long:

gff[ gff$type == 'gene' & (gff$end - gff$start) >= 10000, ]

Note

The above contains a subtle bug - do you know what it is?

This type of subsetting works fine, but really it is annoying. Why? Well, because you have to keep typing gff.
In the above example this wasn't too bad, but what if your variable had been called 'human_genes'?

human_genes[ human_genes$type == 'gene' & (human_genes$end - human_genes$start) >= 10000, ]

It's easy to get more 'noise' than content in your code this way.

Luckily there is a solution to this - use the subset function (from base R) or filter (from dplyr, which is part of tidyverse). In this tutorial we will use the latter, but they work in much the same way:

filter( gff, type == 'gene' & (end-start+1) >= 10000 )
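
For comparison, here is a minimal sketch of the same query written with base R's subset(). It uses a small made-up data frame called toy_gff (a hypothetical stand-in, not the real gff data) so that it runs on its own:

```r
# A made-up, three-row stand-in for the gff data frame (hypothetical values)
toy_gff = data.frame(
  type  = c( 'gene', 'gene', 'CDS' ),
  start = c( 1, 5001, 1 ),
  end   = c( 10000, 6000, 20000 )
)

# subset() looks up column names inside the data frame, just like filter()
long_genes = subset( toy_gff, type == 'gene' & (end - start + 1) >= 10000 )
print( long_genes )
```

The first toy record covers positions 1 to 10000, which is exactly 10,000 bases, so it is only picked up when the length is computed as end - start + 1.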

Adding columns

Wouldn't it be useful to have the record length as a new column? That's easy. Either do:

gff$length = gff$end - gff$start + 1

or

gff[['length']] = gff$end - gff$start + 1

If length is not already a column, this will create a new column and assign those values to it.

(Warning: if length is already the name of a column this will overwrite its values. You should make sure first!)
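
One possible way to 'make sure first' is to test the column names before assigning. The sketch below is only an illustration: toy_gff and its values are made up, and stopping with an error is just one of several reasonable choices.

```r
# A made-up stand-in for the gff data frame (hypothetical values)
toy_gff = data.frame(
  type  = c( 'gene', 'CDS' ),
  start = c( 1, 101 ),
  end   = c( 1000, 400 )
)

# Only add the column if no column of that name exists yet
if( 'length' %in% colnames( toy_gff ) ) {
  stop( "toy_gff already has a 'length' column" )
} else {
  toy_gff$length = toy_gff$end - toy_gff$start + 1
}
print( toy_gff )
```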

Sorting

Another useful feature of dplyr is a function to sort (or 'arrange') datasets - it's called arrange. For example, let's arrange the data frame in order of record length, using the length column we just added:

arrange( gff, length )

You should see there are a bunch of short 'biological regions' in there, as well as some CDS records.

Note

This is also achievable in base R, but it's a bit complicated. First, create a variable representing the correct order using order:

my_desired_order = order( gff$length )

Now use row indexing to order the data frame:

gff[ my_desired_order, ]

All this is a bit clunky, and we recommend using the arrange() function from dplyr, which makes life easy.
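
To see what order() actually returns, here is a small sketch on a made-up data frame (toy_gff and its values are hypothetical, chosen only so the example runs on its own):

```r
# A made-up stand-in for the gff data frame (hypothetical values)
toy_gff = data.frame(
  type   = c( 'gene', 'CDS', 'exon' ),
  length = c( 5000, 300, 1200 )
)

# order() returns the row numbers that would put the values in increasing order
my_desired_order = order( toy_gff$length )
print( my_desired_order )

# Indexing rows by those numbers gives the sorted data frame
sorted = toy_gff[ my_desired_order, ]
print( sorted )

# decreasing = TRUE reverses the order (longest first)
sorted_desc = toy_gff[ order( toy_gff$length, decreasing = TRUE ), ]
print( sorted_desc )
```

So order() does not sort anything itself. It only says which row should come first, second, and so on, and the row indexing does the rest.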

Pipelines

Even more fun is that (with the dplyr library loaded) we can use pipelines - much like we did in bash. For example, the filter command from earlier can be rewritten by 'piping' the data frame into the filter operation, this time using the new length column:

gff %>% filter( type == 'gene' & length >= 10000 )
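
The pipe simply passes whatever is on its left as the first argument of the function on its right. The following sketch checks this on a made-up data frame (toy_gff and its values are hypothetical, not the real gff data):

```r
library( dplyr )

# A made-up stand-in for the gff data frame (hypothetical values)
toy_gff = data.frame(
  type   = c( 'gene', 'gene', 'CDS' ),
  length = c( 12000, 800, 15000 )
)

# Piped form: toy_gff becomes the first argument of filter()
piped = toy_gff %>% filter( type == 'gene' & length >= 10000 )

# Direct form: the same call written out in full
direct = filter( toy_gff, type == 'gene' & length >= 10000 )

# Both forms should give the same result
print( identical( piped, direct ) )
```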

What good is this? Well, just like on the command line, this allows you to start building pipelines of commands that do useful operations. For example, let's build a pipeline that finds gene records, then sorts them by length. To format this pipeline nicely, with one step per line, we will put it in round brackets:

(
gff
%>% filter( type == 'gene' )
%>% arrange( length )
)

The shortest genes are just a few hundred bases long!
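
A note on the layout used above: the round brackets are what let each step start on a new line with %>%. Without them, R treats a line holding just gff as a complete command, and the next line, starting with %>%, is a syntax error. The sketch below shows the bracketed layout next to a common alternative that puts the pipe at the end of each line instead. It uses a made-up data frame (toy_gff, hypothetical values) so that it runs on its own:

```r
library( dplyr )

# A made-up stand-in for the gff data frame (hypothetical values)
toy_gff = data.frame(
  type   = c( 'gene', 'CDS', 'gene', 'gene' ),
  length = c( 12000, 300, 450, 3000 )
)

# Layout 1: round brackets, with each step starting with the pipe
shortest_first = (
  toy_gff
  %>% filter( type == 'gene' )
  %>% arrange( length )
)

# Layout 2: no brackets, with the pipe at the end of each line
shortest_first_2 = toy_gff %>%
  filter( type == 'gene' ) %>%
  arrange( length )

print( shortest_first )
```

Both layouts run the same pipeline, so which one to use is a matter of taste.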

What about the longest? This can be done using the desc() function in arrange():

(
gff
%>% filter( type == 'gene' )
%>% arrange( desc(length) )
)

Note

Remember we are measuring the extent of the genes on the genome here. This might be what you want, but note that it is not the same as the length of the mRNA (using only exons) or the length of the protein (coding sequence).