Going further: plotting and joining

To unleash the full power of R, let's make some plots.

First let's grab just those protein-coding genes:

protein_coding_genes = (
    gff
    %>% filter( type == 'gene' & gene_type == 'protein_coding' )
)
print(protein_coding_genes)

Note

How many protein-coding genes are there on chromosome 19?

## Plotting gene length

A simple plot we could make is a histogram of gene length. We could do this in base R:

hist( protein_coding_genes$length )

But a nicer way is to use ggplot:

p = (
    ggplot( data = protein_coding_genes )
    + geom_histogram( aes( x = length ))
    + theme_minimal()
)
print(p)

Some genes are short and others are long. Maybe we need a log scale?

print( p + scale_x_log10() )

Most genes are around 10kb in length, but there is considerable spread.

How many transcripts?

A more advanced question is: how many transcripts does each gene have?

To answer this let's first get hold of the transcript records.

transcripts = gff %>% filter( type == 'transcript' )
print( transcripts )

We're just going to count transcripts, so let's simplify this a bit now by just pulling out the ID columns in both the genes and transcripts lists:

protein_coding_genes = protein_coding_genes[,c("ID", "start", "end", "length" )]
transcripts = transcripts[,c("ID", "Parent")]

Finally to link things up we will rename the columns of both genes and transcripts tables so they match:

colnames( transcripts) = c( "transcript_id", "gene_id" )
colnames( protein_coding_genes ) = c( "gene_id", "start", "end", "length" )

Note

If you're unsure what we've done - print the dataframes to see.

Finally let's join the genes to the transcripts to count them:

joined = inner_join( protein_coding_genes, transcripts, by = "gene_id" )
print(joined)

You should now have a data frame that lists each protein=coding gene multiple times, once for each transcript.

That's it! We're ready to count. To count the number of transcripts per gene we will group by the gene ID, and then summarise over the transcript count. (We'll keep the length in for now so we can plot it later):

summarised = (
    joined
    %>% group_by( gene_id, length )
    %>% summarise( number_of_transcripts = n() )
)
print( summarised )

Note

Use your filtering skills to check a few genes - has this got the right count?

Now let's plot:

p = (
    ggplot( data = summarised )
    + geom_point( aes( x = length, y = number_of_transcripts ))
    + theme_minimal(16)
)
print(p)

Challenge

Identify the longest genes, or the ones with the most transcripts. Look them up in the genome browser.

Warning they can take a while to load - there really are a lot of transcripts to look at!

Going further: plotting and joining

How many transcripts?​

How many transcripts?