The gene type and name

Look back to that list of shortest genes above. If you look at the attributes you'll see that in fact not many of them are protein coding genes. That information is encoded in the gene_type attribute. Wouldn't it be better if we had this as an extra column?

Luckily our mscgm R package has a way to do this - the extra_attributes argument. Let's re-load our data frame again and try to find the shortest and longest protein-coding genes. While we're at it, let's get the gene name too:

gff = mscgm::parse_gff3_to_dataframe(
    "gencode.v44.annotation.chr19.gff3.gz",
    extra_attributes = c( "gene_type", "gene_name" )
)
View(gff)

The gene_type and gene_name attributes should now be seperate columns, over on the right side of the data frame.

We'll have to add back in our length column:

gff$length = gff$end - gff$start + 1

Now we can run our query:

(
    gff
    %>% filter( type == 'gene' & gene_type == 'protein_coding' )
    %>% arrange( length )
)

Voila! The shortest protein-coding gene appears to be "SMIM46", which has the distinction of only being 153 bases long!

Challenge

Now use your subsetting / filtering skills to investigate this gene. How many transcripts does it have? How many exons? How long is the coding sequence? (Is it a multiple of three?) Where is it in the genome?

Also go and search for this gene in the UCSC Genome Browser. What do you make of it?

Note. to make sure the coordinates are compatible, make sure you have this set to the 'Dec. 2013 (GRCh38/hg38)` genome build.

Challenge

At this point you should have loaded both the genome sequence for chromosome 19 (in a variable called sequence) and the gene annotations. Use this to extract the genome sequence corresponding to the coding sequence of gene SMIM46.

Note. The result should look something like this:

  [1] "T" "C" "A" "G" "G" "G" "T" "G" "C" "T" "G" "G" "C" "T" "G" "T" "G" "G"
 [19] "C" "T" "T" "A" "G" "G" "C" "C" "C" "C" "C" "G" "G" "A" "A" "G" "G" "T"
 [37] "G" "C" "G" "G" "T" "G" "C" "A" "G" "G" "T" "A" "G" "C" "C" "C" "A" "G"
 [55] "G" "A" "A" "A" "C" "G" "T" "A" "C" "A" "G" "C" "C" "A" "G" "G" "T" "G"
 [73] "G" "G" "C" "C" "C" "A" "G" "A" "G" "G" "A" "G" "C" "A" "G" "C" "T" "G"
 [91] "C" "A" "G" "C" "C" "A" "C" "A" "G" "C" "T" "G" "G" "A" "A" "G" "G" "T"
[109] "G" "G" "T" "C" "T" "C" "C" "A" "A" "G" "T" "C" "C" "C" "C" "A" "C" "C"
[127] "C" "T" "G" "G" "C" "T" "G" "C" "T" "G" "C" "C" "T" "G" "A" "T" "C" "C"
[145] "C" "A" "G" "A" "T" "C" "C" "A" "T"

Note

What about the amino acid sequence? Here is one way.

First, we group the sequence into three base pair segments using a matrix:

bases = sequence[ 14112324:14112476 ]
in_threes = matrix(
    bases,
    byrow = TRUE,
    ncol = 3
)
print( in_threes )

Now we require a bit of magic code to stick those together into strings:

codons = sapply(
    1:nrow( in_threes ),
    function(i) {
        paste( in_threes[i,], collapse = '' )
    }
)
print(codons)

And let's look them up in the codon translation table:

codon_table <- c(
    TTT = "F", TTC = "F", TTA = "L", TTG = "L",
    TCT = "S", TCC = "S", TCA = "S", TCG = "S",
    TAT = "Y", TAC = "Y", TAA = "*", TAG = "*",
    TGT = "C", TGC = "C", TGA = "*", TGG = "W",
    CTT = "L", CTC = "L", CTA = "L", CTG = "L",
    CCT = "P", CCC = "P", CCA = "P", CCG = "P",
    CAT = "H", CAC = "H", CAA = "Q", CAG = "Q",
    CGT = "R", CGC = "R", CGA = "R", CGG = "R",
    ATT = "I", ATC = "I", ATA = "I", ATG = "M",
    ACT = "T", ACC = "T", ACA = "T", ACG = "T",
    AAT = "N", AAC = "N", AAA = "K", AAG = "K",
    AGT = "S", AGC = "S", AGA = "R", AGG = "R",
    GTT = "V", GTC = "V", GTA = "V", GTG = "V",
    GCT = "A", GCC = "A", GCA = "A", GCG = "A",
    GAT = "D", GAC = "D", GAA = "E", GAG = "E",
    GGT = "G", GGC = "G", GGA = "G", GGG = "G"
)


amino_acid_sequence = codon_table[ codons ]
print( amino_acid_sequence)

Note. the names of the amino acids can be looked up here.

Examine this exon in the genome browser. Does this amino acid sequence agree with what you see there?

:::