The gene type and name
Look back to that list of shortest genes above. If you look at the attributes you'll see that in fact not many of them
are protein coding genes. That information is encoded in the gene_type
attribute. Wouldn't it be better if we had
this as an extra column?
Luckily our mscgm
R package has a way to do this - the extra_attributes
argument. Let's re-load our data frame
again and try to find the shortest and longest protein-coding genes. While we're at it, let's get the gene name too:
gff = mscgm::parse_gff3_to_dataframe(
"gencode.v44.annotation.chr19.gff3.gz",
extra_attributes = c( "gene_type", "gene_name" )
)
View(gff)
The gene_type
and gene_name
attributes should now be seperate columns, over on the right side of the data frame.
We'll have to add back in our length
column:
gff$length = gff$end - gff$start + 1
Now we can run our query:
(
gff
%>% filter( type == 'gene' & gene_type == 'protein_coding' )
%>% arrange( length )
)
Voila! The shortest protein-coding gene appears to be "SMIM46", which has the distinction of only being 153 bases long!
Now use your subsetting / filtering skills to investigate this gene. How many transcripts does it have? How many exons? How long is the coding sequence? (Is it a multiple of three?) Where is it in the genome?
Also go and search for this gene in the UCSC Genome Browser. What do you make of it?
Note. to make sure the coordinates are compatible, make sure you have this set to the 'Dec. 2013 (GRCh38/hg38)` genome build.
At this point you should have loaded both the genome sequence for chromosome 19 (in a variable
called sequence
) and the gene annotations. Use this to extract the genome sequence corresponding to the coding sequence
of gene SMIM46.
Note. The result should look something like this:
[1] "T" "C" "A" "G" "G" "G" "T" "G" "C" "T" "G" "G" "C" "T" "G" "T" "G" "G"
[19] "C" "T" "T" "A" "G" "G" "C" "C" "C" "C" "C" "G" "G" "A" "A" "G" "G" "T"
[37] "G" "C" "G" "G" "T" "G" "C" "A" "G" "G" "T" "A" "G" "C" "C" "C" "A" "G"
[55] "G" "A" "A" "A" "C" "G" "T" "A" "C" "A" "G" "C" "C" "A" "G" "G" "T" "G"
[73] "G" "G" "C" "C" "C" "A" "G" "A" "G" "G" "A" "G" "C" "A" "G" "C" "T" "G"
[91] "C" "A" "G" "C" "C" "A" "C" "A" "G" "C" "T" "G" "G" "A" "A" "G" "G" "T"
[109] "G" "G" "T" "C" "T" "C" "C" "A" "A" "G" "T" "C" "C" "C" "C" "A" "C" "C"
[127] "C" "T" "G" "G" "C" "T" "G" "C" "T" "G" "C" "C" "T" "G" "A" "T" "C" "C"
[145] "C" "A" "G" "A" "T" "C" "C" "A" "T"
What about the amino acid sequence? Here is one way.
First, we group the sequence into three base pair segments using a matrix:
bases = sequence[ 14112324:14112476 ]
in_threes = matrix(
bases,
byrow = TRUE,
ncol = 3
)
print( in_threes )
Now we require a bit of magic code to stick those together into strings:
codons = sapply(
1:nrow( in_threes ),
function(i) {
paste( in_threes[i,], collapse = '' )
}
)
print(codons)
And let's look them up in the codon translation table:
codon_table <- c(
TTT = "F", TTC = "F", TTA = "L", TTG = "L",
TCT = "S", TCC = "S", TCA = "S", TCG = "S",
TAT = "Y", TAC = "Y", TAA = "*", TAG = "*",
TGT = "C", TGC = "C", TGA = "*", TGG = "W",
CTT = "L", CTC = "L", CTA = "L", CTG = "L",
CCT = "P", CCC = "P", CCA = "P", CCG = "P",
CAT = "H", CAC = "H", CAA = "Q", CAG = "Q",
CGT = "R", CGC = "R", CGA = "R", CGG = "R",
ATT = "I", ATC = "I", ATA = "I", ATG = "M",
ACT = "T", ACC = "T", ACA = "T", ACG = "T",
AAT = "N", AAC = "N", AAA = "K", AAG = "K",
AGT = "S", AGC = "S", AGA = "R", AGG = "R",
GTT = "V", GTC = "V", GTA = "V", GTG = "V",
GCT = "A", GCC = "A", GCA = "A", GCG = "A",
GAT = "D", GAC = "D", GAA = "E", GAG = "E",
GGT = "G", GGC = "G", GGA = "G", GGG = "G"
)
amino_acid_sequence = codon_table[ codons ]
print( amino_acid_sequence)
Note. the names of the amino acids can be looked up here.
Examine this exon in the genome browser. Does this amino acid sequence agree with what you see there?
:::