Authors: Helen Lockstone and Gavin Band
To round off the Bioinformatics, Statistics, and Data Interpretation in Genomic Analysis module, you can put your new skills to the test. In this challenge quiz, you will take a tour of genomic data resources and extract or compute information to answer the questions. Fill in your answers in the grid provided to solve the mystery message.
We'll organise small groups to work together, and you can also use Google where needed to look up information.
First, we'll delve into the alpha-globin locus on chromosome 16:
Navigate to the UCSC Genome Browser homepage. Click on 'Genome Browser' from the menu bar at the top of the page
Clue 1
In this summary, find out what the abbreviation HbF stands for and enter it into the answer grid (NB: use the American spelling in your answer).
Look at the RefSeq entry for HBA1 by following the NM_000558.5 link at the top of the current page.
:::note Note
The NCBI Handbook has detailed information about RefSeq annotations in Chapter 18: The Reference Sequence (RefSeq) Database
:::
:::Hint (hidden hint) Look back at CM4.3 tutorial 'Exploring FASTA Files' for an example of how to do this.
fasta <- scan('filename.fa', what = 'character', skip = 1) #Read the sequence lines, skipping the FASTA header line
## the sequence in the file occurs on multiple lines - let's put those back together:
fasta <- paste(fasta, collapse = '') #Join the separate lines into one long string
## then use the nchar function to determine the number of characters in the string i.e. the length of the HBA1 gene sequence
nchar(fasta)
:::
Clue 2
Enter the length of HBA1 mRNA sequence (in words) in the grid.
Clue 3
Next, determine the identity of the base at position 280 in the HBA1 gene (mRNA) sequence. Hint: you can use the 'substr' function in R to do this. Write the name of the nucleotide base you identify in the grid.
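As a minimal sketch of the 'substr' approach, the snippet below pulls a single base out of a short made-up sequence. The name `toy_seq` and its contents are illustrative only (this is not the HBA1 sequence), and the sketch assumes the whole sequence is held as one string. Note that 'substr' uses 1-based, inclusive start and stop positions.

```r
# Toy sequence for illustration only (not HBA1)
toy_seq <- "ACGTTGCAAC"

# Extract the base at one position: start and stop are the same index
pos <- 4
base_at_pos <- substr(toy_seq, start = pos, stop = pos)
print(base_at_pos)
```

For the quiz, the same call should work once `toy_seq` is replaced by the HBA1 sequence as a single string and `pos` is set to 280.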
Look at the table at the link below, which describes various RefSeq identifiers: https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly
Clue 4
What does the NM_ prefix indicate about the nature of the HBA1 gene?
Return to the main summary page for HBA1 (by navigating back or starting from the main browser page again).
You should see this page: ![](/images/image1.png)
The second table on the page provides links to more than a dozen external tools and databases that provide a rich and comprehensive source of information about the gene, protein, biological pathways and role in disease.
Up pops a very cool graphical representation of HBA1, related genes and their links to disease; the corresponding entry numbers in the OMIM database are shown, and clicking on a gene symbol or disease name links directly to its detailed page. It is also interactive: clicking the OMIM reference number in an annotation re-draws the map centred on that gene or disease, so you can explore related conditions.
Clue 5
Once you've explored this a bit, find out what the acronym OMIM stands for - enter the second and third words into your grid.
AmiGO 2 is a project to create the next official web-based set of tools for searching and browsing the Gene Ontology database, which consists of a controlled vocabulary of terms covering biological concepts, and a large number of genes or gene products whose attributes have been annotated using GO terms.
All GO annotations across different species are listed (132 in total) so let's filter them down a bit:
Clue 6
There will be 38 annotations in the filtered list - write down the 3-word GO term for the 3rd entry in the results list.
:::note Note
The GO identifier for this term is GO:0005506 (this will appear when the mouse hovers over the GO term). The search can also be done in reverse by entering this ID in the AmiGO homepage to get to the answer another way. This is useful because most of these databases are updated daily, so the results list is likely to change over time.
Back in the UCSC genome browser page for HBA1, if you scroll a long way down you will find GO annotation terms for the gene and other things like protein structure information. :::
Recall that HBA1 sits within the human alpha globin gene cluster, which comprises 7 loci.
Clue 7
This gene has elevated expression in a particular brain region. Which region is it?
Now let's switch to the Ensembl genome browser and tools to look at a different gene, CFTR. Mutations in CFTR cause cystic fibrosis.
Clue 8
Use the Ensembl genome browser to search for CFTR in human genome GRCh38.p13. Note down, in words, the chromosome on which CFTR is located.
Search online to find out some details about the mutations that cause cystic fibrosis. The most common mutation is a deletion of the 508th amino acid, about halfway along the gene.
Clue 9
Write down the chemical name of the amino acid that is deleted.
(You can solve this however you choose - programming or google - but bonus points/kudos if you do it bioinformatically!)
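One possible bioinformatic route is to translate a coding sequence and look up the residue at the position of interest. The sketch below does this in base R on a made-up sequence. `toy_cds` is illustrative only (it is not the CFTR coding sequence), and the lookup table is the standard genetic code written out with the bases ordered T, C, A, G. It assumes the sequence starts in frame at the first base.

```r
# Build the standard genetic code as a named lookup (codon -> one-letter amino acid)
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino_acids <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
genetic_code <- setNames(amino_acids, codons)

# Toy coding sequence for illustration only (not CFTR)
toy_cds <- "ATGGCCAAAGGTTAA"

# Cut the sequence into consecutive three-base codons and translate each one
starts <- seq(1, nchar(toy_cds) - 2, by = 3)
toy_codons <- substring(toy_cds, starts, starts + 2)
protein <- unname(genetic_code[toy_codons])

# Amino acid at a chosen position in the translated protein
residue <- protein[3]
print(residue)
```

With the real CFTR coding sequence in place of `toy_cds` (downloaded from Ensembl, for example), indexing `protein` at 508 would give the one-letter code of the deleted residue, which you can then convert to its full chemical name.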
Clue 10
In DNA sequencing, the identification of individual nucleotide bases is assigned a quality score. What is the name given to these quality scores?
:::hint Hint
Look back at the lecture from CM4.5 if you are not sure.
It is also a homophone for the first name of the scientist who invented the earliest DNA sequencing method...though that is coincidental I think??
:::
Clue 11
Returning to HBA1, use an R function to determine which nucleotide appears most frequently in the HBA1 gene sequence. Write down the name of the resulting nucleotide base in your answer grid.
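A minimal sketch of one way to do this uses 'strsplit' to break the sequence into single characters and 'table' to tally them. The sequence `toy_seq` is made up for illustration (it is not HBA1), and the sketch assumes the sequence is stored as one string.

```r
# Toy sequence for illustration only (not HBA1)
toy_seq <- "ACGGTGGCAC"

# Split the string into single characters, then tally each base
bases <- strsplit(toy_seq, split = "")[[1]]
base_counts <- table(bases)
print(base_counts)

# Name of the most frequent base (the first one if there is a tie)
most_common <- names(base_counts)[which.max(base_counts)]
print(most_common)
```

The same steps should apply unchanged to the HBA1 sequence once it has been read in and joined into a single string.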
Clue 12
Clue 13
Clue 14
The Ensembl gene annotation curation team that shares its name with the capital of Cuba will give you the answer to clue 14.
C9orf72 clue
Clue 15
Clue 16
Searching for the CFTR gene in the Ensembl database/browser will reveal its Ensembl gene level identifier, which has the prefix ENSG. Write down the full Ensembl gene ID for CFTR.
Clue 17
Use the gene annotation file gencode.v41.annotation.gff3.gz and your programming skills to work out what proportion of protein-coding genes lie on the X chromosome.
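The calculation can be sketched in base R as below, using a tiny made-up GFF3 file rather than the real annotation. The sketch assumes the usual nine tab-separated GFF3 columns, that gene-level records have `gene` in the type column, that the attributes column contains `gene_type=protein_coding`, and that the X chromosome is named `chrX`. Check these assumptions against the GENCODE file before relying on the result. The column names are chosen here for readability and do not appear in the file itself.

```r
# Toy GFF3 content for illustration only (made-up genes and coordinates)
toy_lines <- c(
  "##gff-version 3",
  "chr1\tTOY\tgene\t100\t500\t.\t+\t.\tID=g1;gene_type=protein_coding;gene_name=A",
  "chr1\tTOY\ttranscript\t100\t500\t.\t+\t.\tID=t1;Parent=g1;gene_type=protein_coding",
  "chr1\tTOY\tgene\t900\t1200\t.\t-\t.\tID=g2;gene_type=lncRNA;gene_name=B",
  "chrX\tTOY\tgene\t50\t400\t.\t+\t.\tID=g3;gene_type=protein_coding;gene_name=C",
  "chr2\tTOY\tgene\t10\t90\t.\t+\t.\tID=g4;gene_type=protein_coding;gene_name=D",
  "chrX\tTOY\tgene\t700\t800\t.\t-\t.\tID=g5;gene_type=protein_coding;gene_name=E"
)
toy_file <- tempfile(fileext = ".gff3")
writeLines(toy_lines, toy_file)

# Read the nine GFF3 columns, skipping header lines that start with '#'
gff <- read.delim(toy_file, header = FALSE, comment.char = "#", quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqid", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"))

# Keep gene-level records whose attributes mark them as protein coding
is_pc_gene <- gff$type == "gene" &
  grepl("gene_type=protein_coding", gff$attributes, fixed = TRUE)
pc_genes <- gff[is_pc_gene, ]

# Proportion of those genes on chromosome X, rounded to 4 decimal places
prop_x <- round(mean(pc_genes$seqid == "chrX"), 4)
print(prop_x)
```

Filtering on the type column matters because transcript and exon records carry the same `gene_type` attribute, so counting every matching line would count each gene several times. For the real file, replace `toy_file` with the path to gencode.v41.annotation.gff3.gz; 'read.delim' can normally read gzip-compressed files directly.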
Clue 18
Enter the result as a numeric value with 4 decimal places in the grid.
Clue 19
Explore the file 'season.txt' on the command line or using R. How many lines does season.txt contain? Write your answer in the grid.
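In R, one simple way to count lines is to read them with 'readLines' and take the length of the result. The sketch below uses a small temporary file as a stand-in, since the contents of season.txt are not reproduced here.

```r
# Toy file for illustration only (stands in for season.txt)
toy_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), toy_file)

# Read every line into a character vector and count the elements
n_lines <- length(readLines(toy_file))
print(n_lines)
```

On the command line, `wc -l season.txt` gives the equivalent count.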
Clue 20
Enter the last 7 words of the message in your grid.
Congratulations - you have completed the quiz!
Submit your group's answers for the vertical and horizontal messages that now appear in your grid.