# The World of Genomic Data
## Bioinformatics & Genomic Data Quiz Challenge!
**Authors**: Helen Lockstone and Gavin Band

To round off the Bioinformatics, Statistics, and Data Interpretation in Genomic Analysis module, you can put your new skills to the test. In this challenge quiz, you will take a tour of genomic data resources and extract or compute information to answer the questions. Fill in your answers in the grid provided to solve the mystery message.

We'll organise small groups to work together, and you can also use Google where needed to look up information.

### Let's get started...

First, we'll delve into the alpha-globin locus on chromosome 16:

Navigate to the [UCSC Genome Browser](https://genome.ucsc.edu/) homepage and click on 'Genome Browser' in the menu bar at the top of the page.

* Type 'HBA1' - or 'hba1', it's not case sensitive - into the search box and click the 'go' button to the right
* In the list of results, click on the NCBI RefSeq gene identifier for HBA1 - this will take you to the genome browser view focused on the HBA1 gene
* Click on the RefSeq gene track (either the identifier HBA1/NM_000558.5 or anywhere along the gene) and read the summary at the top of the page it takes you to.

**Clue 1**

In this summary, find out what the abbreviation HbF stands for and enter it into the answer grid *(NB use the American spelling in your answer)*
___


Look at the RefSeq entry for HBA1 by following the NM_000558.5 link at the top of the current page.

:::note Note
The NCBI Handbook has detailed information about RefSeq annotations in [Chapter 18: The Reference Sequence (RefSeq) Database](https://www.ncbi.nlm.nih.gov/books/NBK21091/)
:::

* Click the FASTA link
* Copy the header row and gene sequence, paste them into a text file and save it (the extension can be any of .txt, .fa, .fasta)
* Read the content into R and determine the length of the mRNA sequence for HBA1.

:::Hint (hidden hint)
Look back at the CM4.3 tutorial 'Exploring FASTA Files' for an example of how to do this.

```
## read the file, skipping the header row; each remaining line becomes one element of a character vector
fasta <- scan('filename.fa', what = 'character', skip = 1)

## the sequence in the file occurs on multiple lines - let's put those back together:
fasta <- paste(fasta, collapse = '') # join the lines into one long string

## then use the nchar function to determine the number of characters in the string i.e. the length of the HBA1 mRNA sequence
nchar(fasta)
```
:::
 

**Clue 2**

Enter the length of the HBA1 mRNA sequence (in words) in the grid.
___

**Clue 3**

Next, determine the identity of the base at position 280 in the HBA1 gene (mRNA) sequence. Hint: you can use the 'substr' function in R to do this.
Write the name of the nucleotide base you identify in the grid.
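
As a minimal sketch of how `substr` picks out a single position, the example below uses a short made-up sequence (`toy_seq` is a hypothetical stand-in, not the HBA1 mRNA). Setting `start` and `stop` to the same value returns just that one character. For the clue, the same call could be applied to the sequence read in from your FASTA file, assuming it has been joined into a single string.

```r
# Toy sequence for illustration only (hypothetical, NOT the real HBA1 mRNA)
toy_seq <- "ACGTTGCA"

# Using the same value for start and stop extracts exactly one character
base_at_3 <- substr(toy_seq, start = 3, stop = 3)
print(base_at_3)
```
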
___


Look at the table at the link below, which describes various RefSeq identifiers:
https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

**Clue 4**

What does the NM_ prefix indicate about the nature of the HBA1 gene?
___


Return to the main summary page for HBA1 (by navigating back or starting from the main browser page again). 

You should see this page:
![HBA1 gene summary page](/images/image1.png)

The second table on the page links to more than a dozen external tools and databases, which together offer a rich and comprehensive source of information about the gene, its protein, biological pathways and role in disease.

* Click on the OMIM link in the table
* Briefly look at the information available for HBA1 to get a sense of the scope and depth of information available
* Return to the top of the page and click the green box 'PheneGene Graphics' and select 'radial'

Up pops a very cool graphical representation of HBA1, related genes and their links to disease. The corresponding entry numbers in the OMIM database are shown, and clicking on a gene symbol or disease name links directly to its detailed page. The graphic is also interactive: clicking the OMIM reference number in an annotation re-draws the map centred on that gene or disease, so you can explore related conditions.

**Clue 5**

Once you've explored this a bit, find out what the acronym OMIM stands for - enter the second and third words into your grid.

___


[AmiGO 2](http://amigo.geneontology.org/amigo/landing) is a project to create the next official web-based set of tools for searching and browsing the Gene Ontology database, which consists of a controlled vocabulary of terms covering biological concepts, and a large number of genes or gene products whose attributes have been annotated using GO terms.

* From the External Links menu on the right-hand side of the OMIM page for HBA1, click 'Gene Info' and follow the 'Gene Ontology' link in the drop-down list.

All GO annotations across different species are listed (132 in total) so let's filter them down a bit:
* Scroll down and select 'Organism' on the left; click the green box with a '+' symbol next to Homo sapiens
* Further down, select 'GO class (excluding "regulates")' and 'molecular_function' from the list of filters to add. 


**Clue 6**

There will be 38 annotations in the filtered list - write down the 3-word GO term for the 3rd entry in the results list.


:::note Note

The GO identifier for this term is GO:0005506 (this will appear when the mouse hovers over the GO term). The search can also be done in reverse, by entering this ID on the AmiGO homepage, to get to the answer another way. This is useful because most of these databases are updated daily and the results list is likely to change over time.

Back in the UCSC genome browser page for HBA1, if you scroll a long way down you will find GO annotation terms for the gene and other things like protein structure information. 
:::

___

Recall that HBA1 sits within the human alpha globin gene cluster, which comprises 7 loci. 

* Click back to return to the browser view and zoom out until the whole cluster is visible. 
* Look downstream for the gene immediately adjacent to the globin cluster (i.e. to the right of HBQ1). You can check you have the right gene from its coordinates: chr16:188,969-229,463
* Inspect the gene expression track just below the transcript depictions for this gene
* Click on the bar graph to see more details of how this gene is expressed in different tissues (GTEx track)

**Clue 7**

This gene has elevated expression in a particular brain region. Which region is it?

___


Now let's switch to the Ensembl genome browser and tools to look at a different gene, CFTR. Mutations in CFTR cause cystic fibrosis.


**Clue 8**

Use the Ensembl genome browser to search for CFTR in human genome GRCh38.p13. Note down, in words, the chromosome on which CFTR is located. 

___


Search online to find out some details about the mutations that cause cystic fibrosis. The most common mutation is a deletion of the 508th amino acid, about halfway along the gene. 

**Clue 9**

Write down the chemical name of the amino acid that is deleted.

*(You can solve this however you choose - programming or Google - but bonus points/kudos if you do it bioinformatically!)*
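
If you go the bioinformatic route, one possible approach is sketched below: extract the residue at a given position from a protein sequence string, then translate its one-letter code into a full name. The peptide `toy_protein` and the names `aa_names`, `position` and `residue_name` are made up for illustration. You would need to obtain the real CFTR protein sequence yourself (for example as a FASTA file) and join it into a single string first.

```r
# Standard one-letter amino acid codes mapped to full names
aa_names <- c(
  A = "Alanine",    R = "Arginine",      N = "Asparagine", D = "Aspartic acid",
  C = "Cysteine",   Q = "Glutamine",     E = "Glutamic acid", G = "Glycine",
  H = "Histidine",  I = "Isoleucine",    L = "Leucine",    K = "Lysine",
  M = "Methionine", F = "Phenylalanine", P = "Proline",    S = "Serine",
  T = "Threonine",  W = "Tryptophan",    Y = "Tyrosine",   V = "Valine"
)

# Toy peptide for illustration only (hypothetical, NOT the CFTR protein)
toy_protein <- "MKVLAG"

# Pull out the residue at the chosen position and look up its name
position <- 4
residue <- substr(toy_protein, position, position)
residue_name <- unname(aa_names[residue])
print(residue_name)
```
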

___

**Clue 10**

In DNA sequencing, each individual base call is assigned a quality score. What is the name given to these quality scores?

:::hint Hint
Look back at the lecture from CM4.5 if you are not sure. 
It is also a homophone for the first name of the scientist who invented the earliest DNA sequencing method...though that is coincidental I think??  
:::
___


**Clue 11**

Returning to HBA1, use an R function to determine which nucleotide appears most frequently in the HBA1 gene sequence.
Write down the name of the resulting nucleotide base in your answer grid.
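
One way to count bases, sketched here on a made-up sequence (`toy_seq` is hypothetical, not HBA1), is to split the string into single characters and tabulate them with `table`. `which.max` then returns the most frequent entry. Note that it reports only the first one if there is a tie.

```r
# Toy sequence for illustration only (hypothetical, NOT the real HBA1 mRNA)
toy_seq <- "ACGGTGGA"

# Split the string into a vector of single characters
bases <- strsplit(toy_seq, split = "")[[1]]

# Count how often each base occurs
base_counts <- table(bases)
print(base_counts)

# Name of the most frequent base (first one if tied)
most_common <- names(which.max(base_counts))
print(most_common)
```
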

___


**Clue 12**

___

 
**Clue 13**


___


**Clue 14**

The Ensembl gene annotation curation team that shares its name with the capital of Cuba will give you the answer to clue 14.
___


C9orf72 clue

**Clue 15**

___


**Clue 16**

Searching for the CFTR gene in the Ensembl database/browser will reveal its Ensembl gene level identifier, which has the prefix ENSG. 
Write down the full Ensembl gene ID for CFTR.

___


**Clue 17**

___


Use the gene annotation file gencode.v41.annotation.gff3.gz and your programming skills to work out what proportion of protein-coding genes lie on the X chromosome. 

**Clue 18**

Enter the result as a numeric value with 4 decimal places in the grid.
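
A possible outline of the calculation is sketched below on a tiny in-memory GFF3 table, so that it runs without the real file. The rows, IDs and the `toy` source label are invented for illustration. The sketch assumes that gene records have `gene` in the third column and carry a `gene_type=protein_coding` attribute, and that the X chromosome is named `chrX`. Check a few lines of your own file to confirm this before relying on it.

```r
# Tiny made-up GFF3 content (hypothetical rows, for illustration only)
toy_gff <- paste(
  "##gff-version 3",
  "chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1;gene_type=protein_coding",
  "chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tID=t1;Parent=g1;gene_type=protein_coding",
  "chr2\ttoy\tgene\t50\t400\t.\t-\t.\tID=g2;gene_type=lncRNA",
  "chrX\ttoy\tgene\t10\t700\t.\t+\t.\tID=g3;gene_type=protein_coding",
  "chr3\ttoy\tgene\t10\t700\t.\t+\t.\tID=g4;gene_type=protein_coding",
  "chr4\ttoy\tgene\t10\t700\t.\t+\t.\tID=g5;gene_type=protein_coding",
  sep = "\n"
)

# Read the nine tab-separated GFF3 columns, skipping '#' comment lines
gff <- read.delim(
  text = toy_gff, header = FALSE, comment.char = "#", quote = "",
  stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

# Keep gene-level records annotated as protein coding
is_pc_gene <- gff$type == "gene" &
  grepl("gene_type=protein_coding", gff$attributes, fixed = TRUE)
pc_genes <- gff[is_pc_gene, ]

# Proportion of those genes located on chrX, rounded to 4 decimal places
prop_x <- mean(pc_genes$seqid == "chrX")
print(round(prop_x, 4))
```

For the real annotation, replacing `text = toy_gff` with `gzfile("gencode.v41.annotation.gff3.gz")` as the first argument should read the compressed file directly, assuming it is in your working directory.
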

___



**Clue 19**

Explore the file 'season.txt' on the command line or using R. 
How many lines does season.txt contain? Write your answer in the grid. 
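
In R, one simple way to count lines is to read the file with `readLines` and take the length of the result. The sketch below writes a small temporary file so that it runs on its own. `toy_file` and its contents are hypothetical, and you would use `"season.txt"` in its place.

```r
# Create a small temporary file to stand in for season.txt (hypothetical content)
toy_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), toy_file)

# readLines returns one element per line, so length() gives the line count
n_lines <- length(readLines(toy_file))
print(n_lines)
```
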

___


* Extract lines 6 and 7 of season.txt - store them in a new object or file for convenience
* Replace the word "Christmas" with "Xmas"
* Replace "And" with "and"
* Concatenate the two lines and print to screen
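
The steps above could be sketched in R as follows. The text in `toy_lines` is invented for illustration and is not the content of season.txt. With the real file you would start from `readLines("season.txt")` instead. `gsub` with `fixed = TRUE` replaces every literal occurrence of the pattern, including inside longer words, so check the result by eye.

```r
# Made-up stand-in for the lines of season.txt (hypothetical content)
toy_lines <- c(
  "Toy line 1", "Toy line 2", "Toy line 3", "Toy line 4", "Toy line 5",
  "Snow fell softly on Christmas morning",
  "And the kettle was already on",
  "Toy line 8"
)

# Extract lines 6 and 7
picked <- toy_lines[6:7]

# Replace the words as literal (fixed) strings
picked <- gsub("Christmas", "Xmas", picked, fixed = TRUE)
picked <- gsub("And", "and", picked, fixed = TRUE)

# Concatenate the two lines into one string and print it
message_text <- paste(picked, collapse = " ")
print(message_text)

# Take the last 7 words of the combined message
words <- strsplit(message_text, " ", fixed = TRUE)[[1]]
last_words <- paste(tail(words, 7), collapse = " ")
print(last_words)
```
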

**Clue 20**

Enter the last 7 words of the message in your grid.

___

**Congratulations - you have completed the quiz!**

Submit your group's answers for the vertical and horizontal messages that now appear in your grid. 

___