The Bioinformatics & Genomic Data Quiz!

To round off the Bioinformatics, Statistics, and Data Interpretation in Genomic Analysis module, you can put your new skills to the test. In this challenge quiz, you will take a tour of genomic data resources and extract or compute information to answer the questions.

First - you need to be in a team! Pick your team now, or we can self-organise.

Then, to complete the quiz, you need the answer grid, which we've printed for you. Fill in your answers in the grid to solve the mystery message.

Rules:

You can use any tools you like to solve these puzzles.
The first team to complete all the answers and discover the message is the winner!

Good luck!

Clue 1: HBA1

First, we'll delve into the alpha-globin locus on chromosome 16:

Navigate to the UCSC Genome Browser homepage. Let's start by resetting all settings - in the 'Genome Browser' menu at the top, select 'Reset All User Settings'. Now you are ready to start!

Search for 'HBA1' (or 'hba1', it's not case sensitive) in the search box and click the 'go' button to the right.
In the list of results, click on the NCBI RefSeq gene identifier for HBA1 - this will take you to the genome browser view focused on the HBA1 gene.
Click on the gene in the 'NCBI RefSeq genes' track (either the identifier HBA1/NM_000558.5 or anywhere along the gene) and read the summary information about the gene. (For this question, don't use the GENCODE track which gives slightly different results.)

Question 1

What does the abbreviation HbF stand for? Enter the answer in the grid (NB use the American spelling in your answer).

Clue 2: sequence lengths

At the top of the summary page you'll see a link to the RefSeq entry for HBA1 - called NM_000558.5. Click on that now to visit NCBI RefSeq.

Note

The NCBI Handbook has detailed information about RefSeq annotations in Chapter 18: The Reference Sequence (RefSeq) Database

Let's use RefSeq to get the gene sequence:

Click the FASTA link
Copy and paste the header row and gene sequence and save in a text file (extension can be any of .txt, .fa, .fasta)

Question 2

Read the content into R.

What is the length of the mRNA sequence for HBA1? Enter this value (in words) into the grid.
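The rules allow any tool, so if you would like to cross-check your R result on the command line, the idea can be sketched as follows. This is only a minimal sketch on a made-up toy sequence: the file name toy.fa and its contents are assumptions for illustration, not the HBA1 record. It drops the FASTA header line, joins the sequence lines, and counts the remaining characters.

```shell
# Toy FASTA file (made-up sequence, NOT the real HBA1 record).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>toy_sequence made-up example
ACGTACGTAC
GGTTAACC
EOF

# Drop the header line, strip the newlines, then count what is left.
# (The final tr removes the padding some versions of wc add.)
seq_len=$(grep -v '^>' "$tmpdir/toy.fa" | tr -d '\n' | wc -c | tr -d ' ')
echo "Sequence length: $seq_len"

rm -rf "$tmpdir"
```

To try it on your own data, point grep at the file you saved from RefSeq instead of the toy file. If the file was saved on Windows it may contain carriage returns, which would also need stripping.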

Clue 3: getting a specific base

Question

What is the identity of the base at position 280 in the HBA1 mRNA sequence you just loaded? Write the name of the nucleotide base you identify in the grid.
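If you want to check a position outside R, one possible approach in the shell is sketched below. It uses a made-up toy sequence and an arbitrary position (the file name toy.fa, its contents, and the value of pos are all assumptions for illustration). Positions are counted from 1, and the header line must be removed first so that it does not shift the count.

```shell
# Toy FASTA file (made-up sequence, NOT the real HBA1 record).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>toy_sequence made-up example
ACGTACGTAC
GGTTAACC
EOF

# Arbitrary 1-based position chosen for the toy example.
pos=15

# Join the sequence into a single line, then cut out one character.
base=$(grep -v '^>' "$tmpdir/toy.fa" | tr -d '\n' | cut -c "$pos")
echo "Base at position $pos: $base"

rm -rf "$tmpdir"
```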

Clue 4: gene identifiers

Look at the table at the link below, which describes various RefSeq identifiers: https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

Question

What does the NM_ prefix indicate about the nature of the HBA1 sequence NM_000558.5? Write the answer in the grid.

Clue 5: OMIM in the gloamin'

Go back to the genome browser and this time click on the HBA1 gene in the 'GENCODE V44' track rather than the 'RefSeq' track. (Somewhat confusingly, clicking on the gene in different tracks like this gives you slightly different views of the data.)

(If you don't see the 'GENCODE' track - make sure you have clicked 'Reset All User Settings' in the 'Genome Browser' menu at the top of the screen first.)

You should see a page like this:

The second table on the page provides links to more than a dozen external tools and databases that provide a rich and comprehensive source of information about the gene, protein, biological pathways and role in disease. Let's look at OMIM:

Click on the OMIM link in the table
Briefly look at the information available for HBA1 to get a sense of the scope and depth of information available
Return to the top of the page and click the green box 'PheneGene Graphics' and select 'radial'

Up pops a very cool graphical representation of HBA1, related genes and their links to disease. The corresponding OMIM entry numbers are shown, and clicking a gene symbol or disease name takes you to its detailed page. The graphic is also interactive: clicking the OMIM reference number in an annotation re-draws the map centred on that gene or disease, so you can explore related conditions.

Question

What does the acronym OMIM stand for? (It's on this page somewhere!)
Enter the second and third words into your grid.

Clue 6: GOing places

AmiGO 2 is a project to create the next official web-based set of tools for searching and browsing the Gene Ontology database, which is a 'controlled vocabulary' of terms covering biological concepts. A large number of genes or gene products have been annotated using GO terms.

From the External Links menu on the right-hand side of the OMIM page for HBA1, click 'Gene Info' and follow the 'Gene Ontology' link in the drop-down list.

All GO annotations across different species are listed (132 in total) so let's filter them down a bit:

Scroll down and select 'Organism' on the left; select humans by clicking the green box with a '+' symbol next to 'Homo sapiens'.
Further down, select 'GO class (excluding "regulates")' and 'molecular_function' from the list of filters to add.

Question

There should be 38 annotations in the filtered list. Write down the three-word GO term for the 3rd entry in the results list.

Note

The GO identifier for this term is GO:0005506 (this will appear when the mouse hovers over the GO term). You can also run the search in reverse, entering this ID on the AmiGO homepage, to reach the answer another way. This is a useful check, because most of these databases are updated daily and the results list is likely to change over time.

Back in the UCSC genome browser page for HBA1, if you scroll a long way down you will find GO annotation terms for the gene and other things like protein structure information.

Clue 7: on the brain

Recall that HBA1 sits within the human alpha-globin gene cluster, which comprises 7 loci.

Go back to your UCSC genome browser tab and click 'back' to return to the browser view (or search again for HBA1) and zoom out until the whole cluster is visible.
Look downstream for the first gene with more than one exon, that is immediately adjacent to the globin cluster (i.e. to the right of HBQ1). To check you have the right gene, its coordinates are chr16:188,969-229,463. (Make sure you are using the GRCh38/hg38 assembly for this quiz.)
Inspect the gene expression track just below the transcript depictions for this gene
Click on the bargraph to see more details of how this gene is expressed in different tissues (GTEx track)

Question

This gene has elevated expression in a particular brain region. Which region is it? Write the name in the grid.

Clue 8: where's the gene?

Now let's switch to the Ensembl genome browser and tools to look at a different gene, CFTR. Mutations in CFTR cause cystic fibrosis. Use the Ensembl genome browser to search for CFTR in human genome GRCh38.p13.

Question

What chromosome is CFTR located on? Write down its name in words.

Clue 9: Amino acids

Search online to find out some details about the mutations that cause cystic fibrosis. The most common mutation is a deletion of the 508th amino acid, about halfway along the gene.

Question

Write down the chemical name of the amino acid that is deleted.

Note

You can solve this however you choose - programming or google - but bonus points/kudos if you do it bioinformatically!

Clue 10: The quality of bases

In DNA sequencing, the identification of individual nucleotide bases is assigned a quality score. They are usually represented on a particular scale.

Question

What is the name given to the scale these quality scores are presented on? Write it down in the grid.

Hint

Look back at the sequence data analysis tutorial if you are not sure.

The answer is also a homophone for the first name of the Nobel prize-winning scientist who invented the earliest DNA sequencing method...
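To make the idea of a per-base quality score concrete, here is a small sketch of how one quality character from a FASTQ file can be turned into a number. It assumes the common encoding in which the score is the character's ASCII code minus 33, and the usual definition of the error probability as 10^(-Q/10). The example character is arbitrary, so check which encoding your own data uses.

```shell
# One quality character as it might appear on the quality line of a FASTQ record.
qual_char='I'

# POSIX printf: a leading quote in the argument yields the character's code.
ascii=$(printf '%d' "'$qual_char")

# Assumption: the score is stored with an ASCII offset of 33.
score=$((ascii - 33))

# Probability that the base call is wrong: 10^(-Q/10).
err=$(awk -v q="$score" 'BEGIN { printf "%.4f", 10^(-q/10) }')

echo "char=$qual_char score=$score error_probability=$err"
```

So under this assumption a score of 40 corresponds to roughly one wrong call in ten thousand.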

Clue 11: Covering all the bases

Now return to your R session with the HBA1 sequence loaded.

Question

Use an R function to determine which nucleotide appears most frequently in the HBA1 gene sequence. Write down the name of the resulting nucleotide base in your answer grid.
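The question asks for an R function, but if you want a second opinion from the command line, a frequency table can be sketched like this. The toy sequence and the file name toy.fa are made up for illustration and are not the HBA1 record. The approach puts one base per line, then sorts and counts them.

```shell
# Toy FASTA file (made-up sequence, NOT the real HBA1 record).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>toy_sequence made-up example
ACGTTTGCAT
TTGACT
EOF

# Remove the header, print each base on its own line, then count and rank them.
counts=$(grep -v '^>' "$tmpdir/toy.fa" | grep -o '[ACGT]' | sort | uniq -c | sort -rn)
echo "$counts"

# The first row of the ranked table is the most frequent base.
top_base=$(echo "$counts" | head -n 1 | awk '{ print $2 }')
echo "Most frequent base: $top_base"

rm -rf "$tmpdir"
```

Note that this sketch only counts upper-case A, C, G and T, so lower-case or ambiguous bases would need a wider pattern.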

Clue 12: Disease association

Suppose you conduct a test to see whether a particular genetic variant is causing a disease. You collect a number of cases and controls from hospitals and form a 2x2 table to conduct an association test.

You see a strong signal - an estimated relative risk of four! It looks like a major risk factor.

But now you want to include some covariates to check the result is robust.

Question

What kind of statistical test could you use to do this?

Clue 13: From association to causality

Question

Thinking about causality - why might you want to include covariates anyway? To deal with possible "something" factors?

Clue 14: Team names

The Ensembl gene annotation curation team that shares its name with a city in Cuba will give you the answer to clue 14.

Clue 15: a repeat expansion

Amyotrophic Lateral Sclerosis, also known as Motor Neurone Disease, is a progressive and lethal disease of the nervous system. It has recently been in the news due to the death of Doddie Weir, as well as the death in 2018 of Stephen Hawking, both after a long struggle with the disease.

The most frequent known cause of ALS is a repeat expansion in the gene C9orf72, as shown in the following diagram from that paper:

The repeat occurs between the first two exons of (the longest transcript of) C9orf72. Despite being noncoding, it is a major risk factor for the disease:

"...The vast majority (>95%) of neurologically healthy individuals have ≤11 hexanucleotide repeats in the C9orf72 gene [...] an arbitrary cut-off of 30 repeats is used [to determine pathogenicity] in most studies, but larger expansions ranging from hundreds to thousands of repeats are most commonly observed in patients with frontotemporal dementia and ALS."

Use the UCSC genome browser to answer:

Which strand is C9orf72 transcribed on?
How many copies of the GGGGCC hexanucleotide repeat does the reference sequence carry?

Hint

Turning on the 'ClinVar' track under 'Phenotype and literature' may help you find it.

Question

Compute the number of nucleotide bases contained in these repeats. Write that number in words in the grid!

Hint

The repeat is close to the right (upstream) side of the exon at chr9:27,573,426 - 27,573,491.
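Once you have copied the sequence around the repeat out of the browser, the counting can be sketched as below. The toy sequence here is made up, and its number of copies is arbitrary rather than the reference value. The sketch finds the longest uninterrupted run of the repeat unit and derives both the number of copies and the number of bases from it. If a gene is transcribed on the reverse strand, the browser's forward-strand sequence will show the reverse complement of the unit (GGCCCC for GGGGCC), so set unit accordingly.

```shell
# Made-up toy sequence containing a tandem repeat (NOT the reference sequence).
toy_seq='TTAGGGGCCGGGGCCGGGGCCGGGGCCGGGGCCAT'
unit='GGGGCC'

# Extract every uninterrupted run of the unit and keep the longest one.
run=$(printf '%s\n' "$toy_seq" | grep -oE "($unit)+" |
      awk '{ if (length($0) > max) { max = length($0); best = $0 } } END { print best }')

# Bases in the run, and copies = bases / length of one unit.
n_bases=${#run}
n_copies=$(( n_bases / ${#unit} ))

echo "copies=$n_copies bases=$n_bases"
```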

Clue 16 and 17: Counting transcripts

Now let's get back to some command-line programming with gene annotations.

In your (Ubuntu or macOS) terminal, download or find the file gencode.v41.annotation.gff3.gz from this folder.

For simplicity, gunzip the file first.

Now use the file and your command-line skills (or, if you prefer, your R skills) to answer:

Question

How many transcripts are listed for the gene GYPA?

Write down the digits in the boxes for clues 16 and 17.

Hint

A command like

cat gencode.v41.annotation.gff3 | awk '$3=="transcript"'

will pull out all rows that are transcripts... but how to find those for GYPA?
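One way to narrow the rows down to a single gene is to add a second condition on the attributes column (column 9). The sketch below runs on a tiny made-up GFF3 file, so the gene names GENEA, GENEA2 and GENEB and the file name toy.gff3 are hypothetical. It assumes the attributes carry a gene_name=... entry, as GENCODE GFF3 files do. The pattern is anchored on both sides so that GENEA does not also match GENEA2.

```shell
# Tiny made-up GFF3 file; spaces are converted to tabs to keep the example readable.
tmpdir=$(mktemp -d)
tr ' ' '\t' > "$tmpdir/toy.gff3" <<'EOF'
chr1 TOY gene 100 900 . + . ID=g1;gene_type=protein_coding;gene_name=GENEA
chr1 TOY transcript 100 900 . + . ID=t1;Parent=g1;gene_name=GENEA
chr1 TOY transcript 150 900 . + . ID=t2;Parent=g1;gene_name=GENEA
chr1 TOY gene 2000 2900 . + . ID=g2;gene_type=protein_coding;gene_name=GENEA2
chr1 TOY transcript 2000 2900 . + . ID=t3;Parent=g2;gene_name=GENEA2
chrX TOY gene 100 900 . - . ID=g3;gene_type=protein_coding;gene_name=GENEB
chrX TOY transcript 100 900 . - . ID=t4;Parent=g3;gene_name=GENEB
EOF

# Keep transcript rows whose attributes contain exactly gene_name=GENEA, then count them.
n_transcripts=$(awk -F'\t' '$3 == "transcript" && $9 ~ /(^|;)gene_name=GENEA(;|$)/' \
                "$tmpdir/toy.gff3" | wc -l | tr -d ' ')
echo "Transcripts for GENEA: $n_transcripts"

rm -rf "$tmpdir"
```

Without the anchors, a plain search for gene_name=GENEA would also pick up the GENEA2 transcript and over-count.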

Note

GYPA encodes glycophorin A, one of the most abundant molecules that gets expressed on the red cell surface. (There are about a million copies on each red cell, and about 25 trillion red cells in the body... that's a lot of glycophorin A!)

Clue 18: Counting genes

Using the same file, can you answer:

Question

What proportion of protein-coding genes lie on the X chromosome?

Write your answer (as numbers) to 4 decimal places in the grid.

Hint

A command like:

cat gencode.v41.annotation.gff3 | awk '$3=="gene"' | grep 'gene_type=protein_coding'

will find the rows you want. But now you need to count the number of them on each chromosome - how?
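Counting per chromosome and turning the counts into proportions can both be done inside awk, using an array keyed on the first column. The sketch below is one possible way, run on a tiny made-up GFF3 file (the file name toy.gff3, the gene names and the counts are all hypothetical), and it reports the proportion for chr2 rather than the chromosome the question asks about.

```shell
# Tiny made-up GFF3 file; spaces are converted to tabs to keep the example readable.
tmpdir=$(mktemp -d)
tr ' ' '\t' > "$tmpdir/toy.gff3" <<'EOF'
chr1 TOY gene 100 900 . + . ID=g1;gene_type=protein_coding;gene_name=GENEA
chr1 TOY gene 1000 1900 . - . ID=g2;gene_type=protein_coding;gene_name=GENEB
chr2 TOY gene 100 900 . + . ID=g3;gene_type=protein_coding;gene_name=GENEC
chr2 TOY gene 1000 1900 . + . ID=g4;gene_type=protein_coding;gene_name=GENED
chr2 TOY gene 2000 2900 . + . ID=g5;gene_type=lncRNA;gene_name=GENEE
chrX TOY gene 100 900 . - . ID=g6;gene_type=protein_coding;gene_name=GENEF
chrX TOY transcript 100 900 . - . ID=t1;Parent=g6;gene_type=protein_coding;gene_name=GENEF
EOF

# For protein-coding gene rows, count per chromosome and divide by the overall total.
table=$(awk -F'\t' '
  $3 == "gene" && $9 ~ /gene_type=protein_coding/ { total++; per_chrom[$1]++ }
  END {
    for (c in per_chrom)
      printf "%s\t%d\t%.4f\n", c, per_chrom[c], per_chrom[c] / total
  }' "$tmpdir/toy.gff3" | sort)
echo "$table"

# Pull out the proportion for one chromosome of interest.
prop_chr2=$(echo "$table" | awk -F'\t' '$1 == "chr2" { print $3 }')
echo "Proportion on chr2: $prop_chr2"

rm -rf "$tmpdir"
```

The filter on column 3 matters: transcript rows also carry gene_type=protein_coding, so counting every matching row would inflate the totals.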

Clue 19: Some seasoning

Nearly there! Now download the file 'season.txt' from this folder and explore it using the command line or using R.

Question

How many lines does season.txt contain? Write your answer in the grid.

Clue 20: making edits

You can run this clue in the command-line, in R, or in a text editor - up to you. Working with season.txt, make the following transformations:

Extract lines 6 and 7 of season.txt - store them in a new object or file for convenience.
Replace the word "Christmas" with "Xmas".
Replace "And" with "and".
Concatenate the two lines and print to screen.
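If you choose the command line, the steps above can be sketched with sed and paste. To avoid giving anything away, the sketch runs on a made-up file with different words: the file names toy.txt and picked.txt, the text, and the replacements "Morning" to "Morn" and "And" to "and" are all assumptions for illustration.

```shell
# Made-up eight-line text file standing in for season.txt.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.txt" <<'EOF'
one
two
three
four
five
Good Morning to you
And good night
eight
EOF

# Step 1: extract lines 6 and 7 into a new file.
sed -n '6,7p' "$tmpdir/toy.txt" > "$tmpdir/picked.txt"

# Steps 2-4: apply both replacements, then join the two lines with a space.
message=$(sed -e 's/Morning/Morn/' -e 's/And/and/' "$tmpdir/picked.txt" | paste -s -d ' ' -)
echo "$message"

rm -rf "$tmpdir"
```

Each s/old/new/ replaces only the first match on a line; add the g flag if a word could occur more than once.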

Question

Enter the last 7 words of the message in your grid.

Congratulations!

You have completed the quiz!

A message should appear in the shaded boxes!
