Warm up

Introduction

This part of the tutorial will look at files containing sequence data: FASTA files. These files are the main format used to hold genome assemblies for major organisms.

Two large consortia maintain genomic sequences: Ensembl, maintained by the European Bioinformatics Institute (EMBL-EBI), and UCSC, maintained by the University of California, Santa Cruz and funded by the NIH, alongside a few other initiatives such as ENCODE and RefSeq. Their data are largely interchangeable, but they use different coordinate indexing conventions and their own annotation pipelines. Here we will explore the sequence and annotation of chromosome 19 using the files provided by the Ensembl consortium.

Let's start by downloading a FASTA file containing the genomic sequence for this chromosome. To get it, run the following on the command line:

curl -O http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz

Note. In case of issues a backup of this file is available in this folder.

As before, let's decompress it and use ls to check the file size:

gunzip Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
ls -lh

Our file is reasonably big (about 57 MB) but, because we've only downloaded chromosome 19, it's not too big to work with.
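If you would rather keep the compressed copy on disk, the file can also be inspected without decompressing it first. The following is a minimal sketch; peek_gz is a hypothetical helper name introduced only for this example.

```shell
# Print the first few lines of a gzip-compressed text file without
# writing a decompressed copy to disk.
# peek_gz is a hypothetical helper name, not part of any tool.
peek_gz() {
    # $1 = path to the .gz file, $2 = number of lines (defaults to 5).
    # gzip -dc decompresses to standard output and leaves the .gz file in place.
    gzip -dc "$1" | head -n "${2:-5}"
}
```

For example, peek_gz Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz 3 would show the first three lines of the compressed download, assuming you still have the .gz file.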

Genome sequence files

The data is stored as a text file 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'. Here, 'GRCh38' refers to the name of the assembly - also known as 'build 38'. This is a recent version of the human genome assembly and is the same one that we'll use for our gene annotation data below.

Note

The coordinates of bases in the genome differ between different genome assembly builds!

Therefore, you must always know which genome assembly version you are working with. For human work, GRCh38 is a good choice. However, some datasets use the earlier GRCh37 build, whilst some now use something called the 'telomere-to-telomere' (T2T) assembly.

So what does a FASTA file look like? Let's look at the top and bottom of the file now:

head -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
tail -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'

This should print something like:

>19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNN

First, it shows the header line, which starts with the > character followed by the sequence name (in this case '19', because we are just looking at chromosome 19). Following the sequence name is some additional information about the sequence, essentially saying that it covers bases 1 to 58617616 of GRCh38 chromosome 19.
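The header can also be pulled apart programmatically. The sketch below assumes the colon-separated layout seen in the header above (coordinate system, assembly, sequence name, start, end, strand); describe_header is a hypothetical helper name, and headers from other sources may be laid out differently.

```shell
# Split an Ensembl-style FASTA header into its parts.
# describe_header is a hypothetical helper name used only for this sketch.
describe_header() {
    # Read only the first line of the file, which holds the header.
    head -n 1 "$1" | awk '{
        name = substr($1, 2)     # drop the leading ">" from the sequence name
        split($3, loc, ":")      # e.g. chromosome:GRCh38:19:1:58617616:1
        printf "name=%s assembly=%s start=%s end=%s\n", name, loc[2], loc[4], loc[5]
    }'
}
```

Running describe_header Homo_sapiens.GRCh38.dna.chromosome.19.fa is a quick way to confirm which assembly a file belongs to before combining it with other data.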

The following lines show the sequence data.

But do you notice something strange about the sequence of bases? They all seem to be N's. Why?

In fact, N in a FASTA file indicates an unknown base: it could be any of A, C, G or T. Even though we are using the main human genome assembly here, the assemblers are not sure what sequence of bases should go at the ends of the chromosome, in the telomeres.

Note

The main reason these bases are not given is that telomeres are highly repetitive and therefore hard to assemble. They are typically made up of thousands of repeats of a 'TTAGGG' motif, with minor variations.

If you want to see what's in those telomeres, try looking at the T2T assembly instead. (But not right now!)
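Rather than guessing where the run of N's ends, you can ask for the first line that contains a real base. This is a minimal sketch using awk; first_real_line is a hypothetical helper name introduced for this example.

```shell
# Report the line number of the first sequence line that contains
# something other than N.
# first_real_line is a hypothetical helper name used only for this sketch.
first_real_line() {
    # Skip header lines (starting with ">"), then stop at the first line
    # holding any character that is not N. NR is the current line number.
    awk '!/^>/ && /[^N]/ { print NR; exit }' "$1"
}
```

Calling first_real_line Homo_sapiens.GRCh38.dna.chromosome.19.fa prints a single line number, which tells you how far into the file you need to look before the sequence becomes informative.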

To see some actual bases, let's look further into the file. For example, look at lines 2,002-2,023:

head -n 2023 Homo_sapiens.GRCh38.dna.chromosome.19.fa | tail -n 22

That's better! What about further down the file?

head -n 970000 Homo_sapiens.GRCh38.dna.chromosome.19.fa | tail -n 22

Ok, lots of As, Cs, Ts and Gs here.
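The head and tail pipelines above select a block of lines by counting back from an end point. An alternative is to name the first and last line directly with sed. This is a sketch only; show_lines is a hypothetical helper name.

```shell
# Print lines first..last of a file.
# show_lines is a hypothetical helper name used only for this sketch.
show_lines() {
    # $1 = file, $2 = first line, $3 = last line.
    # -n suppresses the default printing; "p" prints the requested range and
    # "q" stops reading once the last requested line has been reached.
    sed -n "${2},${3}p;${3}q" "$1"
}
```

For example, show_lines Homo_sapiens.GRCh38.dna.chromosome.19.fa 2002 2023 should print the same 22 lines as the first head and tail pipeline above.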

Question

How many lines are there in the file in total? (Hint: use wc -l.)

So all is looking good except those chromosome ends: most of the file seems to be full of genuine sequence.
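To check that impression across the whole file rather than a few screenfuls, you can count how often each letter occurs. The following is a rough sketch; base_counts is a hypothetical helper name, and it only counts uppercase A, C, G, T and N.

```shell
# Count how many times each of A, C, G, T and N occurs in the sequence
# lines of a FASTA file.
# base_counts is a hypothetical helper name used only for this sketch.
base_counts() {
    for base in A C G T N; do
        # grep -v drops header lines; tr -cd deletes every character
        # except the current base; wc -c counts what is left.
        # The final tr strips the padding that some wc versions add.
        count=$(grep -v '^>' "$1" | tr -cd "$base" | wc -c | tr -d ' ')
        echo "$base $count"
    done
}
```

Running base_counts Homo_sapiens.GRCh38.dna.chromosome.19.fa reads the file once per letter, so it may take a few seconds, but it shows directly how the number of N's compares with the number of real bases.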