Exploring FASTA files
Introduction
This part of the tutorial will look at files containing sequence data: FASTA files. These files are the main format used to hold genome assemblies for major organisms.
Two large consortia maintain genomic sequences: Ensembl, maintained by the European Bioinformatics Institute, and UCSC, maintained by the University of California, Santa Cruz and funded by the NIH, alongside a few other initiatives, such as ENCODE and RefSeq. These are largely interchangeable but use different indexing strategies and their own annotation pipelines. Here we will explore files relating to chromosome 19 sequence and annotation using the files provided by the Ensembl consortium.
You can download the files from the Ensembl website (above), or use the command below, which will download them for you.
curl -O http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Note. In case of issues a backup of this file is available in this folder.
You will have to decompress the files before we start - run this in a terminal window now:
gunzip -k 'Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz'
Genome sequence files
The data is stored as a text file, 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'. GRCh38 is the name of the assembly - the current human reference assembly used by Ensembl.
RStudio is a multilingual environment: each code chunk you execute can be written in a different language. To tell RStudio which language you want to use, swap the default '{r}' at the beginning of the code chunk for '{sh}' to run the commands as if you were running them from the command prompt/terminal, or '{python}' to run them in Python. Refer to the RStudio documentation to find out more.
Let's use the shell commands head and tail to inspect the first and last 10 lines of the FASTA file:
head -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
tail -n 10 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
This should print something like:
>19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
And from tail:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNN
We have the header line, starting with the '>' character, which gives the sequence name, and the sequence itself below it.
But do you notice something strange? All the bases shown are Ns rather than the expected A, C, G, and T. Why? N denotes an unknown or ambiguous base. Genetic information is protected by long telomeres at either end of the chromosome, and in this assembly those ends are filled with Ns rather than resolved sequence.
Let's see how many lines our file contains using the command wc.
wc -l 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
This prints:
976962 Homo_sapiens.GRCh38.dna.chromosome.19.fa
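As a quick sanity check, the line count can be predicted from the header. This is a minimal sketch, assuming the sequence is wrapped at 60 bases per line (as in the output above) and that the chromosome length is the 58617616 given in the header line. The variable names are illustrative only.

```r
chrom_length <- 58617616  # End coordinate taken from the header line
line_width   <- 60        # Assumed number of bases per sequence line

sequence_lines <- ceiling(chrom_length / line_width)  # Full lines plus one partial line
total_lines    <- sequence_lines + 1                  # Add the header line
last_line_len  <- chrom_length %% line_width          # Bases left over on the final line

total_lines    # 976962
last_line_len  # 16
```

Both numbers agree with what we saw: wc reports 976962 lines, and the last line printed by tail holds 16 Ns.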
OK, we have about 980,000 lines. Let's pick an arbitrary range of lines and use awk to check whether we can see the expected sequences consisting of A, C, G, and T.
awk 'FNR>=1000 && FNR<=1020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
GATCACAGAGGCTGGGCTGCTCCCCACCCTCTGCACACCTCCTGCTTCTAACAGCAGAGC
TGCCAGGCCAGGCCCTCAGGCAAGGGCTCTGAAGTCAGGGTCACCTACTTGCCAGGGCCG
ATCTTGGTGCCATCCAGGGGGCCTCTACAAGGATAATCTGACCTGCAGGGTCGAGGAGTT
(etc.)
Note
This is an 'advanced use' of the awk command. We are using it to print lines 1000 to 1020 of the file (FNR is awk's counter for the current line number). Don't worry if you are not an awk expert - it's often better to work in R or another programming language for these kinds of tasks.
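For comparison, the same range of lines can be pulled out in R. This is a minimal sketch rather than part of the original workflow: read_line_range is a hypothetical helper, and a tiny made-up file stands in for the chromosome so the chunk runs anywhere.

```r
# Build a small stand-in FASTA file (made-up data, not the real chromosome)
toy_path <- tempfile(fileext = ".fa")
writeLines(c(">toy", "NNNN", "NNNN", "ACGT", "GGCC", "TTAA"), toy_path)

# Hypothetical helper: return lines `from` to `to` of a text file
read_line_range <- function(path, from, to) {
  lines <- readLines(path, n = to)               # Stop reading after line `to`
  if (from > length(lines)) return(character(0)) # Requested range starts past the end of the file
  lines[from:min(to, length(lines))]             # Keep only the requested range
}

read_line_range(toy_path, 3, 5)
# [1] "NNNN" "ACGT" "GGCC"
```

For the real file, the equivalent call would be read_line_range('Homo_sapiens.GRCh38.dna.chromosome.19.fa', 1000, 1020).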
Aha - we can see some non-missing bases now. What about further down the file?
awk 'FNR>=970000 && FNR<=970020' 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'
TAGAAGTATTAACTTATTTTGAGGGCTTAAAAAGGCTAGAAGTACTGATGTCCTTTTCCT
GAGTCCTGAAGTCATTCTAGCCATCAACCTCTGGAGAAATGCTGCTGGGGCCATTTTACC
ATGGGACCAGAAATACAAGTCCCTGACATGGGCTTGGCTGAGAAGAAGCAAGTGGGGTGC
AAACTATGTGTGCTTTCATGTTGCAAAGAAGCTGTGTTGAATCAACAAATATTACTTGAG
CACTTGCCAGGATTCCAGGTACTGTTCCAGGGCTGGATCACAGTGATGAGTGGGGCAGGT
(etc.)
Aha! All looks good - most of the file seems to be full of genuine sequence.
Let's now load the sequence into R's memory and take a closer look at it. We will skip the header and manipulate the original
object so that individual lines are joined together and then split by character. We'll use the scan command which just reads
data from a file:
fasta <- scan(
  'Homo_sapiens.GRCh38.dna.chromosome.19.fa',
  what = 'character', # Read each line as a character string
  skip = 1            # Skip the header line
)
scan returns one string per line of the file. Before starting, let's split those strings into individual bases and combine them into one long vector:
fasta <- strsplit(fasta, split = '') # Split each line into single characters (gives a list)
fasta <- unlist(fasta) # Flatten the list into one long character vector
fasta[1:10]
[1] "N" "N" "N" "N" "N" "N" "N" "N" "N" "N"
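To see what each step does, it can help to run the same three calls on a toy input. This sketch uses a made-up four-line FASTA record passed through scan's text argument instead of a file, so it runs without the chromosome download.

```r
# Made-up FASTA record: one header line followed by three sequence lines
toy <- c(">toy header", "NNAC", "GTGG", "CA")

# Same call as for the real file, but reading from `text` instead of a file
toy_lines <- scan(text = toy, what = "character", skip = 1, quiet = TRUE)
toy_lines
# [1] "NNAC" "GTGG" "CA"

toy_split <- strsplit(toy_lines, split = "")  # A list with one character vector per line
toy_bases <- unlist(toy_split)                # One flat vector of single bases

toy_bases
# [1] "N" "N" "A" "C" "G" "T" "G" "G" "C" "A"
```

Note that line boundaries are lost after unlist, which is what we want here: the line wrapping in a FASTA file carries no biological meaning.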
Let's see how many of each type of nucleotide are present in the sequence.
nucleotides <- table(fasta)
nucleotides
You should get something like this:
       A        C        G        N        T
15142293 13954580 14061132   176858 15282753
Finally let's create a barplot to visualise this:
barplot(nucleotides)

OK cool!
So, there are about 177,000 N nucleotides, a small minority (roughly 0.3%) of the 58.6 million bases in the file.
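To put the N count in context, the counts can be converted to percentages. This small sketch re-enters the counts printed above by hand as a named vector, so it runs without the full sequence in memory. With the real table, prop.table(nucleotides) should give the same proportions.

```r
# Counts copied from the table output above
counts <- c(A = 15142293, C = 13954580, G = 14061132, N = 176858, T = 15282753)

sum(counts)                            # Total number of bases: 58617616
percent <- 100 * counts / sum(counts)  # Share of each symbol, in percent
round(percent, 2)
```

The total matches the chromosome length given in the FASTA header, which is a useful check that the whole file was read.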
It would be interesting to know the telomere lengths of course. To find out, let's try to write a while-loop to measure the length of the first telomere:
n <- 0 # Initialise the N counter
while (
  n < length(fasta) &&  # Stop if we reach the end of the sequence
  fasta[n + 1] == 'N'   # Keep going (n + 1 takes the values 1, 2, 3, …) until we see a non-N base
) {
  n <- n + 1 # Add 1 to the count (initially 0)
}
n
[1] 60000
It looks like the first telomere is represented by 60 kb of Ns.
:::tip Note
The actual telomeres are composed of TTAGGG repeats and the number of ambiguous 'N' bases at chromosome ends is somewhat artificial, but it helps with bioinformatics applications, such as annotation and alignment.
:::
To check we got this right, let's print the sequence from 2 bases upstream to 2 bases downstream of the last N:
fasta[ (n-2) : (n+2) ]
[1] "N" "N" "N" "G" "A"
Bingo! Base 60000 is the last N character.
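A while-loop works, but R can also find these run lengths without an explicit loop. The sketch below uses a short made-up vector in place of fasta. match finds the first non-N base, and rle (run-length encoding) gives the length of the trailing run of Ns. Applying the same calls to the full chromosome vector is an assumption that has not been timed here.

```r
# Made-up sequence: 5 leading Ns, an internal N, and 3 trailing Ns
toy <- c(rep("N", 5), "G", "A", "T", "N", "C", rep("N", 3))

# Leading run: position of the first non-N base, minus one
first_base <- match(TRUE, toy != "N")  # NA if the vector contains only Ns
leading_n  <- first_base - 1

# Trailing run: length of the last run, if that run consists of Ns
runs <- rle(toy == "N")
trailing_n <- if (tail(runs$values, 1)) tail(runs$lengths, 1) else 0

leading_n   # 5
trailing_n  # 3
```

The internal N at position 9 does not affect either result, because only the first and last runs are inspected.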
Let's now take a look at the other, resolved bases, and check whether AT and GC are in roughly 50:50 proportion.
To do this, let's first convert our table object to a data frame for easier manipulation, and reshape it so that each nucleotide becomes a column:
nucleotides <- data.frame(nucleotides, row.names = rownames(nucleotides)) # Use the nucleotide letters as row names
colnames(nucleotides) <- c('nucleotide', 'count') # Add meaningful column names
nucleotides$nucleotide <- NULL # Drop the column that duplicates the row names
# Transpose and retain data frame structure:
nucleotides <- as.data.frame(t(nucleotides))
Now let's compute GC content:
GC <- (nucleotides$G + nucleotides$C) / (nucleotides$A + nucleotides$T + nucleotides$G + nucleotides$C)
AT <- 1 - GC
GC
You should get something like this:
[1] 0.4793865
AT
[1] 0.5206135
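The same calculation can also be done without reshaping into a data frame, by indexing the counts by name. This sketch re-enters the counts printed earlier as a named vector. The same indexing should work on the object returned by table(fasta), before it is converted to a data frame.

```r
# Counts copied from the table output earlier in this section
counts <- c(A = 15142293, C = 13954580, G = 14061132, N = 176858, T = 15282753)

# GC fraction among resolved bases only (Ns are excluded from the denominator)
gc_fraction <- sum(counts[c("G", "C")]) / sum(counts[c("A", "C", "G", "T")])
at_fraction <- 1 - gc_fraction

round(gc_fraction, 7)  # 0.4793865
round(at_fraction, 7)  # 0.5206135
```

Excluding N from the denominator matters little here, since Ns make up only about 0.3% of the file, but it keeps the two fractions summing to 1.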
Indeed, the ratio is approximately 50:50. Let's add some code to present our data in a nicer way and make a barplot to show the relative percentages. GC and AT are fractions, so we multiply by 100 to report them as percentages.
print( paste0('GC %: ', round(100 * GC, digits = 1)))
[1] "GC %: 47.9"
print( paste0('AT %: ', round(100 * AT, digits = 1)))
[1] "AT %: 52.1"
That's much better! And let's plot it:
barplot(100 * c(GC, AT), names.arg = c('GC %', 'AT %'))

The percentage of G and C bases is referred to as GC content and is an important QC diagnostic for sequencing runs. GC content differs between regions of the genome as well as between organisms. Regions under epigenetic control often contain CpG islands: GC-rich stretches, rich in CG dinucleotides, whose cytosine bases can be methylated.
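To see how GC content varies along a sequence, it can be computed in windows rather than over the whole chromosome. The sketch below is one simple way to do this, using non-overlapping windows on a made-up sequence. window_gc is a hypothetical helper, not a function from any package, and the window width of 4 is chosen only to keep the example readable.

```r
# Hypothetical helper: GC fraction in consecutive, non-overlapping windows.
# Ns are ignored; a window containing only Ns yields NA.
window_gc <- function(bases, width) {
  stopifnot(length(bases) >= width)
  starts <- seq(1, length(bases) - width + 1, by = width)
  vapply(starts, function(s) {
    win   <- bases[s:(s + width - 1)]
    known <- win[win != "N"]               # Drop unresolved bases
    if (length(known) == 0) return(NA_real_)
    mean(known %in% c("G", "C"))           # Fraction of G or C among resolved bases
  }, numeric(1))
}

# Made-up sequence: an N block, a GC-rich block, an AT-rich block, a mixed block
toy <- unlist(strsplit("NNNNGCGCATATGCAT", split = ""))

window_gc(toy, width = 4)
# [1]  NA 1.0 0.0 0.5
```

On real data a much larger window (thousands of bases) would be more typical, and the resulting vector could be passed to plot to show GC-rich and GC-poor regions along the chromosome.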