Loading a FASTA file
Introduction
The first part of the tutorial looks at files containing sequence data: FASTA files. FASTA is the main format used to hold genome assemblies for major organisms.
Two large consortia maintain genomic sequences: Ensembl, maintained by the European Bioinformatics Institute, and UCSC, maintained by the University of California, Santa Cruz and funded by the NIH. There are also a few other initiatives, such as ENCODE and RefSeq. These resources are similar, but they differ in the details: for example, they use different indexing strategies and run their own annotation pipelines. You therefore need to know the conventions of the source you are using, and be aware that files obtained elsewhere may differ.
Here we will explore human chromosome 19 sequence data and gene annotations using the files provided by the Ensembl consortium.
Let's start by downloading a FASTA file containing the genomic sequence for chromosome 19:
download.file("https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2024/data/Homo_sapiens.GRCh38.dna.chromosome.19.fa", "Homo_sapiens.GRCh38.dna.chromosome.19.fa")
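Before moving on, it can be worth checking that the download actually produced a usable file. The sketch below is not part of the tutorial's own code: `check_fasta_download` is a hypothetical helper name, and it only tests that the file exists, is non-empty, and begins with the `>` character that starts a FASTA header line. It also raises R's `timeout` option, which defaults to 60 seconds and can interrupt large downloads on slow connections; if the download above was cut short, rerunning it after raising the timeout may help.

```r
# Hypothetical helper (not part of the tutorial or of mscgm):
# basic sanity checks on a downloaded FASTA file.
check_fasta_download <- function(path) {
  # The file must exist and contain some data
  if (!file.exists(path)) return(FALSE)
  if (file.size(path) == 0) return(FALSE)
  # A FASTA file starts with a header line beginning with '>'
  first_line <- readLines(path, n = 1)
  length(first_line) == 1 && startsWith(first_line, ">")
}

# Optional: allow up to 10 minutes for later calls to download.file()
# (R's default timeout is 60 seconds).
options(timeout = max(600, getOption("timeout")))

# TRUE if the chromosome 19 file looks like a FASTA file, FALSE otherwise
check_fasta_download("Homo_sapiens.GRCh38.dna.chromosome.19.fa")
```

If this returns FALSE, the simplest fix is usually to delete the partial file and run the download again.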
Genome sequence files
The data is stored as a text file 'Homo_sapiens.GRCh38.dna.chromosome.19.fa'. Here, 'GRCh38' refers to the name of the assembly - also known as 'build 38'. This is a recent version of the human genome assembly and is the same one that we'll use for our gene annotation data below.
The coordinates of bases in the genome differ between different genome assembly builds!
Therefore, you must always know which genome assembly version you are working with. For human work, GRCh38 is a good choice. However, some datasets use the earlier GRCh37 build, whilst some now use the 'telomere-to-telomere' (T2T) assembly.
Let's now load the reference sequence into our R session and take a closer look at it. To help with this, we have created a small R package called mscgm that can be used to load the file into R. To install this package, run the following in your R session:
install.packages(
"https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2024/code/mscgm.tgz",
type = "source",
repos = NULL
)
You should see some messages about downloading and installing the package. At the end it should say something like:
* DONE (mscgm)
Congratulations! You have installed the mscgm package. Let's use it to load the FASTA data now:
fasta = mscgm::load_fasta_to_list( "Homo_sapiens.GRCh38.dna.chromosome.19.fa" )
This will print a number of messages and, hopefully, finish by saying 'success!'.
The function has returned an R list object. Let's just get the sequence for chromosome 19 out:
sequence = fasta[['19']]
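To make the idea of a list keyed by sequence name more concrete, here is a minimal base-R sketch of how a FASTA file could be read into such a list. This is not the mscgm implementation, whose internals are not shown here. The function name `read_fasta_sketch` is hypothetical, and the sketch assumes that each sequence is stored as one long string and that the sequence name is the first word after `>` on the header line.

```r
# Hypothetical base-R sketch; this is NOT how mscgm::load_fasta_to_list
# is necessarily implemented.
read_fasta_sketch <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")      # header lines start with '>'
  # Every header starts a new record, so the running count of headers
  # gives the record number that each line belongs to
  record <- cumsum(is_header)
  # Assumption: the sequence name is the first word after '>'
  ids <- sub("^>([^[:space:]]*).*$", "\\1", lines[is_header])
  # Group the sequence lines by record and join each group into one string
  seq_lines <- split(
    lines[!is_header],
    factor(record[!is_header], levels = seq_along(ids))
  )
  result <- lapply(seq_lines, paste, collapse = "")
  names(result) <- ids
  result
}

# Try it on a tiny made-up file rather than the full chromosome
toy <- tempfile(fileext = ".fa")
writeLines(c(">19 toy example", "ACGT", "TTGA", ">X another", "GGCC"), toy)
toy_fasta <- read_fasta_sketch(toy)
toy_fasta[["19"]]          # "ACGTTTGA"
nchar(toy_fasta[["19"]])   # 8
```

As in the real data, the toy sequence is looked up by its name ('19') using double square brackets. The sketch is meant for illustration only and has not been tuned for speed on whole chromosomes.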
So what does a FASTA file look like? Let's find out by moving on to the next section, 'Analysing sequence data in R'.