Step 1: What's in the fastq files?
Using UNIX to inspect the fastqs
The simplest thing to do is look at the data in the terminal. The command less
will do this for
us (actually we'll use zless
because the data is compressed). Try it now:
zless -S human/NA12878-read1.fastq.gz
You ought to be able to copy/paste the above into your terminal, or else type it directly. (Remember to use the <tab>
key to auto-complete filenames, which is much quicker than typing it all out!)
We covered use of less
in the Getting started with the command line and Analysing gene annotations on the command-line tutorials.
The -S
option tells zless
to stretch long lines off to the right of the screen rather than wrapping them. You can navigate using the arrow keys. You can also quit any time by pressing q
.
Now look at the structure of the fastq file. Each read occupies four lines in the file. For example the first read in the above file looks like:
@A00217:77:HFJWFDSXX:3:2477:19967:10739/1
GAGACCATCCTGGCTAACACGGTGAAATCTCATCTCTACTAAAAATACAAAAATTAGCCGGACGTGGTGGCAGGTGCCTGTAGTCCCAGCTACTCTGGAGGCTGAGGCAGGAGAATGGCGTGAACCCAGGAGGTGGAGCTTGCAGTGAGC
+
????????????????????????????????????????????????????????????????????????????????????????????????????????????'??????????????????+??????????????????????
...
In the above:
- The first line is a header line that identifies the read.
- The second line is the read itself
- the third line is just a
+
- the fourth line contains base qualities in "PHRED" encoding.
(More details on what this all means are given below, but this is the basic idea.)
To exit the file press 'q' to return to the command line prompt.
Warmup questions
Here are some initial questions to ask about the FASTQ files. Can you answer them? (Fear not - hints are below):
- Question 1 How does the header differ between the first read in the read1 file and the first read in the read2 file?
- Question 2: how many reads are in the file?
- Question 3: how long are the reads?
The answers to Questions 2-3 should let you calculate an important quantity - sequencing depth:
- Question 4: The data here all come from human chromosome 19. Given that chromosome 19 is about 58.6Mb long, can you calculate the average sequence "depth" or "coverage" (i.e. the number of reads that cover each position, on average across the chromosome?)
There are several ways to answer these questions, but the quickest and easiest involve using basic
UNIX command line tools - notably zcat
(which decompressed the file; use gzcat
on Mac OS), wc
, which counts lines or
characters in its input, and head
and tail
which isolate the top or bottom lines. Here we go:
- To look at the header of the first read you can use
head
(we also need to decompress the file usingzcat
). So:
zcat human/NA12878-read1.fastq.gz | head -n 1
zcat human/NA12878-read2.fastq.gz | head -n 1
Spot the difference?
- To count the number of reads you could do:
zcat human/NA12878-read1.fastq.gz | wc -l
This might take a minute or so to run as it is decompressing the whole file. The number output is the number of lines in the file - so what is the number of reads? Can you check it's the same number in both read1 and read2 files?
- To count the read length you could inspect the first read - which is on the 2nd line of the file:
zcat human/NA12878-read1.fastq.gz | head -n 2 | tail -n 1 | wc -c
The number output is the almost the number of characters in the second line of the file, i.e. the read length.
Warning wc -c
counts all characters in its input - which in this case will include the ending newline character!
(You can't see the newline character but it's how the computer knows it's the end of the line.)
Because of this to calculate the read length you need to subtract one. (Alternatively, you could of course add a tr
step to your pipeline, to delete the new line \n
character.)
So what's in the FASTQ file?
As described above, each FASTQ record spans four lines: the header, the read sequence, a separator, and the encoded base qualities. Here is some information on each of these:
The read header
The read header rows typically contain a whole bunch of information. For example:
@A00217:77:HFJWFDSXX:3:2477:19967:10739/1
This tells us the ID of the sequencer this was run on (A00217
), the run number (77
), the flowcell ID (HFJWFDSXX
), the lane number (3
), the 'tile' number on the lane (2477
), and the location of the cluster within the tile (19967:10739
). And finally it identifies whether it's read 1 or 2 (/1
or /2
).
(If you look in the second file in the pair, you should see exactly the same IDs in the same order,
except that the end /2
will be present indicating the second read.)
Note. The format of this information is not standard across platforms, and it changes depending on your data provider. Some other examples can be found on wikipedia or on the GATK read groups page.
The read sequence
The second row of each record contains the read bases themselves:
ATTCATGTACAATTCATTATATGGACATGTGCTTTCTTTCCTC...
These are the DNA bases as called by the sequencer, in the order they were sequenced. Simple!
Sounds obvious but the length of this line shows you the read length. So you can compute the read length by:
zcat file.fastq.gz | head -n 2 | tail -n 1 | tr -d '\n' | wc -c
The number output is the number of characters in the line (minus the ending \n
newline character) - that is, the read length.
The base qualities
For each of the above bases, the sequencer also emits an estimate of the base quality. These look a bit confusing at first:
B@DECEFEEGFEEHFGGEFFFFFEFFEEFEGHHGGGFFFAEFFFGGEFFH...
The base qualities represent the sequencer's estimate of how accurate its base call is for each base in the sequence. The scaling goes like this:
- First, the sequencer uses the underlying signal to estimate the probability the base call is wrong:
Obviously low probabilities are better than high ones!
- Next, the value is transformed to get the base quality:
Note. This transformation is known as 'PHRED scale' and you will see in lots of places in sequencing - including base qualities and alignment mapping qualities. It is named after Nobel Prize laureate Frederick Sanger, the inventor of Sanger sequencing.
- Finally, the base quality is represented as a single printable character in the file. To do that it is mapped onto the ASCII character range. (Visible characters start at 33, so this involves adding 33 to the actual base quality.)
It's this transformation that leads to this sequence of funny-looking characters in the 4th line of the fastq file.
If this maths looks confusing, here's a handy table to explain:
P(basecall is an error) | ...on PHRED scale | ... + 33 | ...as ASCII character |
---|---|---|---|
1 | 0 | 33 | ! |
0.1 | 10 | 43 | + |
0.01 | 20 | 53 | 5 |
0.001 | 30 | 63 | ? |
0.0001 | 40 | 73 | I |
In the data here, the base qualities for the first read are either ?
(for most bases), '
, or +
. What do these corresond to?
Well, the ASCII table gives these decimal values: ?
= 63, '
= 39, and +
= 43. Subtracting 33 gives base qualities of 30, 6, and 10 respectively. Thus, the sequencer is very confident (1 in 1,000 chance of error) about the base calls for most bases, but a few have much higher chance of error.
It is important to realise that these qualities are only the values estimated by the sequencer. They are generally pretty good but can be improved by re-estimation (but that's beyond the scope of this tutorial.)
Next step
That's all there is to FASTQ format! When you're ready to move on, continue the practical (go on to step 2).