Skip to main content

Step 4: Inspecting the alignments

If you followed the alignment steps we generated various files:

  • a file tmp/NA12878-aligned.sam which contains the alignments in the SAM text file format.

  • a binary version tmp/NA12878-aligned.bam of the same thing, in BAM format.

  • a coordinate-sorted version of that, in tmp/NA12878-sorted.bam

  • a final version of the alignments with duplicate reads 'marked' in NA12878.bam

  • and a corresponding index file NA12878.bam.bai.

We put the key output files in our main directory. Now let's take a look at them.

The contents of a SAM file

If you look at the unsorted SAM file (e.g. by typing less -S tmp/NA12878-aligned.sam - press q when you want to quit) you'll see something like this:

@SQ     SN:chr19        LN:58617616
@HD VN:1.5 SO:unsorted GO:query
@PG ID:bwa PN:bwa VN:0.7.18-r1243-dirty CL:bwa mem -t 2 -o tmp/NA12878-aligned.sam GRCh38_chr19.fa NA12878-read1.fastq.gz NA12878-read2.fastq.gz
A00217:77:HFJWFDSXX:3:2477:19967:10739 83 chr19 49899627 60 150M = 49899379 -398 GCTCACTGCAAGCTCCACCTCCTGGGTTCACGCCATTCTCCTGCCTCAGCCTCCAGAGTAGCTGGGACTACAGGCACCTGCCACCACGTCCGGCTAATTTTTGTATTTTTAGTAGAGATGAGATTTCACCGTGTTAGCCAGGATGGTCTC ??????????????????????+??????????????????'???????????????????????????????????????????????????????????????????????????????????????????????????????????? NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:90
A00217:77:HFJWFDSXX:3:2477:19967:10739 163 chr19 49899379 60 150M = 49899627 398 GGGTGGTTGTTGTTGTTGTTTTGAGAGATGGGGTCTTGCTGTGTCACTCAGGCTAGAGCACAGTGGCACAATCTCGGCTCACTGCAGCCTCAACCTCCCGGGCTCAAGCAATCCTCCCACCTCAGCCTCCCTAGTAGCTGGGACTACAGG ???????????????????????????????????????????????5?????????????????????????????????????????????????????????????????????????????????????????????????+???? NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:38
A00296:43:HCLHLDSXX:3:2269:2076:2143 99 chr19 24377613 60 150M = 24378035 572 ACACTGTTTTTGTAGAATCTGGGAGGGGATACTTTGAGCACAGGGATAAATTTGAGTGCTTTGTGGCCTATGTTGAAAAAGAAATATCTTCAGATAAAAACTAGAGAGAAGATTTCTGAGAAACTGCTTTGCGATGTGTGCATTCTTCTC ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NM:i:1 MD:Z:131T18 MC:Z:150M MQ:i:60 AS:i:145 XS:i:32
A00296:43:HCLHLDSXX:3:2269:2076:2143 147 chr19 24378035 60 150M = 24377613 -572 AATATCTACAGATAAAAAGTAGAAAGAAGTTTTCTGAGAAACTACTTTGTCATGTGTGCATTCAACCCACAGAATTAAACTTGTTTTTCAATTAAGCAGTTTGCAAACACTGTTTATATAGAAGATGCGATAGGATATTTGGGAATGCTT ???+???+??????????????????+??+?????????????????????+?????????????????????????????????????????????????????????????????????????????????????????????????? NM:i:3 MD:Z:16T8T105G18 MC:Z:150M MQ:i:60 AS:i:135 XS:i:49
A00217:77:HFJWFDSXX:4:1418:15718:28150 83 chr19 49358091 60 150M = 49357824 -417 AGGTGTGGTGGTGCGCGCCTGTAGTCCCAGCTACTCAGGAGGCTGAAGCAAGAGAATCGCTTGAACCCGGGAGGCGGGGCTTCCAGTGAGCCAAGATCACGCCACTGCACTCCAGCCTGGGCAACAGAGCGAGACTCCGTCTCAAAAAAA 5????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:66
A00217:77:HFJWFDSXX:4:1418:15718:28150 163 chr19 49357824 60 150M = 49358091 417 GTGGTGGCTCACGCCTGTACTCCCAACACTTTGGGAGGCCTAGATGGGCAGATCACTGGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCTGACTCTCCTAAAAACACAAAAATTAGGGCTGGGCTCAGTGGCTCAGA ????????5???'????+?+?????????????+??++????+???????+??????????++???????+??????+??+?????55??'??5+???????5++????+55'?+??????+?+?+???++????+????+??+??+5?+ NM:i:4 MD:Z:19A84T4A39C0 MC:Z:150M MQ:i:60 AS:i:134 XS:i:76
A00217:77:HFJWFDSXX:1:2129:3730:28401 83 chr19 41555959 60 150M = 41555709 -400 AATGCCTGGTACTCACAAAACCAAAGCAATCAGCCAAGATGAAATAAAAGAGAGCAGAGCTGCAGACCTGGAGGAGCCTGTCCCTGACTCTGGACTCCACAAGGAGAACAGGAGACCCCCAAAAGGGGTGAGTGGGATCTTTTTGTGTAT ??????????????????+?????5????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:0
...

The file consists of some metadata (lines starting with @, which record information about the sequences in the reference assembly and the program that was run to generate the alignments) followed by the alignments themselves.

The alignment columns rows consist of

  • The read ID (straight from the fastq file), like A00217:77:HFJWFDSXX:3:2477:19967:10739
  • Some flags encoded as an integer. The best way to understand these is through the "Explain SAM flags" webpage. We'll go through some examples below.
  • The chromosome and position at which the read aligns, e.g. for the first read this is chr19 and 49899627.
  • The mapping quality, here given as 60 - more details on this below.
  • The CIGAR string which gives some detail about how the read aligns - 150M for all these alignments.
  • The chromosome and position of the other read in the pair if it's aligned. (Note. The chromosome is often =, which means the same chromosome as this read, which you'd expect since they both came from the same fragment. A * means the other read was not aligned).
  • The 'template length' (i.e. how long was the span of the read pair on the reference contig?
  • The read bases and base qualities. These are straight from the fastq, but might be reverse-complemented if necessary so they are in the same order as the reference bases.
  • Finally there are some optional fields. Among these are NM (number of mismatches between the read and the reference), and MD (a kind of counterpoint to the CIGAR string in column 6).

How many alignments are there?

Our data had 25 million read pairs in the fastq files, i.e. 50 million reads in total. Are there the same number of alignments?

You can use the UNIX wc command to count them:

samtools view tmp/NA12878-aligned.sam | wc -l

Note. The BAM file is exactly the same, it's just encoded in a binary format. So this also works and is slightly quicker:

samtools view tmp/NA12878-aligned.bam | wc -l

You should see that there are slightly more alignments than reads.

The reason for this is that one read (in the fastq file) might correspond to more than one alignment in the SAM/BAM file. In technical terms,

  • sometimes a read alignment is split into two records, because one part of the read aligns in one place and the rest somewhere else. In this case one alignment is called primary and the other one supplementary

  • It's also possble that bwa spits out alternate alignment locations for some reads. These are known as secondary alignments.

What types of alignment are there?

To get a better sense of the types of alignment let's use samtools flagstat. This counts reads according to the flags column. Run it like this:

samtools flagstat tmp/NA12878-aligned.sam 

From the output you should see:

  • Of 2.5 million reads ("5000000 + 0 primary"), only 4,999,987 were actually aligned by bwa ("primary mapped"). So there were 13 reads that weren't aligned at all.
  • Most of these (4,999,974) were were aligned in pairs.
  • Most of these (4,999,230) were 'properly paired' - that is, both reads mapped to the same chromosome, in the right orientation (i.e. facing each other) and about the right distance apart.
  • But a small subset (744 in total) had unmapped mate, mate on a different chromosome, or not close or in the wrong orientation to the original read.

Note. A much more detailed view of reads can be generated by running samtools stats. See more on the samtools stats documentation page.

You can move straight on to Step 5 now or if interested in more details about the alignment output, read on!

More on the alignment flags

The flagstat command is looking at the alignment flags - which is the second column in the SAM file.

The easiest way to understand the flags is to look them up on the Explain SAM Flags page.

For example, for the first alignment above the flags value is 83, which corresponds to Read paired, Read mapped in proper pair, Read reverse strand, and First in pair. The second row in the above corresponds to the alignment of the other read in the pair (how do I know this?) which has flags 163 (you can look this up as well.)

If you have time, try looking up the flags for a few other reads now.

Important terminology

A primary alignment is bwa's 'best guess' as to where a read aligns. These are usually the most important alignments to look at.

Non-primary alignments are identified by either the Supplementary alignment or the Not primary alignment flags.

Interpreting the mapping qualities

Each alignment in the file comes with a mapping quality (5th column in the SAM/BAM file) which reflects how confident bwa is about its alignment location. We are most interested in alignments with high mapping qualities (like 60, which is a very confident alignment.)

Like base qualities, mapping qualities are expressed on the PHRED scale. They are defined as the aligner's estimate of the probability

P(this read alignment location is wrong)P(\text{this read alignment location is wrong})

Just like for base qualities, this is expressed on the PHRED scale so the actual numbers are:

mapping quality=10log10(P(this read alignment location is wrong))\text{mapping quality} = -10 \log_{10} \left( P( \text{this read alignment location is wrong}) \right)

That gives this relationship:

mapping qualityEstimated probability alignment is wrong
0100%
1010%
201%
300.1%
400.001%
500.0001%
600.00001%

A mapping quality of zero translates to an (approximate) 100% probability that the read aligns elsewhere, that is, that bwa is not confident at all in this alignment. On the other hand, a mapping quality of 60 is very confident alignment.

Question

As you can see, bwa is very confident about the alignments in the excerpt above - they all have mapping quality 60.

If you scroll through the file, can you see reads with lower mapping qualities? What is their misalignment probabilty from the above table?

Note. Like base qualities, the mapping quality is only an estimate of the probability of misalignment. It might not be well calibrated.

So what did markdup do?

Question what did the samtools markdup step do?

Hint 1. Try viewing the read A00296:43:HCLHLDSXX:1:1569:22191:30154 in the coordinate-sorted and final (post-markdup) BAM. E.g.:

samtools view tmp/NA12878-sorted.bam  | grep 'A00296:43:HCLHLDSXX:1:1569:22191:30154'

and

samtools view NA12878.bam | grep 'A00296:43:HCLHLDSXX:1:1569:22191:30154'

Can you spot the difference? (The Explain SAMTOOLS flags page might help.)

Hint 2. Try running samtools flagstat on these two files. What's different?

Important! Cleaning up

Important! Cleaning up

Our pipeline has left behind several copies of the original data that (now that we've had a look at them) we won't need again. They are wasting space so we should delete them.

You can see them by looking in the temporary directory we created before:

ls -lh tmp
du -ch tmp

You will see there's something like 3.4Gb of space essentially wasted. Now is a good time to get rid of these!

Let's use the rm and rmdir commands to do this. Before using rm you should remember that is is a blunt tool: it deletes files immediately, and permanently, often without giving you a chance to change your mind! A good way to make sure what you will delete is to use ls first:

ls tmp/*

This looks ok, so let's use rm to delete those same exact files:

rm tmp/*

Finally let's delete the tmp folder with rmdir:

rmdir tmp

Next steps

Now that we have an aligned set of reads, we can move on to viewing the read pileups