Step 3: Aligning reads

Most analyses of short-read sequence data are based on a basic paradigm made up of two steps:

first, align reads to an established reference genome assembly.
then use those alignments to identify and call genetic variation for analysis.

In this tutorial we will focus on the first step - getting reads aligned - and visualise these to get a sense of the output. (Further work with variant calling will be covered in CM4.8).

In this section, we'll implement the basic read alignment pipeline that starts with the fastq files and ends with alignments for downstream analysis.

The basic pipeline

The basic pipeline is:

align reads to the reference genome assembly
sort them so they are in genome position order
'mark' reads that look like duplicates (since these are likely to be artifacts)
Finally we'll gather some statistics about the alignments.

In theory that's it - in practice it gets a bit more fiddly because of the need for various format conversion and indexing steps. (Some data might also need a read trimming step but we're skipping that.)

Without further ado let's implement these steps using a popular aligner bwa mem and the swiss army knife-like samtools to do the work. We are using data for human chromosome 19 and aligning to the human reference sequence GRCh38_chr19.fa.

To get started, move to a terminal window and make sure you are in the sequence_data_analysis/human folder:

cd ~/sequence_data_analysis/human

The pipeline can be run like this:

1. Make a temporary directory

# Make a temp dir to hold intermediate files
mkdir -p tmp

2. Align reads

# Create a BWA index for the reference
bwa index GRCh38_chr19.fa

# align the reads (using 2 threads - don't use more please!)
bwa mem -t 2 -o tmp/NA12878-aligned.sam GRCh38_chr19.fa NA12878-read1.fastq.gz NA12878-read2.fastq.gz

Warning

This bwa alignment step is aligning millions of read pairs to the reference genome - it might take quite some time up run. (Maybe 10 minutes or so - we are, after all, analysing real high-throughput sequence data.)

Before doing anything else, watch the output for a bit to make sure it is running. You should see some lines of output like this:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 133334 sequences (20000100 bp)...
[M::process] read 133334 sequences (20000100 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 60998, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (372, 434, 506)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (104, 774)
[M::mem_pestat] mean and std.dev: (442.02, 99.09)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 908)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 133334 reads in 23.983 CPU sec, 12.191 real sec
[M::process] read 133334 sequences (20000100 bp)...

Once you are happy things are running well, now is a good time to go and read more about the theory and practice of paired-end sequencing. (Or feel free to take a break for a few minutes.)

Unfortunately that's not the end of the pipeline! The results aren't yet in the best file format, are still in read order (instead of chromosome/position order) and we haven't done the necessary work to identify possible read duplicates.

2. Post-processing of reads

Let's do the final steps now - please note each of these steps can take a minute or so to complete.

First, samtools view can be used to convert data to the more compressed BAM binary format.

samtools view -b -o tmp/NA12878-aligned.bam tmp/NA12878-aligned.sam

We'll also 'fix matepair information' which adds a bit of information to each alignment which is used further down the pipeline:

samtools fixmate -m tmp/NA12878-aligned.bam tmp/NA12878-fixmate.bam

3. Sort reads by genomic position

samtools sort will sort reads by their genomic location:

samtools sort -T tmp -o tmp/NA12878-sorted.bam tmp/NA12878-fixmate.bam

4. Identify duplicate reads

samtools markdup can be used to identify probably fragments that were sequenced more than once:

samtools markdup -s tmp/NA12878-sorted.bam tmp/NA12878-markdup.bam

Question

Look at the output from this step. How many read pairs did samtools think were likely duplicates? (DUPLICATE PAIR is the thing to look at). Does this look compatible with your fastqc results?

5. Index the reads

Finally we will rename the final output file to 'NA12878.bam' and index the reads (so they can be quickly looked up by position):

mv tmp/NA12878-markdup.bam NA12878.bam
samtools index NA12878.bam

Done!

Phew! If you run all that it might take around 10-15 minutes to complete (the longest step being the bwa mem indexing and alignment steps.) It's worth running the steps one by one to make sure they complete without errors, and watching the output to see what is being done.

Tip

You can use ls to see what was created in the tmp/ folder. We'll get into what's in those files below, but for the moment note that you can only view the .sam file with less.

(And they are big files, so don't try to open in nano or another text editor!)

The other files are in a binary format which will look like nonsense if you open them in less... we'll see how to view these in a moment.

Next steps

Congratulations! You now have an aligned set of reads ready for analysis.

Now go and insepct the alignment output.

Step 3: Aligning reads

The basic pipeline​

1. Make a temporary directory​

2. Align reads​

2. Post-processing of reads​

3. Sort reads by genomic position​

4. Identify duplicate reads​

5. Index the reads​

Done!​

Next steps​