Step 3: Aligning reads
Most analyses of short-read sequence data are based on a basic paradigm made up of two steps:
- first, align reads to an established reference genome assembly.
- then use those alignments to identify and call genetic variation for analysis.
In this tutorial we will focus on the first step - getting reads aligned - and visualise these to get a sense of the output. (Further work with variant calling will be covered in CM4.8).
In this section, we'll implement the basic read alignment pipeline that starts with the fastq files and ends with alignments for downstream analysis.
The basic pipeline
The basic pipeline is:
- align reads to the reference genome assembly
- sort them so they are in genome position order
- 'mark' reads that look like duplicates (since these are likely to be artifacts)
- Finally we'll gather some statistics about the alignments.
In theory that's it - in practice it gets a bit more fiddly because of the need for various format conversion and indexing steps. (Some data might also need a read trimming step but we're skipping that.)
Without further ado let's implement these steps using a popular aligner bwa mem
and the swiss
army knife-like samtools
to do the work. We are using data for human chromosome 19 and aligning to the human reference sequence GRCh38_chr19.fa
.
To get started, move to a terminal window and make sure you are in the
sequence_data_analysis/human
folder:
cd ~/sequence_data_analysis/human
The pipeline can be run like this:
1. Make a temporary directory
# Make a temp dir to hold intermediate files
mkdir -p tmp
2. Align reads
# Create a BWA index for the reference
bwa index GRCh38_chr19.fa
# align the reads (using 2 threads - don't use more please!)
bwa mem -t 2 -o tmp/NA12878-aligned.sam GRCh38_chr19.fa NA12878-read1.fastq.gz NA12878-read2.fastq.gz
This bwa
alignment step is aligning millions of read pairs to the reference genome - it might take quite some time up run.
(Maybe 10 minutes or so - we are, after all, analysing real high-throughput sequence data.)
Before doing anything else, watch the output for a bit to make sure it is running. You should see some lines of output like this:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 133334 sequences (20000100 bp)...
[M::process] read 133334 sequences (20000100 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 60998, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (372, 434, 506)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (104, 774)
[M::mem_pestat] mean and std.dev: (442.02, 99.09)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 908)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 133334 reads in 23.983 CPU sec, 12.191 real sec
[M::process] read 133334 sequences (20000100 bp)...
Once you are happy things are running well, now is a good time to go and read more about the theory and practice of paired-end sequencing. (Or feel free to take a break for a few minutes.)
Unfortunately that's not the end of the pipeline! The results aren't yet in the best file format, are still in read order (instead of chromosome/position order) and we haven't done the necessary work to identify possible read duplicates.
2. Post-processing of reads
Let's do the final steps now - please note each of these steps can take a minute or so to complete.
First, samtools view
can be used to convert data to the more compressed BAM binary format.
samtools view -b -o tmp/NA12878-aligned.bam tmp/NA12878-aligned.sam
We'll also 'fix matepair information' which adds a bit of information to each alignment which is used further down the pipeline:
samtools fixmate -m tmp/NA12878-aligned.bam tmp/NA12878-fixmate.bam
3. Sort reads by genomic position
samtools sort
will sort reads by their genomic location:
samtools sort -T tmp -o tmp/NA12878-sorted.bam tmp/NA12878-fixmate.bam
4. Identify duplicate reads
samtools markdup
can be used to identify probably fragments that were sequenced more than once:
samtools markdup -s tmp/NA12878-sorted.bam tmp/NA12878-markdup.bam
Look at the output from this step. How many read pairs did samtools
think were likely duplicates? (DUPLICATE PAIR
is the thing to look at).
Does this look compatible with your fastqc results?
5. Index the reads
Finally we will rename the final output file to 'NA12878.bam' and index the reads (so they can be quickly looked up by position):
mv tmp/NA12878-markdup.bam NA12878.bam
samtools index NA12878.bam
Done!
Phew! If you run all that it might take around 10-15 minutes to complete (the longest step being
the bwa mem
indexing and alignment steps.) It's worth running the steps one by one to make sure
they complete without errors, and watching the output to see what is being done.
You can use ls
to see what was created in the tmp/
folder.
We'll get into what's in those files below, but for the moment note that you can only view the .sam
file with less
.
(And they are big files, so don't try to open in nano
or another text editor!)
The other files are in a binary format which will look like nonsense if you open them in less
... we'll see how to view these in a moment.
Next steps
Congratulations! You now have an aligned set of reads ready for analysis.
Now go and insepct the alignment output.