Dataset A: Illumina paired-end
File URL: https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/sequence_data_sightseeing_tour/illumina.bam
(For the index URL, append .bai to the file URL.)
You should be familiar with this type of data by now, as it's what the earlier practical was all about. The data consists of 150bp paired-end reads generated on the Illumina NovaSeq 6000 platform.
This is a good point to try a few options to get used to IGV.
For paired-end data, try the 'view as pairs' option. It can be found in the context menu, obtained by right-clicking anywhere on the track.
Try clicking on a read. What is all that info? Where does it come from?
Try moving around and zooming in/out.
If you don't like how reads are displayed, try 'collapsed', 'expanded', or 'squished' from the context menu. Which one do you like best?
A reminder that all data in this sightseeing tour is included strictly for training purposes - it is not publicly available data. Please do not share it outside this course. Contact me (Gavin Band) if you have any queries about this.
Questions
Can you find:
- a heterozygous SNP? A homozygous SNP?
- an INDEL (short insertion or deletion variant)?
- a sequencing error?
How does the secretor status SNP look in IGV? (It's at chr19:48703417.)
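If you want to work with IGV-style locus strings outside IGV, the following is a minimal Python sketch of how such a string could be split into a chromosome and coordinates. The function name `parse_locus` is made up for this illustration, and the sketch assumes the coordinates are 1-based and inclusive at both ends, as displayed in the IGV locus box. Commas inside the numbers are treated as thousands separators and dropped.

```python
def parse_locus(locus):
    """Split a locus string like 'chr19:48703417' or 'chr1:1,000-2,000'.

    Returns (chromosome, start, end). A single position is returned
    with start == end. Hypothetical helper, for illustration only.
    """
    chrom, _, coords = locus.partition(":")
    coords = coords.replace(",", "")  # drop thousands separators
    if "-" in coords:
        start_text, end_text = coords.split("-")
        start, end = int(start_text), int(end_text)
    else:
        start = end = int(coords)
    return chrom, start, end


chrom, start, end = parse_locus("chr19:48703417")
# Assuming 1-based inclusive coordinates, the width is end - start + 1.
print(chrom, start, end, "width:", end - start + 1)
```

The same helper handles ranges with commas, for example `parse_locus("chr1:1,000-2,000")`.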
Poorly alignable regions
Short reads don't always align very well to the genome.
One reason for poor alignment is that the underlying DNA is repetitive - that is, the same read could equally well have come from several different locations in the genome. The aligner flags this uncertainty by giving such reads low mapping qualities.
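To make this concrete, here is a small Python sketch. The SAM format defines mapping quality (MAPQ) as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, so the first function simply inverts that formula. The second function counts exact matches of a read in a toy reference. The reference and reads are invented for this illustration, and real aligners use much more sophisticated models to estimate mapping quality.

```python
def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(P(mapping position is wrong))."""
    return 10 ** (-mapq / 10)


def count_exact_placements(read, reference):
    """Count (possibly overlapping) exact matches of read in reference."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


# Toy reference: a CAG repeat followed by unique sequence (made up).
reference = "CAGCAGCAGCAGTTGACCTA"
repeat_read = "CAGCAG"   # lies entirely within the repeat
unique_read = "TTGACC"   # lies in the unique part

print("repeat read placements:", count_exact_placements(repeat_read, reference))
print("unique read placements:", count_exact_placements(unique_read, reference))

for mapq in (0, 20, 60):
    print("MAPQ", mapq, "-> P(wrong) =", mapq_to_error_probability(mapq))
```

A read with several equally good placements cannot be assigned confidently to any one of them, which is why reads in repetitive sequence tend to have mapping qualities near zero.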
Another cause of poor read alignment is genome structural variation - that is, the sample genome has a different structure from the reference genome.
Try pointing IGV at this locus - can you make out what is going on?
chr19:48,669,128-48,669,589
Hint: try zooming in until you can see the base-level information in the genome sequence. Do you notice anything?
Repetitive regions are hard for the aligner, but they are also hard for the DNA replication machinery, which, just like the aligner, can 'misalign' two copies of DNA during replication, leading to genomic structural variation.
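One simple kind of repetitive sequence is a tandem repeat, where a short unit is repeated back to back. As a rough illustration, the Python sketch below uses a regular expression to find such runs in a sequence. The function name `find_tandem_repeats`, its default thresholds, and the toy sequence are all assumptions made for this example. It is not the actual sequence at the locus above, and it is not how dedicated repeat-finding tools work.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Find runs where a short unit is repeated back to back.

    Returns a list of (start, unit, copies) tuples, with 0-based start.
    Hypothetical helper, for illustration only.
    """
    # Capture a unit of min_unit..max_unit bases (shortest first),
    # then require at least min_copies - 1 further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    repeats = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        repeats.append((match.start(), unit, copies))
    return repeats


toy_sequence = "GGTT" + "AC" * 5 + "TTGG"  # made-up sequence
for start, unit, copies in find_tandem_repeats(toy_sequence):
    print("unit", unit, "x", copies, "starting at offset", start)
```

A read that falls entirely inside such a run looks the same wherever it is shifted by one repeat unit, which is the ambiguity that troubles both the aligner and the replication machinery.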
Next steps
When you're ready, move on to Dataset B.