Skip to main content

Variant calling and genotyping with bcftools

In the earlier session we focussed on a single sample and investigate typical steps used to get from the data that comes from a sequencer (the fastq files) to a good set of read alignments. You should have viewed some alignments and been able to see evidence of genetic variation.

In this section we will look at analysing multiple samples using a statistical model that does the work of finding genetic variants and genotyping them for us.

Calling variants and genotypes

Before starting let's have a look at the data. First make sure you are in the right folder:

cd ~/variant_calling

Use the command-line or the panel on the left to look at what is in the data folder.

  • In the reads/ folder you will see BAM files from 164 human samples. The reads are all from one part of the genome, around the gene FUT2, which lies on chromosome 19. These samples all come from the 1000 Genomes Project, which is a publicly accessible resource of genomic data from worldwide populations.
  • In the top-level folder you will also see a FASTA file GRCh38_chr19.fa.gz containing the human reference genome assembly of chromosome 19.

Let's use this data now to generate some variant and genotype calls inside FUT2.

Variant calling

Conceptually genotype calling with bcftools has two steps:

  1. The first step looks at each sample seperately and compute genotype likelihoods for each possible genotype at each site. (A heuristic is used to decide what the 'possible' genotypes are, by looking across samples.)

  2. The in the second step we use this information to call genotypes.

First make sure you are in the right folder:

cd ~/variant_calling

Now compute genotype likelihoods using the bcftools mpileup command:

bcftools mpileup -Oz -f GRCh38_chr19.fa.gz -o GWD_FUT2_pileup.vcf.gz reads/*.bam

This will take a minute or so to run. Watch the screen output to see what bcftools is doing. The mpileup command computes the read pileup for each sample at each position in the region, and outputs a 'VCF' file with this information in.

Note

The -Oz argument tells bcftools we want to output gzipped VCF format (i.e. compressed variant call format.)

Now let's use this with the bcftools call command to call variants and genotypes:

bcftools call -m -v -Oz -o GWD_FUT2_calls.vcf.gz GWD_FUT2_pileup.vcf.gz

This command reads in the genotype likelihoods computed in the earlier step, and produces genotype calls across samples.

Note

If you want to see what the command-line options do, run bcftools call without any options:

bcftools call

Here we have used -m - "use the multiallelic caller", and -v "Output only variant sites."

Inspecting the pileup output

Both the pileup and call steps output a Variant Call Format (VCF) file. What's the difference? Look first at the pileup file using zless:

zless -S GWD_FUT2_pileup.vcf.gz

The first lines in the file (the ones starting with a # character) are all metadata. They're useful but let's skip them for now. The easiest way to do this is search for 'CHROM' which you can do by pressing / (forward slash) followed by CHROM, and then press enter. You should see something like this:

img

Feel free to scroll around a bit. A few things to note:

  • bcftools mpileup has looked at every site in the region (from 48693822 to 48708100, as it happens) and assessed the evidence for variation at each one.

  • The 4th and 5th columns tell you what alleles mpileup thinks are present in the reads. The 4th column is the reference sequence base, and the 5th is are the other alleles seen (with <*> standing for 'anything else').

  • The 8th column is called INFO and contains a bunch of information about the variants across all samples. Exactly what is in there is described in the metadata, but for example you should be able to see values called DP. This is the total depth of reads observed across all samples at the site. Other metrics similarly capture properties of the read pileup.

  • The per-sample data itself starts way over in the 10th column. The values are PHRED-scaled genotype likelihoods for each possible genotype for the given alleles.

Note. If there are kk alleles listed, then there are (k2)k \choose 2 possible genotypes (for a diploid human individual). This is why some rows have many more values than others - more alleles were seen in the reads.

Note. The PHRED-scaled likelihood (PL) for a particular genotype gg is computed as

PL(g)=10log10(P(read datagenotype=g))\text{PL}(g) = -10 \log_{10}\left( P\left( \text{read data} | \text{genotype} = g \right) \right)

The simplest reasonable likelihood function in the bracket here would be something proportional to a binomial likelihood, but in practice bcftools uses a more complex likelihood that allows for error dependency between reads. (One motivation is that an error might occur as an alignment error or be sequence context-dependent, so it makes sense to allow for this.)

Example. For the first site and first sample, the PLs are 0, 3, 26. That translates to likelihoods of about 1, 0.5, and 0.0025. pileup is not very certain about what this genotype is!

If you look at the sample using tview you'll see why:

samtools tview -p chr19:48693822 reads/HG02461.final_FUT2.bam

Only two reads cover this position in this sample (the position is right at the end of the region that has been extracted). Both reads have A bases, but more reads would be needed to be really confident this is a homozygous A.

Inspecting the calls output

Now look at the calls file:

zless -S GWD_FUT2_calls.vcf.gz

Again skip past the metadata by typing /CHROM and pressing ENTER. You should see something like this:

img

(NB. I've shrunk the INFO data in the above image to make it easier to see - your file will no doubt have a bunch of stuff in there).

A few points to note:

  • The file only has a subset of sites in it - those where bcftools thinks there is genuine variation.
  • The <*> alleles are gone. They were just there to enable the calling step to sum evidence across samples.
  • The PLs are still there - but now they are joined by genotype calls in the GT field.
Qustion

Scroll down the file a bit. Can you find a multi-allelic variant? (i.e. one that has several alleles listed in column 5?) What about a multi-allelic SNP, where all the alleles are single bases (as opposed to an indel?)

Interpreting the GT field

It's pretty simple. A 0 corresponds to the reference allele, a 1 corresponds to the first alternative allele, and so on. For a diploid sample the genotypes are of the form 0/0 (homozygous reference), 0/1 (heterozygous for the first alternative allele) 1/1 homozygous for the first alternative allele, and so on.

If the genotype is ./. that indicates the bcftools was uncertain about the genotype call. This might happen if there were too few reads in that sample covering the site.

By and large these genotypes correspond to the maximum likelihood genotype (as you can tell by comparing to the PL fields... remember we're looking for the smallest PL because of the PHRED scaling.) However, if there aren't many reads the genotype may also be influenced by the prior which is learnt across samples.

An aside on genotype posteriors.

In the lecture you will have seen that the calling works in a Bayesian way, by computing a posterior probability of each genotype. But bcftools doesn't output that posterior by default.

To get it, you can change the calling command to add the -a GP option:

bcftools call -m -v -Oz -a GP -o GWD_FUT2_calls_with_GP.vcf.gz GWD_FUT2_pileup.vcf.gz

If you look at this new output file it has, in addition to genotype (GT) and PHRED-scaled genotype likelihood (PL), a new field called GP which are the genotype posteriors. They look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  FORMAT  HG02461 HG02462 HG02463 HG02464 HG02465 HG02466 HG02561 HG02562 HG02563 HG02568 HG02569 HG02570 HG02571 HG02572 HG02573 HG02574 HG02575 HG02582 HG02583 HG02584 HG02585 HG02586 HG02588 HG02589 HG02590 HG02594 HG02595...
chr19 48693829 . C T 3357.45 . GT:PL:GP 0/0:0,9,75:0.810234,0.189766,2.21701e-08 1/1:85,9,0:2.87613e-09,0.213018,0.786982 0/1:21,0,51:0.00425148,0.995745,3.67872e-06 ./.:0,0,0:0,0,0 0/1:24,0,24:0.00213138,0.996024,0.00184...
chr19 48693956 . A G 6806.98 . GT:PL:GP 0/0:0,99,255:1,3.1458e-11,4.93632e-28 0/0:0,105,255:1,7.9019e-12,4.93632e-28 0/0:0,111,255:1,1.98487e-12,4.93632e-28 0/0:0,78,255:1,3.96033e-09,4.93632e-28 0/0:0,69,255:1,3.1458e-08,4.93632e-28...
chr19 48693992 . G A 1802.53 . GT:PL:GP 0/0:0,96,255:1,1.43086e-11,2.56527e-29 0/0:0,105,255:1,1.80134e-12,2.56527e-29 0/0:0,90,255:1,5.69634e-11,2.56527e-29 0/0:0,105,255:1,1.80134e-12,2.56527e-29 0/0:0,81,255:1,4.52477e-10,2.56527e-29...
chr19 48694122 . C G 125.986 . GT:PL:GP 0/0:0,144,255:1,2.25005e-17,2.52536e-31 0/0:0,114,255:1,2.25005e-14,2.52536e-31 0/0:0,102,255:1,3.56609e-13,2.52536e-31 0/0:0,111,255:1,4.48944e-14,2.52536e-31 0/0:0,93,255:1,2.83264e-12,2.52536e-31...
chr19 48694244 . GTTTTTTTTTTTTTTTTTTTTTT GTTTTTTTTTTTTTTTTTTTTT,GTTTTTTTTTTTTTTTTTTTTTTT 3261.26 . GT:PL:GP 0/1:31,0,37,52,46,77:0.00179946,0.998173,2.19386e-05,3.14376e-06,2.75729e-06,5.46634e-10 0/2:54,84,91,0,42,27:1.8098e-05,7.97434e-09,1.7...
chr19 48694267 . G T 186.473 . GT:PL:GP 0/0:0,42,255:0.999989,1.09366e-05,2.37526e-28 0/0:0,28,255:0.999725,0.000274643,2.37463e-28 0/0:0,30,248:0.999827,0.000173306,1.19026e-27 0/0:0,38,217:0.999973,2.74711e-05,1.49866e-24 0/0:0,8...
chr19 48694335 . G A 1408.47 . GT:PL:GP 0/0:0,81,255:1,4.55947e-10,2.60477e-29 0/0:0,84,255:1,2.28515e-10,2.60477e-29 0/0:0,58,255:1,9.09735e-08,2.60477e-29 0/0:0,75,255:1,1.81516e-09,2.60477e-29 0/0:0,93,255:1,2.87683e-11,2.60477e-29...
chr19 48694424 . T A 11282.2 . GT:PL:GP 0/0:0,102,255:1,3.49087e-11,2.41996e-27 1/1:255,90,0:4.13231e-25,7.2298e-09,1 0/1:202,0,151:1.14042e-20,1,1.09869e-16 0/0:0,108,255:1,8.76867e-12,2.41996e-27 0/1:200,0,202:1.80745e-20,1,8.72717e-22...
chr19 48694447 . G A 3214.56 . GT:PL:GP 0/1:180,0,241:8.19006e-18,1,2.42467e-26 0/0:0,105,255:1,3.86112e-12,1.1786e-28 0/1:161,0,240:6.50559e-16,1,3.05248e-26 0/0:0,108,255:1,1.93514e-12,1.1786e-28 0/0:0,81,255:1,9.69869e-10,1.1786e-28...
chr19 48694448 . G A 349.569 . GT:PL:GP 0/0:0,114,255:1,4.83668e-14,1.1669e-30 0/0:0,105,255:1,3.84191e-13,1.1669e-30 0/0:0,105,255:1,3.84191e-13,1.1669e-30 0/0:0,108,255:1,1.92552e-13,1.1669e-30 0/0:0,78,255:1,1.92552e-10,1.1669e-30...
chr19 48694648 . G A 360.549 . GT:PL:GP 0/0:0,105,255:1,4.222e-13,1.40921e-30 0/0:0,87,255:1,2.6639e-11,1.40921e-30 0/0:0,87,255:1,2.6639e-11,1.40921e-30 0/1:196,0,199:1.8814e-18,1,4.20202e-23 0/0:0,78,255:1,2.11601e-10,1.40921e-30...
chr19 48694728 . C A 3031.84 . GT:PL:GP 0/0:0,99,255:1,1.31458e-11,8.62013e-29 0/0:0,81,255:1,8.29443e-10,8.62013e-29 0/0:0,78,255:1,1.65496e-09,8.62013e-29 0/1:178,0,193:1.5178e-17,1,1.30836e-21 0/0:0,75,255:1,3.30207e-09,8.62013e-29...

For example, for the first sample at the first SNP, the posterior probabilities are 0.810.81, 0.190.19, and 2×1082\times 10^-8. So bcftools is reasonably confident that the genotype is homozygous reference, but there's about a 20% chance it is heterozygous.

Note. The VCF spec lies about these fields. It says they are PHRED scaled, but they aren't - they are unscaled probabilities.

Some questions

Questions

Look at the calls VCF and see if you can answer these questions:

  1. How many of your variants are SNPs?
  2. How many are INDELs?
  3. How many are multi-allelic and how many bi-allelic?

Is the secretor status SNP (rs601338, at chr19:48703417) in these calls? Can you find a sample that has non-secretor status (i.e. has A/A genotype at rs601338.)