Variant calling and genotyping with bcftools
In the earlier session we focussed on a single sample and investigate typical steps used to get from the data that comes from a sequencer (the fastq files) to a good set of read alignments. You should have viewed some alignments and been able to see evidence of genetic variation.
In this section we will look at analysing multiple samples using a statistical model that does the work of finding genetic variants and genotyping them for us.
Calling variants and genotypes
Before starting let's have a look at the data. First make sure you are in the right folder:
cd ~/variant_calling
Use the command-line or the panel on the left to look at what is in the data folder.
- In the
reads/folder you will see BAM files from 164 human samples. The reads are all from one part of the genome, around the gene FUT2, which lies on chromosome 19. These samples all come from the 1000 Genomes Project, which is a publicly accessible resource of genomic data from worldwide populations. - In the top-level folder you will also see a FASTA file
GRCh38_chr19.fa.gzcontaining the human reference genome assembly of chromosome 19.
Let's use this data now to generate some variant and genotype calls inside FUT2.
Variant calling
Conceptually genotype calling with bcftools has two steps:
The first step looks at each sample seperately and compute genotype likelihoods for each possible genotype at each site. (A heuristic is used to decide what the 'possible' genotypes are, by looking across samples.)
The in the second step we use this information to call genotypes.
First make sure you are in the right folder:
cd ~/variant_calling
Now compute genotype likelihoods using the bcftools mpileup command:
bcftools mpileup -Oz -f GRCh38_chr19.fa.gz -o GWD_FUT2_pileup.vcf.gz reads/*.bam
This will take a minute or so to run. Watch the screen output to see what bcftools is doing. The mpileup command computes the read pileup for each
sample at each position in the region, and outputs a 'VCF' file with this information in.
The -Oz argument tells bcftools we want to output gzipped VCF format (i.e. compressed variant call format.)
Now let's use this with the bcftools call command to call variants and genotypes:
bcftools call -m -v -Oz -o GWD_FUT2_calls.vcf.gz GWD_FUT2_pileup.vcf.gz
This command reads in the genotype likelihoods computed in the earlier step, and produces genotype calls across samples.
If you want to see what the command-line options do, run bcftools call without any options:
bcftools call
Here we have used -m - "use the multiallelic caller", and -v "Output only variant sites."
Inspecting the pileup output
Both the pileup and call steps output a Variant Call Format
(VCF) file. What's the difference? Look first
at the pileup file using zless:
zless -S GWD_FUT2_pileup.vcf.gz
The first lines in the file (the ones starting with a # character) are all metadata. They're
useful but let's skip them for now. The easiest way to do this is search for 'CHROM' which you can
do by pressing / (forward slash) followed by CHROM, and then press enter. You should see
something like this:

Feel free to scroll around a bit. A few things to note:
bcftools mpileuphas looked at every site in the region (from 48693822 to 48708100, as it happens) and assessed the evidence for variation at each one.The 4th and 5th columns tell you what alleles
mpileupthinks are present in the reads. The 4th column is the reference sequence base, and the 5th is are the other alleles seen (with<*>standing for 'anything else').The 8th column is called
INFOand contains a bunch of information about the variants across all samples. Exactly what is in there is described in the metadata, but for example you should be able to see values calledDP. This is the total depth of reads observed across all samples at the site. Other metrics similarly capture properties of the read pileup.The per-sample data itself starts way over in the 10th column. The values are PHRED-scaled genotype likelihoods for each possible genotype for the given alleles.
Note. If there are alleles listed, then there are possible genotypes (for a diploid human individual). This is why some rows have many more values than others - more alleles were seen in the reads.
Note. The PHRED-scaled likelihood (PL) for a particular genotype is computed as
The simplest reasonable likelihood function in the bracket here would be something proportional to a binomial likelihood, but in practice bcftools uses a more complex likelihood that allows for error dependency between reads. (One motivation is that an error might occur as an alignment error or be sequence context-dependent, so it makes sense to allow for this.)
Example. For the first site and first sample, the PLs are 0, 3, 26. That translates to
likelihoods of about 1, 0.5, and 0.0025. pileup is not very certain about what this genotype is!
If you look at the sample using tview you'll see why:
samtools tview -p chr19:48693822 reads/HG02461.final_FUT2.bam
Only two reads cover this position in this sample (the position is right at the end of the region that has been
extracted). Both reads have A bases, but more reads would be needed to be really confident this is a homozygous
A.
Inspecting the calls output
Now look at the calls file:
zless -S GWD_FUT2_calls.vcf.gz
Again skip past the metadata by typing /CHROM and pressing ENTER. You should see something like this:

(NB. I've shrunk the INFO data in the above image to make it easier to see - your file will no doubt have a bunch of stuff in there).
A few points to note:
- The file only has a subset of sites in it - those where
bcftoolsthinks there is genuine variation. - The
<*>alleles are gone. They were just there to enable the calling step to sum evidence across samples. - The PLs are still there - but now they are joined by genotype calls in the
GTfield.
Scroll down the file a bit. Can you find a multi-allelic variant? (i.e. one that has several alleles listed in column 5?) What about a multi-allelic SNP, where all the alleles are single bases (as opposed to an indel?)
Interpreting the GT field
It's pretty simple. A 0 corresponds to the reference allele, a 1 corresponds to the first
alternative allele, and so on. For a diploid sample the genotypes are of the form 0/0 (homozygous
reference), 0/1 (heterozygous for the first alternative allele) 1/1 homozygous for the first
alternative allele, and so on.
If the genotype is ./. that indicates the bcftools was uncertain about the genotype call. This might happen if
there were too few reads in that sample covering the site.
By and large these genotypes correspond to the maximum likelihood genotype (as you can tell by comparing to the PL fields... remember we're looking for the smallest PL because of the PHRED scaling.) However, if there aren't many reads the genotype may also be influenced by the prior which is learnt across samples.
An aside on genotype posteriors.
In the lecture you will have seen that the calling works in a Bayesian way, by computing a
posterior probability of each genotype. But bcftools doesn't output that posterior by default.
To get it, you can change the calling command to add the -a GP option:
bcftools call -m -v -Oz -a GP -o GWD_FUT2_calls_with_GP.vcf.gz GWD_FUT2_pileup.vcf.gz
If you look at this new output file it has, in addition to genotype (GT) and PHRED-scaled
genotype likelihood (PL), a new field called GP which are the genotype posteriors. They look
like this:
#CHROM POS ID REF ALT QUAL FILTER FORMAT HG02461 HG02462 HG02463 HG02464 HG02465 HG02466 HG02561 HG02562 HG02563 HG02568 HG02569 HG02570 HG02571 HG02572 HG02573 HG02574 HG02575 HG02582 HG02583 HG02584 HG02585 HG02586 HG02588 HG02589 HG02590 HG02594 HG02595...
chr19 48693829 . C T 3357.45 . GT:PL:GP 0/0:0,9,75:0.810234,0.189766,2.21701e-08 1/1:85,9,0:2.87613e-09,0.213018,0.786982 0/1:21,0,51:0.00425148,0.995745,3.67872e-06 ./.:0,0,0:0,0,0 0/1:24,0,24:0.00213138,0.996024,0.00184...
chr19 48693956 . A G 6806.98 . GT:PL:GP 0/0:0,99,255:1,3.1458e-11,4.93632e-28 0/0:0,105,255:1,7.9019e-12,4.93632e-28 0/0:0,111,255:1,1.98487e-12,4.93632e-28 0/0:0,78,255:1,3.96033e-09,4.93632e-28 0/0:0,69,255:1,3.1458e-08,4.93632e-28...
chr19 48693992 . G A 1802.53 . GT:PL:GP 0/0:0,96,255:1,1.43086e-11,2.56527e-29 0/0:0,105,255:1,1.80134e-12,2.56527e-29 0/0:0,90,255:1,5.69634e-11,2.56527e-29 0/0:0,105,255:1,1.80134e-12,2.56527e-29 0/0:0,81,255:1,4.52477e-10,2.56527e-29...
chr19 48694122 . C G 125.986 . GT:PL:GP 0/0:0,144,255:1,2.25005e-17,2.52536e-31 0/0:0,114,255:1,2.25005e-14,2.52536e-31 0/0:0,102,255:1,3.56609e-13,2.52536e-31 0/0:0,111,255:1,4.48944e-14,2.52536e-31 0/0:0,93,255:1,2.83264e-12,2.52536e-31...
chr19 48694244 . GTTTTTTTTTTTTTTTTTTTTTT GTTTTTTTTTTTTTTTTTTTTT,GTTTTTTTTTTTTTTTTTTTTTTT 3261.26 . GT:PL:GP 0/1:31,0,37,52,46,77:0.00179946,0.998173,2.19386e-05,3.14376e-06,2.75729e-06,5.46634e-10 0/2:54,84,91,0,42,27:1.8098e-05,7.97434e-09,1.7...
chr19 48694267 . G T 186.473 . GT:PL:GP 0/0:0,42,255:0.999989,1.09366e-05,2.37526e-28 0/0:0,28,255:0.999725,0.000274643,2.37463e-28 0/0:0,30,248:0.999827,0.000173306,1.19026e-27 0/0:0,38,217:0.999973,2.74711e-05,1.49866e-24 0/0:0,8...
chr19 48694335 . G A 1408.47 . GT:PL:GP 0/0:0,81,255:1,4.55947e-10,2.60477e-29 0/0:0,84,255:1,2.28515e-10,2.60477e-29 0/0:0,58,255:1,9.09735e-08,2.60477e-29 0/0:0,75,255:1,1.81516e-09,2.60477e-29 0/0:0,93,255:1,2.87683e-11,2.60477e-29...
chr19 48694424 . T A 11282.2 . GT:PL:GP 0/0:0,102,255:1,3.49087e-11,2.41996e-27 1/1:255,90,0:4.13231e-25,7.2298e-09,1 0/1:202,0,151:1.14042e-20,1,1.09869e-16 0/0:0,108,255:1,8.76867e-12,2.41996e-27 0/1:200,0,202:1.80745e-20,1,8.72717e-22...
chr19 48694447 . G A 3214.56 . GT:PL:GP 0/1:180,0,241:8.19006e-18,1,2.42467e-26 0/0:0,105,255:1,3.86112e-12,1.1786e-28 0/1:161,0,240:6.50559e-16,1,3.05248e-26 0/0:0,108,255:1,1.93514e-12,1.1786e-28 0/0:0,81,255:1,9.69869e-10,1.1786e-28...
chr19 48694448 . G A 349.569 . GT:PL:GP 0/0:0,114,255:1,4.83668e-14,1.1669e-30 0/0:0,105,255:1,3.84191e-13,1.1669e-30 0/0:0,105,255:1,3.84191e-13,1.1669e-30 0/0:0,108,255:1,1.92552e-13,1.1669e-30 0/0:0,78,255:1,1.92552e-10,1.1669e-30...
chr19 48694648 . G A 360.549 . GT:PL:GP 0/0:0,105,255:1,4.222e-13,1.40921e-30 0/0:0,87,255:1,2.6639e-11,1.40921e-30 0/0:0,87,255:1,2.6639e-11,1.40921e-30 0/1:196,0,199:1.8814e-18,1,4.20202e-23 0/0:0,78,255:1,2.11601e-10,1.40921e-30...
chr19 48694728 . C A 3031.84 . GT:PL:GP 0/0:0,99,255:1,1.31458e-11,8.62013e-29 0/0:0,81,255:1,8.29443e-10,8.62013e-29 0/0:0,78,255:1,1.65496e-09,8.62013e-29 0/1:178,0,193:1.5178e-17,1,1.30836e-21 0/0:0,75,255:1,3.30207e-09,8.62013e-29...
For example, for the first sample at the first SNP, the posterior probabilities are , ,
and . So bcftools is reasonably confident that the genotype is homozygous
reference, but there's about a 20% chance it is heterozygous.
Note. The VCF spec lies about these fields. It says they are PHRED scaled, but they aren't - they are unscaled probabilities.
Some questions
Look at the calls VCF and see if you can answer these questions:
- How many of your variants are SNPs?
- How many are INDELs?
- How many are multi-allelic and how many bi-allelic?
Is the secretor status SNP (rs601338, at chr19:48703417) in these calls? Can you find a sample that has non-secretor status (i.e. has A/A genotype at rs601338.)