Step 1: A Simple Example of Variant Calling
Calling variants for a single sample
To call variants from a single alignment file using BCFtools, run:
bcftools mpileup -f GRCh38_chr19.fa NA12878.bam | bcftools call -mv -Ov -o NA12878.vcf
where -f or --fasta refers to the reference genome we are using.
The first command, mpileup, generates a summary of base calls at each position in the genome where data is available. If we saved the output of mpileup, the first few lines (excluding all but the last line of the header) would look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878.bam
chr19 60031 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,0,0,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60032 . T <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,1,1,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60033 . G <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,2,4,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60034 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,3,9,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60035 . A <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,4,16,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60036 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,5,25,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60037 . A <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,6,36,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60038 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,7,49,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60039 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,8,64,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
chr19 60040 . T <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,9,81,0,0;QS=1,0;FS=0;MQ0F=0 PL 0,3,30
We’ll come back to the meaning of each column shortly with our final output.
The most important thing to note here is that the ALT column contains <*>. This signifies no variation has been observed at that locus. These lines represent reference-matching positions. There are many more loci containing no variation on the reference. The first locus with a putative variant in our example file is chr19 60173
chr19 60173 . G A,<*> 0 . DP=12;I16=11,0,1,0,320,9400,30,900,660,39600,60,3600,208,4704,7,49;QS=0.914286,0.0857143,0;SGB=-0.379885;RPBZ=-1.30357;BQBZ=0.301511;FS=0;MQ0F=0 PL 0,6,118,33,121,137
Since we’re only interested in variant sites, we pipe (|) the mpileup output directly into bcftools call, which identifies and outputs variant positions if you specify -v (note -m ensures that multiple alternate alleles can be detected at any given site). This should not be confused with -Ov which tells BCFtools to output an uncompressed VCF file...and what is a VCF file?
VCF: The Variant Calling Format
Most popular variant callers produce their output in a VCF (Variant Calling Format) file.
To view the contents of a VCF file:
less NA12878.vcf
Each entry corresponds to a position where a variant has been called.
- The
REFcolumn gives the reference base at that position. - The
ALTcolumn shows the alternative base(s) found in the sample. - The
QUALcolumn provides a Phred-scaled quality score expressing the confidence in the variant call. Like base and mapping quality scores, it is defined as:(call in ALT is wrong) - The
INFOcolumn contains variant-specific metrics such as read depth, allele frequency, or strand bias. The specific fields included inINFOcan vary depending on the variant caller and the options used. The header of the VCF file defines each INFO field and its format.
To view only the header of NA12878.vcf:
bcftools view -h NA12878.vcf
VCF is in human-redeable format so we could also simply use head -30 NA12878.vcf which shows us the header and the first variant site. Notice it's not what we saw in the mpileup file. Why not?
Following these fields is genotype data, which is specific to each sample in the VCF. In our current file, we only have one sample, but we’ll soon look at a multi-sample example. The FORMAT column gives the identifying abbreviation and order of each genotype-level field present. the field definitions can be found in the VCF header.
Let's generate some multi-sample VCFs now!