QCTOOL v2

Per-sample summary statistics

The -sample-stats option computes per-sample summary statistics. The output is written to the file specified by the -osample option. E.g.:

$ qctool -g example.bgen -s example.sample -sample-stats -osample sample-stats.txt

This computes per-sample summary statistics (average missingness and heterozygosity) and places them in the file sample-stats.txt. Additionally, if array intensity data is available (see processing intensity data), the average X channel intensity, Y channel intensity, total intensity (X+Y) and intensity difference (X-Y) will be computed. These can be useful for QC purposes - for example, average intensity on the X and Y chromosomes can be used to directly determine sample gender.

Note: the output file can be formatted in various ways, controlled by the file extension. See the page on summary statistic file formats for information on output file formatting.
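As a rough illustration of the two per-sample quantities named above, the following Python sketch computes a missing-call proportion and a heterozygous-call proportion for one sample. It is not QCTOOL's implementation: the genotype encoding (0, 1 or 2 copies of the second allele, None for a missing call), the function name and the output field names are all assumptions made for this example, and QCTOOL may compute these values differently (for instance from genotype probabilities).

```python
def sample_stats(genotypes):
    """Summarise one sample's genotype calls across variants.

    genotypes: list of 0, 1, 2 (assumed count of the second allele)
    or None for a missing call. This encoding is hypothetical.
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    # Proportion of variants with no genotype call for this sample.
    missing = (n_total - len(called)) / n_total if n_total else float("nan")
    # Proportion of called genotypes that are heterozygous.
    het = (sum(1 for g in called if g == 1) / len(called)) if called else float("nan")
    # Field names below are illustrative, not QCTOOL's output columns.
    return {"missing_proportion": missing, "heterozygous_proportion": het}


stats = sample_stats([0, 1, None, 2, 1, 0, None, 1])
print(stats)
```

Samples whose values are extreme relative to the rest of the cohort are the usual candidates for closer inspection.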

Per-variant summary statistics

The basic option to compute per-variant summary statistics is -snp-stats. E.g.:

$ qctool -g example.bgen -snp-stats -osnp snp-stats.txt

This will compute genotype counts, allele counts and frequencies, missing data rates, info metrics, and a P-value against the null that genotypes are in Hardy-Weinberg proportions in diploid samples. Output is sent to the file specified in the -osnp option. See the page on summary statistic file formats for information on output file formatting.
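To make the Hardy-Weinberg test concrete, here is a minimal Python sketch that derives an allele frequency from diploid genotype counts and tests the counts against Hardy-Weinberg proportions. It uses a simple one-degree-of-freedom chi-squared test, chosen here for brevity; this page does not say which test QCTOOL uses, so treat the method, the function name and the argument names as assumptions.

```python
import math


def hwe_chi_squared(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions (illustrative only).

    Returns (frequency of allele A, chi-squared statistic, P-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no non-missing genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


freq, chi2, p_value = hwe_chi_squared(25, 50, 25)
print(freq, chi2, p_value)
```

The chi-squared approximation is unreliable when expected counts are small, which is one reason an exact test is often preferred for rare variants.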

Analysis on the sex chromosomes is complicated by the fact that males and females have differing ploidy. To process sex chromosomes correctly, QCTOOL relies on the ploidy being correct in the input genotype files. However, some data sets (and some file formats) instead encode males as diploid homozygotes. The -infer-ploidy-from option can be used to deal with such data - see the page on inferring ploidy.

For sex chromosomes, QCTOOL outputs both diploid and haploid genotype counts, as well as an appropriate allele frequency, a sex-chromosome specific info metric, and a test for difference in frequency between males and females.

Differential missingness

It's often useful to compare levels of missingness between different samples. The -differential option can be used to compare levels of missingness between samples having different levels of a covariate in the sample file. E.g.:

$ qctool -g example.bgen -s example.sample -osnp snp-stats.txt -differential <column>

This computes the counts of missing and non-missing genotypes in each level of the covariate in the specified column, and a likelihood ratio test P-value comparing missing data rates between the levels. Additionally, if the covariate has exactly two levels, a Fisher's exact test P-value is also computed.
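The likelihood ratio comparison described above can be sketched in Python as a G-test on a table of missing and non-missing counts per covariate level. This is an illustration under assumptions, not QCTOOL's code: the function names and the input layout are invented for the example, and the P-value is taken from a chi-squared distribution with (number of levels - 1) degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # exact tail for df = 1
    # Step up two degrees of freedom at a time using the
    # recurrence for the regularised upper incomplete gamma function.
    while a + 1 <= df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def differential_missingness(counts):
    """counts maps each covariate level to (n_missing, n_non_missing).

    Returns (G statistic, degrees of freedom, P-value).
    """
    total = sum(m + n for m, n in counts.values())
    col_totals = (sum(m for m, _ in counts.values()),
                  sum(n for _, n in counts.values()))
    g = 0.0
    for row in counts.values():
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            if observed > 0:
                # Expected count if missingness does not depend on the level.
                expected = row_total * col_total / total
                g += 2.0 * observed * math.log(observed / expected)
    df = len(counts) - 1
    return g, df, chi2_sf(g, df)


g, df, p_value = differential_missingness({"case": (5, 95), "control": (30, 70)})
print(g, df, p_value)
```

A small P-value here suggests that genotype calls are missing at different rates in the groups, which can induce spurious association signals if the covariate is the phenotype of interest.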

Combining options

Summary statistic options can be combined; e.g.:

$ qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt -differential <column>

This computes both basic summary statistics and values for differential missingness, and places them in the same output file.

Stratifying summary statistics

The -stratify option can be used to compute summary statistics stratified over subsets of the data. E.g.:

$ qctool -g example.bgen -s example.sample -snp-stats -osnp snp-stats.txt -stratify <column>

The argument must be the name of a column in the sample file containing discrete values (i.e. it must be of type B or D). Summary statistics will be computed for each subset of samples having the same value in that column. The output will contain the same fields as for -snp-stats, but each column will appear multiple times with a suffix of the form [<column>=<value>] to denote which stratum the values are computed for.
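The stratified layout described above can be pictured with a short Python sketch that groups samples by a discrete covariate and reports one statistic per stratum, labelled with the [<column>=<value>] suffix. The statistic name allele_B_frequency, the genotype encoding (count of the second allele, None for missing) and the function itself are hypothetical and stand in for whichever fields QCTOOL actually writes.

```python
from collections import defaultdict


def stratified_allele_frequency(genotypes, strata, column):
    """Compute an allele frequency per stratum for a single variant.

    genotypes: per-sample 0/1/2 counts of the second allele, or None if missing.
    strata:    per-sample value of the discrete covariate.
    column:    name of the covariate column, used to build the suffix.
    """
    groups = defaultdict(list)
    for genotype, stratum in zip(genotypes, strata):
        groups[stratum].append(genotype)

    result = {}
    for value, calls in sorted(groups.items()):
        called = [g for g in calls if g is not None]
        frequency = sum(called) / (2 * len(called)) if called else float("nan")
        # Suffix of the form [<column>=<value>] marks the stratum.
        result[f"allele_B_frequency[{column}={value}]"] = frequency
    return result


row = stratified_allele_frequency(
    genotypes=[0, 1, 2, None, 1, 2],
    strata=["case", "case", "case", "control", "control", "control"],
    column="status",
)
print(row)
```

Each unstratified field therefore fans out into one column per distinct value of the covariate.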

This feature has several possible use cases - for example, it can be used to compute allele counts across ethnic groups in a sample of mixed ancestry, or to inspect deviation from Hardy-Weinberg equilibrium separately in disease cases and controls.

Computing summary statistics