Working with sex chromosomes
Handling haploid and diploid samples
Several operations in QCTOOL, including -snp-stats, handle haploid and diploid genotypes directly. For example, consider the command:
$ qctool -g <input file> -snp-stats -osnp <output file>
When the input file contains X chromosome variants, the output file will contain relevant information including haploid and diploid genotype counts, tests for equal frequency between haploid and diploid samples, and appropriately estimated frequency and imputation certainty metrics.
Inferring ploidy from sex information
For haploid and diploid samples to be handled correctly, haploid genotypes must be encoded appropriately in the input files, e.g. by using haploid calls in the VCF GT field, or haploid data in BGEN-format files. However, some file formats (e.g. GEN format) do not support haploid calls, and some datasets (e.g. the 1000 Genomes Project genotype files) encode all calls as diploid even when the underlying genotype is haploid (i.e. for males on the X chromosome). To work around this, QCTOOL can infer the ploidy from sex information supplied in a sample file. (This functionality currently uses a hard-coded mapping of chromosome identifiers to ploidy that is appropriate for humans.)
This works as follows. Suppose sex is the name of a column in the sample file containing sex information. This column must be of type 'D', and the valid values are:
- 1 for male samples,
- 2 for female samples.
Then the -infer-ploidy-from option can be used to tell QCTOOL to interpret diploid calls as haploid (or zero-ploid) as appropriate:
$ qctool -g <input file> -s <sample file> -infer-ploidy-from sex -snp-stats -osnp <output file>
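As an illustration of the sample file requirements described above, the following Python sketch parses a small sample file and checks that the sex column has type 'D' and contains only 1, 2, or a missing value. The layout with leading ID_1, ID_2 and missing columns follows the common Oxford sample file convention and is an assumption here, as are the sample identifiers, the use of NA for missing values, and the helper name read_sex_column. It is not part of QCTOOL.

```python
from typing import Dict, Optional

# Hypothetical sample file contents. The first line names the columns and the
# second line gives their types ('0' for ID-like columns, 'D' for discrete).
SAMPLE_FILE_TEXT = """\
ID_1 ID_2 missing sex
0 0 0 D
s1 s1 0 1
s2 s2 0 2
s3 s3 0 NA
"""


def read_sex_column(text: str, column: str = "sex") -> Dict[str, Optional[int]]:
    """Return a mapping from sample ID to sex (1 = male, 2 = female, None = missing)."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    types = lines[1].split()
    index = header.index(column)  # raises ValueError if the column is absent
    if types[index] != "D":
        raise ValueError(f"column {column!r} must have type 'D', found {types[index]!r}")

    sexes: Dict[str, Optional[int]] = {}
    for line in lines[2:]:
        fields = line.split()
        value = fields[index]
        if value == "NA":  # assumed missing-value marker
            sexes[fields[0]] = None
        elif value in ("1", "2"):
            sexes[fields[0]] = int(value)
        else:
            raise ValueError(f"invalid sex value {value!r} for sample {fields[0]}")
    return sexes


if __name__ == "__main__":
    print(read_sex_column(SAMPLE_FILE_TEXT))
```

A check like this can be run on a sample file before passing it to QCTOOL, so that unexpected codes in the sex column are caught early.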
For each genotype QCTOOL applies the following rules:
- If the input genotype is not diploid, raise an error and terminate the program.
- If the sex of the sample is missing, set the genotype to missing.
- Otherwise, compute the ploidy of the sample based on the chromosome identifier and the sample sex.
- If the inferred ploidy is diploid, output the input genotype unchanged.
- If the inferred ploidy is haploid, check whether the input genotype is homozygous (or, for genotype probabilities, has 100% of the probability mass on homozygous calls). If so, set the genotype to the corresponding haploid call. Otherwise, set the genotype to missing.
These rules apply both to hard genotype calls (e.g. the VCF GT field) and to genotype probabilities (e.g. VCF GP field, GEN format, or BGEN format data).
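The rules above can be sketched in Python for hard genotype calls. This is a minimal illustration rather than QCTOOL's implementation: the function names are hypothetical, genotypes are represented as tuples of allele indices, a missing genotype or missing sex is represented as None, and the chromosome-to-ploidy mapping covers only autosomes named 1 to 22 and a chromosome named X. QCTOOL's actual hard-coded mapping is not reproduced here.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, ...]  # allele indices, e.g. (0, 1) for a heterozygous call

AUTOSOMES = {str(i) for i in range(1, 23)}


def infer_ploidy(chromosome: str, sex: int) -> int:
    """Assumed human mapping: autosomes are diploid; X is haploid in males."""
    if sex not in (1, 2):
        raise ValueError(f"sex must be 1 (male) or 2 (female), got {sex!r}")
    if chromosome in AUTOSOMES:
        return 2
    if chromosome == "X":
        return 1 if sex == 1 else 2
    # Other chromosomes (e.g. Y) are deliberately left out of this sketch.
    raise ValueError(f"chromosome {chromosome!r} is not covered by this sketch")


def apply_inferred_ploidy(
    genotype: Genotype, chromosome: str, sex: Optional[int]
) -> Optional[Genotype]:
    """Apply the rules in order; None stands for a missing genotype."""
    # Rule 1: the input genotype must be diploid.
    if len(genotype) != 2:
        raise ValueError("input genotype is not diploid")
    # Rule 2: missing sex gives a missing genotype.
    if sex is None:
        return None
    # Rule 3: compute ploidy from the chromosome identifier and sample sex.
    ploidy = infer_ploidy(chromosome, sex)
    # Rule 4: diploid genotypes pass through unchanged.
    if ploidy == 2:
        return genotype
    # Rule 5: haploid; keep homozygous calls as a single allele, else missing.
    if genotype[0] == genotype[1]:
        return (genotype[0],)
    return None


if __name__ == "__main__":
    print(apply_inferred_ploidy((1, 1), "X", 1))  # homozygous call in a male
    print(apply_inferred_ploidy((0, 1), "X", 1))  # heterozygous call in a male
    print(apply_inferred_ploidy((0, 1), "X", 2))  # female, left unchanged
```

For genotype probabilities the haploid case would use the analogous check from the rules: all of the probability mass must lie on homozygous calls, otherwise the genotype is set to missing.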