Genotype file formats
QCTOOL supports the following file formats for genotype data:
Format (recognised extensions) | Filetype (for -[o]filetype) | Notes |
---|---|---|
GEN format (.gen, .gen.gz) (Input / output) | gen | Input files may optionally include an extra initial column containing the chromosome. QCTOOL auto-detects this column in input files by counting the columns in the file. To suppress this column in output files, use the -omit-chromosome option. The GEN format is further described here. |
BGEN format (.bgen) (Input / output) | bgen | Output files are written as BGEN v1.2 with 16 bits per probability and are compressed using zlib by default. |
VCF format (.vcf, .vcf.gz) (Input / output) | vcf | QCTOOL is strict about VCF metadata in input files for the fields it reads. Since metadata is not always correct, a -metadata option is provided to override the input file metadata. Currently, only genotypes are written when outputting VCF files. Note that QCTOOL does not apply PHRED scaling to probabilities in the GP field. |
PLINK binary PED format (.bed, .bim, .fam) (Input / output) | binary_ped | Note that QCTOOL currently does only the most basic processing of FAM files: when reading, it uses them to count the number of samples in the BED file; when writing, it writes a FAM file with missing data in all fields except the ID field. You will therefore need to create fuller FAM files separately for use with other tools. |
SHAPEIT haplotype format (Input / output) | shapeit_haplotypes | This format has five initial columns (`SNPID`, `rsid`, `position`, and the first and second alleles) followed by two columns for each sample representing the two haplotypes. These columns contain 0 (representing the first allele) or 1 (representing the second allele). The first column is sometimes used to record the chromosome instead of a separate ID, but to my knowledge this is only a convention. QCTOOL does not interpret the first column as chromosome information, but the `-assume-chromosome` option can be used to work around this. To convert haplotypes to genotypes, QCTOOL assumes that the two haplotypes for each individual are consecutive columns in the file. (This format is described here.) |
IMPUTE allele probabilities format (Input / output) | impute_allele_probs | This file format is like the SHAPEIT haplotype format but contains a probability for each haplotype (i.e. two probabilities per individual), specifying the probability that the haplotype carries the second allele. |
IMPUTE haplotype format (Input only) | impute_haplotypes | It is assumed that the legend file name is the same as the haplotypes file name, minus extension, with .legend appended; QCTOOL will also remove/add the .gz extension as appropriate. For genotypic computations, genotypes are formed from pairs of haplotypes; it is assumed that the two haplotypes for each individual are consecutive columns in the haplotypes file. |
HLAIMP probability format (Input only) | hlaimp | Currently, this input format implicitly splits each HLA locus into a series of bi-allelic variants. |
QCTOOL 'long' format (Input only) | long | Input must be a file with columns SNPID, rsid, chromosome, position, number_of_alleles, allele1, other_alleles, sample_id, ploidy, genotype. Further columns may also be included (but QCTOOL ignores these). Alleles in the other_alleles column must be comma-separated (as with VCF ALT alleles). When outputting to VCF format, both the genotype (GT) field and a 'typed' field, indicating whether a row was present for each sample and variant, are output. |
PennCNV / QuantiSNP format (Output only) | penncnv | PennCNV uses a single sample per input file; this can be achieved using the sample filtering options, e.g. `-incl-samples-where ID_1=<identifier>` |
BIMBAM dosage format; QCTOOL dosage format (.dosage[.gz]) (Output only) | bimbam_dosage or dosage | These formats contain a single column per sample (named by the sample identifier) holding the expected second allele dosage for the sample at each variant. The formats differ in that BIMBAM format has no chromosome/position information. |
QCTOOL intensity text format (.intensity[.gz]) (Output only) | intensity | The output file has two columns per sample, representing X and Y channel intensities for the sample at each variant. Currently data must be read from a VCF file; the field is specified using the -vcf-intensity-field option. |
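
The GEN entry in the table says that QCTOOL detects the optional chromosome column by counting columns. The sketch below shows one way such a check could work. It assumes the usual GEN layout of five leading columns (SNPID, rsid, position and two alleles) followed by three genotype probabilities per sample. The function name `has_chromosome_column` is hypothetical, and this is an illustration rather than QCTOOL's actual implementation.

```python
def has_chromosome_column(line: str) -> bool:
    """Guess whether a GEN line starts with an extra chromosome column.

    Assumption: a plain GEN line has 5 + 3N whitespace-separated fields
    (N samples), and a line with a chromosome column has 6 + 3N fields.
    """
    n_fields = len(line.split())
    if n_fields < 5:
        raise ValueError(f"too few columns for a GEN line: {n_fields}")
    remainder = n_fields % 3
    if remainder == 2:  # 5 + 3N fields: no chromosome column
        return False
    if remainder == 0:  # 6 + 3N fields: chromosome column present
        return True
    raise ValueError(f"unexpected GEN column count: {n_fields}")


# Two samples, without and with a leading chromosome column.
plain = "SNP1 rs1 1000 A G 1 0 0 0 1 0"
with_chrom = "01 SNP1 rs1 1000 A G 1 0 0 0 1 0"
print(has_chromosome_column(plain))
print(has_chromosome_column(with_chrom))
```

Because the two layouts always differ by exactly one column, the column count modulo 3 is enough to tell them apart without knowing the number of samples.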
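
For the SHAPEIT and IMPUTE haplotype formats, the table notes that genotypes are formed from consecutive pairs of haplotype columns. The sketch below illustrates the idea for a SHAPEIT-style line with five leading columns. The function name `haplotypes_to_genotypes` is hypothetical, and representing each genotype as a count of second alleles is a choice made for this example, not necessarily QCTOOL's internal representation.

```python
def haplotypes_to_genotypes(line: str, n_leading: int = 5) -> list[int]:
    """Pair consecutive haplotype columns into second-allele counts.

    Assumption: the line has `n_leading` identifying columns followed by
    two 0/1 haplotype columns per sample, in consecutive order.
    """
    haps = [int(value) for value in line.split()[n_leading:]]
    if len(haps) % 2 != 0:
        raise ValueError("expected an even number of haplotype columns")
    if any(h not in (0, 1) for h in haps):
        raise ValueError("haplotype columns must contain 0 or 1")
    # Columns (0, 1) belong to sample 1, (2, 3) to sample 2, and so on.
    return [haps[i] + haps[i + 1] for i in range(0, len(haps), 2)]


# Three samples: 0|0, 0|1 and 1|1.
print(haplotypes_to_genotypes("SNP1 rs1 1000 A G 0 0 0 1 1 1"))
```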
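
The IMPUTE haplotype entry describes how the legend file name is derived from the haplotypes file name. The sketch below follows that description: drop any .gz suffix, drop the remaining extension, and append .legend. The function name `candidate_legend_paths` is hypothetical, and the order in which QCTOOL tries the compressed and uncompressed names is not stated in the table, so both candidates are simply returned.

```python
from pathlib import Path


def candidate_legend_paths(haps_path: str) -> list[str]:
    """Return plausible legend file names for an IMPUTE haplotypes file.

    Follows the rule in the table: haplotypes file name, minus extension,
    with .legend appended, with or without a trailing .gz.
    """
    path = Path(haps_path)
    if path.suffix == ".gz":
        path = path.with_suffix("")  # drop the compression suffix
    stem = path.with_suffix("")      # drop the remaining extension, if any
    legend = str(stem) + ".legend"
    return [legend, legend + ".gz"]


print(candidate_legend_paths("chr1.hap.gz"))
```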
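
The dosage formats in the table store the expected second allele dosage. For a diploid, bi-allelic variant this is conventionally the probability-weighted count of second alleles. The sketch below computes it from genotype probabilities (as in GEN files) and from per-haplotype probabilities (as in the IMPUTE allele probabilities format). The function names are hypothetical, and the handling of missing data is not covered.

```python
def dosage_from_genotype_probs(p_aa: float, p_ab: float, p_bb: float) -> float:
    """Expected second-allele count: 0 * P(AA) + 1 * P(AB) + 2 * P(BB)."""
    return p_ab + 2.0 * p_bb


def dosage_from_allele_probs(p_hap1: float, p_hap2: float) -> float:
    """Expected second-allele count from two per-haplotype probabilities."""
    return p_hap1 + p_hap2


print(dosage_from_genotype_probs(0.25, 0.5, 0.25))
print(dosage_from_allele_probs(0.5, 0.25))
```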
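
The 'long' format entry lists ten required columns and says that other_alleles is comma-separated. The sketch below shows how a row might be split and its allele list assembled. It assumes whitespace-delimited fields, which the table does not state. The function name `parse_long_row` is hypothetical, and the genotype value in the example is a placeholder, since the table does not describe the genotype encoding.

```python
LONG_COLUMNS = [
    "SNPID", "rsid", "chromosome", "position", "number_of_alleles",
    "allele1", "other_alleles", "sample_id", "ploidy", "genotype",
]


def parse_long_row(line: str) -> dict:
    """Split one 'long' format row and collect its alleles.

    Assumption: fields are whitespace-delimited. Columns beyond the ten
    required ones are ignored, as the table says QCTOOL does.
    """
    fields = line.split()
    if len(fields) < len(LONG_COLUMNS):
        raise ValueError(f"expected at least {len(LONG_COLUMNS)} columns")
    row = dict(zip(LONG_COLUMNS, fields))  # zip drops any extra columns
    alleles = [row["allele1"]] + row["other_alleles"].split(",")
    if len(alleles) != int(row["number_of_alleles"]):
        raise ValueError("number_of_alleles does not match the allele columns")
    row["alleles"] = alleles
    return row


# "G1" stands in for a genotype value; its real encoding is not shown here.
row = parse_long_row("SNP1 rs1 01 1000 3 A C,T sample_1 2 G1 extra_column")
print(row["alleles"])
```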