Introduction. Inthinnerator is a command-line utility for the common task of "thinning" genetic variants, most commonly SNPs, from a dense (usually genome-wide) set of variants such as you would use in a GWAS. Inthinnerator might be useful if:
- You want an approximately independent set of variants for use in other analyses (like principal components analysis).
- Or you want to pick out 'top' loci (e.g. lead GWAS SNPs) as best representatives of variation in each region.
- Or you want to simulate 'null' sets of variants to use, for example, in empirical enrichment analysis.
Features. Inthinnerator can be used to:
- Pick SNPs by rank (e.g. association P-value), randomly among SNPs, or uniformly in the genome.
- Thin variants based on physical distance, recombination distance, or a combination of both.
- Annotate variants with physical and recombination distances, and physical or recombination regions.
- Annotate variants with nearby genes.
Note: Inthinnerator does not currently thin variants based on estimates of LD, though we may implement that in future. For LD-based thinning, try other programs such as plink.
Important! This page documents version 2 of inthinnerator, which is currently experimental. This means we expect some features not to work, or not to work well, or to work wrongly, or to destroy your computer or sanity.
References. In different incarnations inthinnerator has been used in the guts of several papers. Here are a few:
- Band et al, "Imputation-Based Meta-Analysis of Severe Malaria in Three African Populations", PLOS Genetics (2013)
- Su et al, "Common variants at the MHC locus and at chromosome 16q24.1 predispose to Barrett's esophagus", Nature Genetics (2012)
Acknowledgements. The following people contributed to the design and implementation of qctool:
Contact. For more information or questions, please contact the oxstatgen mailing list at oxstatgen (at) jiscmail.ac.uk
The general process followed by inthinnerator is shown on the right. In brief, inthinnerator picks SNPs by alternating two processes, picking and thinning:
1. Pick a SNP from the set of SNPs remaining, using a picking strategy.
2. Exclude SNPs from the remaining list using a thinning strategy (typically by excluding SNPs in a physical or recombination interval around the picked SNP).
3. Either go back to step 1 and repeat, or stop and write the results.
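The loop above can be sketched in a few lines of Python. This is an illustration only, not inthinnerator's implementation: the function name `thin`, its parameters, and the choice to treat a SNP exactly `min_distance_bp` away as outside the excluded interval are all assumptions made for the example.

```python
import random

def thin(positions, min_distance_bp, max_picks=None, seed=0):
    """Pick positions on one chromosome so that picks are >= min_distance_bp apart.

    Hypothetical helper: illustrates the pick / exclude / repeat loop only.
    """
    rng = random.Random(seed)
    remaining = sorted(set(positions))
    picked = []
    while remaining and (max_picks is None or len(picked) < max_picks):
        # Step 1: pick a SNP from those remaining (here: uniformly at random).
        choice = rng.choice(remaining)
        picked.append(choice)
        # Step 2: exclude the pick itself and every SNP inside the interval around it.
        # Assumption: a SNP exactly min_distance_bp away is kept.
        remaining = [
            p for p in remaining
            if p != choice and abs(p - choice) >= min_distance_bp
        ]
        # Step 3: the while condition decides whether to repeat or stop.
    return sorted(picked)

positions = [100, 150, 900, 1000, 5000, 5400]
picked = thin(positions, min_distance_bp=500)
```

Whatever the random choices, every pair of picked positions ends up at least 500 bp apart, and every input position lies within 500 bp of some pick.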
Picking. Inthinnerator implements the following strategies for picking variants in step 1:
- Pick a SNP at random from the SNPs remaining (-strategy random). This is the default behaviour.
- Pick a SNP uniformly at random from regions of the genome covered by SNPs in the input file (-strategy random_by_position). The -bin-size option controls how genomic coverage is computed.
- Pick the first SNP remaining (-strategy first).
- Pick SNPs according to a specified rank (-rank).
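As a rough illustration of how three of these strategies differ, the sketch below selects one SNP from a list of remaining SNPs. The function `pick`, the `(name, rank)` tuple layout, and the rule that the smallest rank value wins (as it would for a P-value) are assumptions for the example, not documented inthinnerator behaviour. The random_by_position strategy is not covered here.

```python
import random

def pick(remaining, strategy, rng=random):
    """Choose one SNP from `remaining`, a list of (name, rank) tuples in file order.

    Hypothetical helper; the strategy names mirror the options described above.
    """
    if strategy == "random":
        return rng.choice(remaining)
    if strategy == "first":
        return remaining[0]
    if strategy == "rank":
        # Assumption: the smallest rank value (e.g. the smallest P-value) is picked first.
        return min(remaining, key=lambda snp: snp[1])
    raise ValueError(f"unsupported strategy: {strategy}")

# Made-up SNP names and rank values, for illustration only.
snps = [("snp1", 0.20), ("snp2", 0.0004), ("snp3", 0.03)]
first_pick = pick(snps, "first")
rank_pick = pick(snps, "rank")
```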
Thinning. Inthinnerator implements the following strategies for thinning variants in step 2:
- Thin by physical distance, e.g. -min-distance 100kb, -min-distance 20bp, -min-distance 2.5Mb.
- Thin by recombination distance, e.g. -min-distance 0.125cM.
- Thin by recombination distance with a physical margin, e.g. -min-distance 0.125cM+25kb.
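To make the distance syntax concrete, here is a small sketch that splits a specification such as 0.125cM+25kb into a recombination part and a physical part. It is not inthinnerator's own parser: the function name `parse_min_distance` is hypothetical, and the unit multipliers (1kb = 1,000 bp, 1Mb = 1,000,000 bp) and case-insensitive unit matching are assumptions.

```python
import re

# Assumed multipliers for physical units.
_UNITS_BP = {"bp": 1, "kb": 1_000, "mb": 1_000_000}
_TERM = re.compile(r"^([0-9]*\.?[0-9]+)(bp|kb|mb|cm)$", re.IGNORECASE)

def parse_min_distance(spec):
    """Split a spec such as '0.125cM+25kb' into (centimorgans, base_pairs)."""
    cm, bp = 0.0, 0.0
    for term in spec.split("+"):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"cannot parse distance term: {term!r}")
        value, unit = float(match.group(1)), match.group(2).lower()
        if unit == "cm":
            cm += value          # recombination distance
        else:
            bp += value * _UNITS_BP[unit]  # physical distance in base pairs
    return cm, bp

physical_only = parse_min_distance("100kb")
combined = parse_min_distance("0.125cM+25kb")
```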
Stopping. Inthinnerator can choose to stop picking in a few ways:
- Continue until there are no SNPs left to pick (this is the default behaviour).
- Continue until a specified number of SNPs have been picked (-max-picks).
- Continue until one SNP has been picked for each tag specified to -match-tag.
Repeating. Specifying -N x, where x is a number > 1, will cause inthinnerator to repeat the whole thinning process x times. Output files will be numbered from 0 to x-1.
By default inthinnerator will write all variants included and excluded from the thinned list to the output file. The options -suppress-excluded and -suppress-included can be used to adjust this behaviour.
Example output. Basic tab-delimited inthinnerator output looks like this:
# Analysis: "inthinnerator analysis"
# started: 2014-09-02 08:54:33
#
# Analysis properties:
# -g ../imputed_chr22.index (user-supplied)
# -map ../genetic_map_chr#_combined_b37.txt (user-supplied)
# -o example (user-supplied)
#
alternate_ids rsid chromosome position alleleA alleleB iteration pick_index result cM_from_start_of_chromosome region_lower_bp region_upper_bp region_lower_cM region_upper_cM
? kgp14987749 22 16152031 A C 0 2676 picked 0.462220808233324 16150629 16153432 0.452218357350001 0.472216124700895
--- 22-16156144 22 16156144 GC G 0 1855 picked 0.491564660218421 16154742 16157545 0.481562209335098 0.501559976685992
--- 22-16158548 22 16158548 ACT A 0 1325 picked 0.508715795684546 16157146 16159949 0.498713344801223 0.518711112152117
--- 22-16160493 22 16160493 A AC 0 4950 picked 0.522592234320824 16159091 16161894 0.512589783437502 0.532587550788396
--- 22-16163055 22 16163055 TTATC T 0 3397 picked 0.540870607475655 16161653 16164456 0.530868156592332 0.550865923943226
--- 22-16164909 22 16164909 CCT C 0 4524 picked 0.554097814278565 16163507 16166310 0.544095363395242 0.564093130746136
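Output of this shape is straightforward to post-process. The sketch below reads a tab-delimited file, skips the "#" comment lines, and keeps the rows whose result column is "picked". It assumes the column header line is not itself prefixed with "#". The helper name `read_picked` is hypothetical, the header is shortened for brevity, and the second data row (including its "NA" and "excluded" values) is invented purely to show the filter.

```python
import csv
import io

def read_picked(handle):
    """Yield rows whose 'result' column is 'picked', skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        if row["result"] == "picked":
            yield row

example = io.StringIO(
    "# Analysis: \"inthinnerator analysis\"\n"
    "#\n"
    "alternate_ids\trsid\tchromosome\tposition\talleleA\talleleB\titeration\tpick_index\tresult\n"
    "?\tkgp14987749\t22\t16152031\tA\tC\t0\t2676\tpicked\n"
    # Hypothetical row: the values 'NA' and 'excluded' are assumptions, not real output.
    "---\trs150880246\t22\t16152288\tA\tG\t0\tNA\texcluded\n"
)
rsids = [row["rsid"] for row in read_picked(example)]
```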
$ head snps.txt
SNPID rsid chromosome position alleleA alleleB
snp1 kgp14987749 22 16152031 A C
snp2 rs150880246 22 16152288 A G
snp3 rs144530981 22 16152415 G A
snp4 rs142259989 22 16154873 T G
snp5 rs186282246 22 16155262 C T
snp6 22-16156144 22 16156144 GC G
snp7 22-16156442 22 16156442 G GCGT
snp8 22-16156727 22 16156727 T TG
snp9 22-16158548 22 16158548 ACT A
Also, we assume the directory contains files genetic_map_chr1.txt, genetic_map_chr2.txt, ... containing a recombination map. Suitable recombination map files can be downloaded from the SHAPEIT website or the IMPUTE website.
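Thinning by recombination distance needs a centimorgan coordinate for each variant, which can be obtained from a map that lists cM values at a set of physical positions. The sketch below uses linear interpolation between neighbouring map points and clamps positions that fall outside the map. Whether inthinnerator interpolates in exactly this way is an assumption, as are the function name `interpolate_cm` and the made-up map values; reading the map file itself is left out because its column layout depends on the source of the map.

```python
from bisect import bisect_right

def interpolate_cm(map_positions, map_cm, position):
    """Linearly interpolate the cM coordinate of `position`.

    map_positions must be sorted ascending; positions outside the map are clamped.
    """
    if position <= map_positions[0]:
        return map_cm[0]
    if position >= map_positions[-1]:
        return map_cm[-1]
    i = bisect_right(map_positions, position)  # first map point to the right
    x0, x1 = map_positions[i - 1], map_positions[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)

# Made-up map: physical positions (bp) and cumulative genetic positions (cM).
map_positions = [1000, 2000, 4000]
map_cm = [0.0, 0.5, 0.7]
midpoint_cm = interpolate_cm(map_positions, map_cm, 1500)
```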
See also the list of options and the file formats page.
format | recognised extension(s) | notes |
---|---|---|
GEN, BGEN, VCF formats | .gen, .gen.gz, .bgen, .vcf, .vcf.gz | These can be supplied to the -g option. See the qctool webpage for more detailed information on support for these formats. |
Inthinnerator flat-file format | .txt, .csv, .tsv | This is the most flexible format to supply to the -g option. It consists of six columns: SNPID, rsid, chromosome, position, alleleA, alleleB. Optional additional columns include a rank column (specified by name using the -rank-column option) and a tag column (specified using the -tag-column option). Other columns are ignored. Inthinnerator expects a space-, comma- or tab-separated file, according to the file extension. |
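For checking a flat file before passing it to inthinnerator, a reader along the following lines may help. It is a sketch rather than the program's own parser: the function `read_flat_file` is hypothetical, and the mapping of .txt to space, .csv to comma and .tsv to tab is an assumption based on the description in the table.

```python
import csv
import io
import os

REQUIRED_COLUMNS = ["SNPID", "rsid", "chromosome", "position", "alleleA", "alleleB"]
# Assumed mapping from file extension to field separator.
DELIMITER_BY_EXTENSION = {".txt": " ", ".csv": ",", ".tsv": "\t"}

def read_flat_file(handle, filename, extra_columns=()):
    """Return rows as dicts holding the six required columns plus any extra_columns."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in DELIMITER_BY_EXTENSION:
        raise ValueError(f"unrecognised extension: {extension!r}")
    reader = csv.DictReader(
        handle,
        delimiter=DELIMITER_BY_EXTENSION[extension],
        skipinitialspace=True,  # tolerate repeated spaces after a separator
    )
    wanted = REQUIRED_COLUMNS + list(extra_columns)
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Columns not asked for are dropped, mirroring "other columns are ignored".
    return [{name: row[name] for name in wanted} for row in reader]

example = io.StringIO(
    "SNPID rsid chromosome position alleleA alleleB\n"
    "snp1 kgp14987749 22 16152031 A C\n"
    "snp2 rs150880246 22 16152288 A G\n"
)
rows = read_flat_file(example, "snps.txt")
```

A rank or tag column can be requested by name through `extra_columns`, in the same spirit as the -rank-column and -tag-column options.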
Inthinnerator is available either as binaries or as source code.
Binaries
Pre-compiled binaries are available for the following platforms.
Version | Platform | File |
---|---|---|
v2-dev† | Linux x86-64 static build | inthinnerator_v2.0-dev-linux-x86_64.tgz |
v2-dev† | Linux alternative build | inthinnerator_v2.0-dev-scientific-linux-x86_64.tgz |
v2-dev† | Mac OS X | inthinnerator_v2.0-dev-osx.tgz |
†This version of inthinnerator is considered experimental.
To run inthinnerator, download the relevant file and extract it as follows.
$ tar -xzf inthinnerator_v2.0-dev-[machine].tgz
$ cd inthinnerator_v2.0-dev-[machine]
$ ./inthinnerator_v2.0-dev -help
Source
The source code to inthinnerator is available as part of the qctool package on bitbucket. See this page for details.