SNPTEST

SNPTEST is a program for the analysis of single SNP association in genome-wide studies. The tests implemented include

Binary (case-control) phenotypes, single and multiple quantitative phenotypes
Bayesian and Frequentist tests
Ability to condition upon an arbitrary set of covariates and/or SNPs.
Various methods for dealing with imputed SNPs.

The program is designed to work seamlessly with the output of our genotype imputation software IMPUTE [1] and the programs QCTOOL and GTOOL. This program was used in the analysis of the 7 genome-wide association studies carried out by the Wellcome Trust Case-Control Consortium (WTCCC) [2]. Much of the theory behind the implemented tests is described in this paper [3].

SNPTEST has many different features which are illustrated below through a number of different examples that use the datasets provided with the software in the directory example/. These files contain data at 200 SNPs on 1000 individuals that are split into a control cohort and a case cohort. These datasets can be used to try out the tests using both binary (case-control) and quantitative phenotypes.

The latest stable version of SNPTEST is v2.5.6. Changes in this release include bug fixes and enhancements as documented here. To get started, download a pre-built binary for your platform from the download page and run an example command.

Contact

To contact us, please use the OXSTATGEN mailing list - see here for details.

Contributors

The following people contributed to the design and development of SNPTEST:

Changes in v2.5.6

Bug fixes and enhancements:

-method newml now uses a more robust algorithm to fit the association model, specifically a modified Newton-Raphson with line search method. This change makes model fitting more robust when there are parameters with little information (which can arise e.g. for rare variants or for categorical phenotypes with many levels).
The performance of the sample filtering options (i.e. -[in|ex]clude_samples and -[in|ex]clude_samples_where) has been improved. This can make a large difference to analysis startup times when using these options with large cohorts, such as the UK Biobank. A number of other performance improvements have also been implemented.
Log-likelihood calculations for -method newml have been altered to use a compensated summation algorithm (Neumaier summation) to avoid potential accumulation of numerical error when using very large datasets (such as the UK Biobank).
The default value for the -minimum_predictor_count option has been set to 5. This applies when using -method newml and means that by default SNPTEST will not try to test variants with fewer than 5 minor alleles present in the sample.
Documentation on input file formats has been updated to accurately reflect the sample file format used for SNPTEST.
v2.5.6 fixes a bug that prevented using -condition_on with a sample file that does not have ID_1 as the first column name.
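
As an illustration of the compensated summation mentioned above, here is a minimal Python sketch of Neumaier summation. It is not SNPTEST's code; the function names neumaier_sum and naive_sum are made up for this example, and it only shows why a running compensation term helps when terms of very different magnitude are accumulated.

```python
def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for value in values:
        total += value
    return total


def neumaier_sum(values):
    """Sum floats using Neumaier's compensated summation."""
    total = 0.0
    compensation = 0.0  # accumulates the low-order bits lost in each addition
    for value in values:
        new_total = total + value
        if abs(total) >= abs(value):
            # Low-order digits of `value` were lost when adding it to `total`.
            compensation += (total - new_total) + value
        else:
            # Low-order digits of `total` were lost instead.
            compensation += (value - new_total) + total
        total = new_total
    return total + compensation


terms = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(terms))     # the two 1.0 terms are lost
print(neumaier_sum(terms))  # the compensation term recovers them
```

In a log-likelihood the terms are per-sample contributions rather than this artificial sequence, but the same mechanism limits the accumulated rounding error.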

Changes in v2.5.4-beta3

New functionality

-method newml now supports Bayesian tests for association. (Note that only the Gaussian prior options are currently supported, not t distribution priors for main effects.) Two new options, -prior_mean and -prior_sd, have been added to support this.
New options -[in|ex]clude_samples_where allow you to filter samples based on values of a column in the sample file. See the section on making exclusions for more information.
A new option -minimum_predictor_count restricts testing to variants having a specified minimum count of alleles (or of the predictor for non-additive tests). See the section on making exclusions for more information. In some cases this can dramatically speed up scans by skipping the rarest (and hardest to fit) variants.
Experimental: a new option -interaction has been added to support testing for interactions with sample file covariates. (Only supported by -method newml.) See the section on testing for interactions for full details.
Full support for BGEN v1.2 format files - see the file formats page for more information.

Convenience features

SNPTEST now supports streaming input files (currently restricted to BGEN format only). This means that SNPTEST can now be used in a pipeline with other tools such as bgenix to efficiently operate on a subset of data. See the section on streaming input files for more information.
SNPTEST v2.5.4 relaxes the requirements on sample files. The only requirement now is that the first column have type '0' and must reflect the primary (unique) identifier for samples. This column can be named as desired (previously it was required to be named "ID_1"). Further, the ID_2 and missing columns are no longer required to be present.

Changes in v2.5.2

Bug fixes

v2.5.2 is a bug fix release with the following changes:

Fixed support for plink binary (bed) files. Important: support for BED files in SNPTEST v2.5.1 was substantially broken. Users who have used this feature should re-run analyses using SNPTEST v2.5.2.
Restore the behaviour where the default file type (for unrecognised filename extensions) is GEN. This is useful e.g. for processing IMPUTE output files, which are often not named with the .gen.gz extension.

Changes in v2.5.1

New features in v2.5.1

-method newml now supports frequentist add, dom, rec, gen and het models.
-frequentist and -bayesian options now accept models by name (add, dom, rec, gen, het) as well as number (1-5).
Further performance improvements to -method newml.
New file format support. SNPTEST now has support for reading plink binary (bed) format files. (A sample file is still required - see the Input File Formats page for details). Support for VCF v4.2 has also been added.
Experimental: support for testing of categorical traits using multinomial logistic regression. See here for details.
Experimental: support for the new BGEN v1.2 format. See the BGEN v1.2 file format page for details.

Bug fixes and other enhancements

Only print out summaries for phenotypes and covariates actually used.
Fix crashing bug in Hardy-Weinberg computation for large sample sizes.
-mpheno: increase the limit on the number of phenotypes from 10 to 25.
Make -renorm work again.
Add a work-around to handle sample files that have Windows-style (CRLF) line-endings on non-Windows platforms.
-method newml: Fix bug that would include samples with missing gender when doing test on the X chromosome.
-method newml: Make -haploid_genotype_coding hom the default. (Previously SNPTEST would try to guess how haploid samples - e.g. males on the X chromosome - were encoded based on the data at the first unambiguous informative SNP.)
-method newml: Stop with an error message if a non-binary phenotype is specified.

New features in v2.5

Changes relating to model-fitting code

A new option -lower_sample_limit has been added - this can be used to allow tests to be performed when there are fewer than 100 samples.
Continuous covariates are now standardised to have mean 0 and variance 1 by default. This should work around an issue some users encountered when using several principal components with small variance. A new option -use_raw_covariates is provided to turn this feature off.
Output now includes a comment column, which is used to indicate any problems that occur with model fitting for each variant.
A bug relating to convergence criteria for case/control tests has been fixed.
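
To make the covariate standardisation described above concrete, the following Python sketch rescales a covariate to mean 0 and variance 1. It is an illustration only: the function name standardise is hypothetical, missing values are represented as None, and dividing by n (rather than n - 1) is an assumption, since the text does not say which variance estimate SNPTEST uses.

```python
import math


def standardise(values):
    """Rescale non-missing values to mean 0 and variance 1; None stays missing."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # Assumption: population variance (divide by n), not the n - 1 estimate.
    variance = sum((v - mean) ** 2 for v in observed) / len(observed)
    sd = math.sqrt(variance)
    if sd == 0.0:
        raise ValueError("covariate is constant and cannot be standardised")
    return [None if v is None else (v - mean) / sd for v in values]


# A principal component with very small variance, plus one missing value.
pc1 = [0.0001, 0.0002, None, 0.0003]
print(standardise(pc1))
```

After rescaling, a covariate with tiny variance sits on the same scale as the others, which is the situation the default is meant to handle.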

New model-fitting functionality (-method newml)

A new set of model-fitting code, activated using -method newml, has been developed for case/control phenotypes in SNPTEST v2.5. This behaves broadly like -method ml, but supports new features:

X chromosome testing - The allele frequency, info, and association test computations under -method newml are now aware of ploidy, and can be used for testing on the X and Y chromosomes.
Stratified testing - A new option -stratify_on has been added to allow for separate effects in each level of a given discrete covariate. (A motivating use case would be testing association on the X chromosome allowing for heterogeneity of effect between males and females).
A new option -full_parameter_estimates has been added to output estimates and standard errors for all parameters, including those for baseline and covariate parameters in addition to genotype. In addition, the full variance-covariance matrix for parameter estimates is output.
Performance - -method newml is significantly faster than -method ml.

Note: -method newml currently only supports frequentist additive model tests (-frequentist 1).

New output functionality

Output code has been rewritten and has some new features:

Output files now contain meta-information recording the command used. This helps to alleviate the common bioinformatic problem of keeping track of different versions of analyses.
The column naming scheme for columns representing the results of statistical tests has been simplified. Use the -use_long_column_naming_scheme option to use the old column naming scheme.
Output files can now be tab- or comma- delimited as well as the default space-delimited. The separator to use is detected based on the filename extension (.csv, .tsv).
SNPTEST can now write its output to a database table instead of a flat file, using the -odb flag. (Currently, the sqlite3 single-file database format is supported; in future we may add support for other database systems.)

Convenience features

Any command-line arguments which SNPTEST does not recognise are now considered an error. (In previous versions these would be silently ignored.)
Fields in the output files that could not be computed are now written as NA, rather than -1 as previously.
Previous versions of SNPTEST would scan data before processing, resulting in long start-up times for large datasets. This no longer occurs (unless -overlap is used).
The syntax for the -condition_on option has changed slightly to enable conditioning on a SNP by chromosome and position. To do this, use -condition_on position=chr:xxxx where chr:xxxx is the chromosome/position of the SNP to condition on. (The chromosome can be omitted if SNPTEST doesn't know the chromosome.) You can also specify a mode of inheritance, e.g. -condition_on position=chr:xxxx dom.

Program options

A full list of available options can be obtained by running with the -help option, e.g.

./snptest -help

Download

SNPTEST is free to use for academic purposes only. Please see the LICENCE, which is also included with the package.

Pre-compiled versions of the program and example files can be downloaded from the links below. For Linux, you should use the dynamically linked version unless you run into trouble. On some systems library incompatibilities cause problems, so we have provided two statically linked versions as well. If you have any problems getting the program to work on your machine, please contact us.

Please fill out the registration form to receive emails about updates to this software.

To unpack the files, use a command like

tar zxvf snptest_v2.5.1_linux_x86_64_dynamic.tgz

This will create a folder called snptest_v2.5.1_linux_x86_64_dynamic/ containing an executable snptest_v2.5.1 and an example/ directory containing the example files. To see a list of options available in SNPTEST, cd into the directory and type

./snptest_v2.5.1 -help

Current stable version

Version	File
v2.5.6 Mac OS X	snptest_v2.5.6_MacOSX_x86_64.tgz
v2.5.6 CentOS7.8 (x86-64)	snptest_v2.5.6_CentOS_Linux7.8-x86_64_dynamic.tgz

Older versions

Older versions are preserved here for download, but are unsupported.

Version	File
v2.5.4-beta3 Mac OS X	snptest_v2.5.4-beta3_MacOSX_x86_64.tgz
v2.5.4-beta3 Ubuntu 12.04 (x86-64)	snptest_v2.5.4-beta3_linux_x86_64_dynamic.tgz snptest_v2.5.4-beta3_linux_x86_64_static.tgz
v2.5.4-beta3 CentOS6.6 (x86-64)	snptest_v2.5.4-beta3_CentOS6.6_x86_64_dynamic.tgz snptest_v2.5.4-beta3_CentOS6.6_x86_64_static.tgz
v2.5.2 Mac OS X	snptest_v2.5.2_MacOSX_x86_64.tgz
v2.5.2 Ubuntu 12.04 (x86-64)	snptest_v2.5.2_linux_x86_64_dynamic.tgz snptest_v2.5.2_linux_x86_64_static.tgz
v2.5.2 CentOS6.5 (x86-64)	snptest_v2.5.2_CentOS6.5_x86_64_dynamic.tgz snptest_v2.5.2_CentOS6.5_x86_64_static.tgz
v2.5.2 CentOS5 (x86-64)	snptest_v2.5.2_CentOS5_x86_64_dynamic.tgz snptest_v2.5.2_CentOS5_x86_64_static.tgz
v2.5.1 Mac OS X	snptest_v2.5.1_MacOSX_x86_64.tgz
v2.5.1 Ubuntu 12.04 (x86-64)	snptest_v2.5.1_linux_x86_64_dynamic.tgz snptest_v2.5.1_linux_x86_64_static.tgz
v2.5.1 CentOS6.5 (x86-64)	snptest_v2.5.1_CentOS6.5_x86_64_dynamic.tgz snptest_v2.5.1_CentOS6.5_x86_64_static.tgz
v2.5.1 CentOS5 (x86-64)	snptest_v2.5.1_CentOS5_x86_64_dynamic.tgz snptest_v2.5.1_CentOS5_x86_64_static.tgz

Input File Formats

SNPTEST allows the analysis of multiple cohorts of individuals. The data for each cohort is stored in two files. The first file (the genotype file) stores the genotype data for the cohort. The second file (the sample file) stores the IDs and associated covariate and phenotype information of the individuals in the cohort. For the example datasets included with the software, the sample and genotype files for each cohort have the suffixes .sample and .gen respectively.

When using multiple cohorts SNPTEST assumes that

EITHER each cohort has data at the same set of SNPs and in the same order OR each cohort can have a different set of SNPs and the intersection can be tested using the -overlap option.
the sample files for each cohort have exactly the same set of covariates and phenotypes and these occur in the same order in the files

Several file formats are supported:

Sample file formats

SNPTEST sample file format.

Sample files must be in the SNPTEST sample file format. This is a space-separated text file format with a single header row and a second header row that indicates the data type in each column. Subsequent rows represent the data. (Thus if there are N samples in the dataset, there will be N+2 rows in the sample file).

The first column in a sample file is treated specially: it is assumed to contain a unique identifier for each sample. To denote this it must be given type '0'. Subsequent columns can have arbitrary names and data values, and use types from the following table:

Column type	Description
`0`	Mandatory for the first column. (See below for other columns that can have this type.)
`C`	Continuous (numerical) values. Interpreted as a covariate
`P`	Continuous (numerical) values. Interpreted as a phenotype
`B`	Binary values (either 0/1 or the string values case/control can be used). Used as a phenotype
`D`	Discrete or categorical values. These can contain arbitrary strings (not including whitespace) and are internally mapped to discrete values 1...N across all the cohorts included in the analysis.

Any column in a sample file, except the first column containing identifiers, can contain missing data values. The default missing value for samples is the two-character string "NA". This can be changed using the -missing_code option.

An example of this format is:

sample_id sex cov1 cov2 phenotype height
0 D C D B P
S1 M 0.52 high case 1.82
S2 F 0.89 low control 1.73
S3 F 0.77 high control 1.75
S4 F 0.01 NA case 1.64
⋮
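
The layout described above (a header row of names, a second row of column types, then one row per sample) can be read with a few lines of Python. This is a sketch for illustration, not part of SNPTEST: the function name read_sample_file is hypothetical, and converting 'C' and 'P' columns to floats while leaving 'B' and 'D' values as strings is a choice made for this example.

```python
def read_sample_file(lines, missing_code="NA"):
    """Parse sample-file lines into (names, types, list of per-sample dicts)."""
    rows = [line.split() for line in lines if line.strip()]
    names, types, data = rows[0], rows[1], rows[2:]
    if len(names) != len(types):
        raise ValueError("the two header rows have different numbers of columns")
    if types[0] != "0":
        raise ValueError("the first column must have type '0'")
    samples = []
    for fields in data:
        if len(fields) != len(names):
            raise ValueError("row has the wrong number of columns: %r" % fields)
        record = {}
        for index, (name, col_type, value) in enumerate(zip(names, types, fields)):
            if index > 0 and value == missing_code:
                record[name] = None          # missing value (not allowed for IDs)
            elif col_type in ("C", "P"):
                record[name] = float(value)  # continuous covariate or phenotype
            else:
                record[name] = value         # identifier, binary or discrete value
        samples.append(record)
    return names, types, samples


example = """sample_id sex cov1 cov2 phenotype height
0 D C D B P
S1 M 0.52 high case 1.82
S2 F 0.89 low control 1.73
S3 F 0.77 high control 1.75
S4 F 0.01 NA case 1.64
""".splitlines()

names, types, samples = read_sample_file(example)
print(len(samples), samples[3])
```

A different missing-value string, as set by -missing_code, would be passed through the missing_code argument of this sketch.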

Note: Versions of SNPTEST up to 2.5.2, and some other programs such as SHAPEIT2, have additional restrictions on the sample file format. In these programs the first, second and third columns must be called 'ID_1', 'ID_2', and 'missing', and all three columns must have type '0'. These additional type '0' columns are still permitted in SNPTEST, but we recommend omitting them unless you need compatibility with other tools. (In particular, the 'missing' column was used to filter out samples based on the values in this column; if this column is included we recommend setting it to the missing value. Similar filtering can now be achieved using the -[in|ex]clude_samples_where options.) With these restrictions, the above sample file would be encoded as:

ID_1 ID_2 missing sex cov1 cov2 phenotype height
0 0 0 D C D B P
S1 S1 NA M 0.52 high case 1.82
S2 S2 NA F 0.89 low control 1.73
S3 S3 NA F 0.77 high control 1.75
S4 S4 NA F 0.01 NA case 1.64
⋮

Note that for encoding gender information, SHAPEIT2 expects there to be a column called sex or gender, which must be of type '0'. This is not currently permitted in SNPTEST; columns containing gender information should be encoded as type 'D'.

Note: All sample files understood by SNPTEST are also useable with QCTOOL.

Genotype file formats

GEN and gzipped GEN format.

These will be used if the filename extension is .gen or .gen.gz, or if the extension is otherwise unrecognised. GEN is a space-separated text file format with no header row. The columns are: chromosome identifier (this column is optional), SNPID, rsid, position, A allele, B allele, followed by the probabilities of the 'AA', 'AB' and 'BB' genotypes for samples 1, 2, ... in order. Here is an example:

1 SNP1 rs1 100 A G 0 0 1 1 0 0 0 0.88 0.12…
15 SNP2 rs2 200 A C 0 0.9 0.1 1 0 0 0.2 0.6 0.2…
X SNP3 rs3 1004 T G 0.94 0 0.06 1 0 0 0 0.92 0.08…
⋮

GEN can be used to encode genotype probabilities (e.g. from imputation) for biallelic variants. By convention, positions in GEN are 1-based (i.e. the first position on each chromosome is 1).

Note: The original GEN specification did not include the chromosome column. SNPTEST v2.3.0 and above were updated to support this column, which is useful for various workflows. The chromosome column is additional to the other columns, must be the first column in the file, and its presence is autodetected. Files including this column can be created using QCTOOL. Support for this additional column has also been included in recent releases of SHAPEIT.
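
As a rough illustration of the GEN layout, the Python sketch below splits one GEN line into its identifying fields and per-sample genotype probability triples. It is not SNPTEST's parser: the function name parse_gen_line is hypothetical, and detecting the optional chromosome column from a known sample count is an assumption made for this example (the text above does not describe how SNPTEST performs its autodetection).

```python
def parse_gen_line(line, n_samples):
    """Split a GEN line into variant fields and (AA, AB, BB) probability triples."""
    fields = line.split()
    n_leading = len(fields) - 3 * n_samples
    if n_leading == 6:
        chromosome, rest = fields[0], fields[1:]  # optional chromosome column present
    elif n_leading == 5:
        chromosome, rest = None, fields           # original GEN layout
    else:
        raise ValueError("unexpected number of columns for %d samples" % n_samples)
    snpid, rsid, position, allele_a, allele_b = rest[:5]
    probs = [float(x) for x in rest[5:]]
    genotypes = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "chromosome": chromosome,
        "snpid": snpid,
        "rsid": rsid,
        "position": int(position),  # 1-based by convention
        "allele_A": allele_a,
        "allele_B": allele_b,
        "genotypes": genotypes,
    }


variant = parse_gen_line("1 SNP1 rs1 100 A G 0 0 1 1 0 0 0 0.88 0.12", n_samples=3)
print(variant["chromosome"], variant["position"], variant["genotypes"])
```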

BGEN format.

BGEN (binary GEN) format will be used if the filename extension is .bgen. BGEN files are designed to have file sizes similar to or smaller than gzipped GEN files, while supporting faster loading and seeking of individual SNPs. More information on using BGEN files and on converting GEN files to BGEN files can be found on the BGEN file format website. BGEN files can be created using QCTOOL.

Streaming support: SNPTEST can now read BGEN-formatted data from stdin. To get this behaviour, specify the input data file as '-' and specify the filetype as bgen using the -filetype option. For example, SNPTEST can be used with bgenix to operate on a chunk of data, e.g.

bgenix myfile.bgen 1:1000000-2000000 |
./snptest -filetype bgen -data - cohort1.sample ...

which will produce the same output as

./snptest -range 1:1000000-2000000 -data myfile.bgen cohort1.sample

Support for the BGEN v1.1 format was added in SNPTEST v2.2.0. Full support for BGEN v1.2 was added in SNPTEST v2.5.4.

Plink binary format (BED).

As of v2.5.1, SNPTEST has support for plink binary format (BED) files, described here and here. (SNPTEST only understands the SNP-major version of these files, which begins with the three bytes 0x6c, 0x1b, and 0x01, not the sample-major version. Most BED files are in SNP-major format.) A few points to note are:

BED files are identified via the filename extension .bed, and SNPTEST expects to find corresponding .bim and .fam files in the same directory.
Although the .fam file must be present and represent the correct samples, it's still necessary to supply a sample file in the SNPTEST format to the -data option. SNPTEST uses the information in the .sample file for association testing.
Similarly, SNPTEST does not make use of any family structure specified in the .fam file. If family relationships are present, normally the -exclude_samples option should be used to make exclusions as appropriate for association testing.

For example, if the directory contains the files cohort1.bed, cohort1.bim, and cohort1.fam, then the command

./snptest \
-data cohort1.bed cohort1.sample \
-frequentist 1 \
-pheno bin1 \
-method newml \
-o snptest.out

conducts an association test on phenotype bin1 in cohort1.sample using genotypes read from cohort1.bed.

Variant Call Format (VCF).

VCF format (version 4.0, 4.1, or 4.2) will be assumed if the filename extension is .vcf or .vcf.gz. VCF is more complicated than GEN format and there are a few points to bear in mind.

A VCF file can contain several different types of data. The new option -genotype_field has been added to tell SNPTEST which field it should read genotypes from.
SNPTEST can currently use genotype call (GT)-style fields given by fields with two integer values equal to 0 or 1. It can also operate on genotype call probability (GP)-style fields having three or four floating-point values per individual. The fourth value, if present, is interpreted as a NULL call and is ignored.
SNPTEST currently assumes that all variants are biallelic loci and that samples are encoded as diploid. (Haploid samples, such as males on the X chromosome, must be encoded as if having homozygous calls.)
SNPTEST requires that correct metadata be present in the file. In particular, a correct FORMAT definition must be given for all fields in the file (even those such as GT which have standard meanings).
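
The two kinds of VCF field described above can be pictured as both reducing to a triple of genotype probabilities. The Python sketch below shows one way to do that conversion. It is an illustration only: the function names are hypothetical, treating VCF allele 0 as allele A and allele 1 as allele B is an assumption, and returning None for anything other than two 0/1 alleles is a choice made for this example rather than documented SNPTEST behaviour.

```python
def gt_to_probabilities(gt):
    """Map a biallelic diploid GT value such as '0/1' to (P(AA), P(AB), P(BB))."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None  # treated as a missing call in this sketch
    probabilities = [0.0, 0.0, 0.0]
    probabilities[alleles.count("1")] = 1.0  # index = number of B alleles
    return tuple(probabilities)


def gp_to_probabilities(gp):
    """Map a GP-style value with three or four numbers to (P(AA), P(AB), P(BB))."""
    values = [float(x) for x in gp.split(",")]
    if len(values) not in (3, 4):
        raise ValueError("expected three or four values, got %d" % len(values))
    return tuple(values[:3])  # a fourth (NULL call) value is ignored


print(gt_to_probabilities("0/1"))
print(gp_to_probabilities("0.1,0.2,0.7,0.0"))
```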

Support for VCF format was added in v2.3.0. An example of using VCF files can be found below.

Streaming input

A common feature request to assist large jobs run in parallel has been for SNPTEST to support random access into input files. SNPTEST v2.5.4 adds a new feature - streaming input - that assists this.

This feature is currently restricted to BGEN format files. To use this feature specify "-" as the first input genotype file and specify the file type as bgen using the -filetype bgen option. As an example, the following command uses bgenix with SNPTEST to efficiently perform association tests on variants in a 1Mb region from a BGEN file.

bgenix myfile.bgen 20:1000000-2000000 | ./snptest \
-data - cohort1.sample \
-filetype bgen \
-frequentist add \
-method newml \
-o snptest.out \
-pheno bin2

Output file formats

In SNPTEST v2.5 a few changes have been made to the output file format, described below.

Metadata

Metadata reflecting the options used is now written to the top of the file protected by a '#' comment character. For example, here is the metadata from the output for an example command:

# Analysis: "SNPTEST analysis, started 2013-05-21 15:38:16"
#  started: 2013-05-21 15:38:16
# 
# Analysis properties:
#   -data cohort1.gen cohort1.sample (user-supplied)
#   -frequentist 1 (user-supplied)
#   -log /tmp/log (user-supplied)
#   -method newml (user-supplied)
#   -o /tmp/snptest.out (user-supplied)
#   -pheno bin2 (user-supplied)

We have found this feature useful in keeping track of different analyses run using SNPTEST. (You can give the analysis a different name using the -analysis_name option.)
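
Because the metadata is plain '#'-prefixed text, it is easy to recover the options of an earlier run programmatically. The Python sketch below collects the option/value pairs from a header like the one shown above. It is a hypothetical helper, not part of SNPTEST, and it assumes each property line has the form "-option value (annotation)" as in that example.

```python
import re


def read_analysis_properties(lines):
    """Collect '-option value' pairs from the '#' header of an output file."""
    properties = {}
    in_properties = False
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        text = line[1:].strip()
        if text == "Analysis properties:":
            in_properties = True
        elif in_properties and text.startswith("-"):
            # Drop a trailing annotation such as "(user-supplied)".
            text = re.sub(r"\s*\([^()]*\)$", "", text)
            option, _, value = text.partition(" ")
            properties[option] = value.strip()
    return properties


header = [
    '# Analysis: "SNPTEST analysis, started 2013-05-21 15:38:16"',
    "#  started: 2013-05-21 15:38:16",
    "# ",
    "# Analysis properties:",
    "#   -data cohort1.gen cohort1.sample (user-supplied)",
    "#   -frequentist 1 (user-supplied)",
    "#   -method newml (user-supplied)",
    "#   -pheno bin2 (user-supplied)",
]
print(read_analysis_properties(header))
```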

Comma- and tab-separated files, and compression

SNPTEST v2.5 and above support comma-separated and tab-separated files in addition to the default space-separated files. The desired output format is detected based on the filename extension (.csv for csv files, .tsv for tab-separated files, and anything else for space-separated files.)

It's also possible to write gzipped output files - add the .gz extension to the filename to get this behaviour.
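
When reading results back into a script, the same filename conventions can be used to pick the separator. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, '#' metadata lines are skipped, and a name such as results.tsv.gz is assumed to mean gzipped tab-separated output (the text above describes the .gz and separator extensions separately).

```python
import csv
import gzip


def separator_for(filename):
    """Choose the field separator from the filename extension."""
    name = filename[:-3] if filename.endswith(".gz") else filename
    if name.endswith(".csv"):
        return ","
    if name.endswith(".tsv"):
        return "\t"
    return " "  # default: space-separated


def read_results(filename):
    """Read an output file into a list of dicts, skipping '#' metadata lines."""
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rt", newline="") as handle:
        body = [line for line in handle if not line.startswith("#")]
    return list(csv.DictReader(body, delimiter=separator_for(filename)))


print(repr(separator_for("snptest.out")), repr(separator_for("results.tsv.gz")))
```

All values come back as strings, including NA for fields that could not be computed, so numeric columns need converting before use.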

Outputting to a database

SNPTEST v2.5 and above support output to a database instead of a flat file using the -odb option. Currently the sqlite embedded database is supported. (Sqlite databases are entirely contained in a single file, and don't require the use of special server software.) For example, the command

./snptest \
-data cohort1.gen cohort1.sample \
-frequentist 1 \
-method newml \
-odb snptest.sqlite \
-analysis_name my_snptest_analysis \
-table_name TestAnalysis

produces a sqlite3 database named snptest.sqlite. A command like the following could then be used to quickly view the output for a selection of SNPs:

sqlite3 -header -column snptest.sqlite "SELECT * FROM TestAnalysisView WHERE rsid IN ( 'RSID_34', 'RSID_99' ) " | less -S

A major motivation for this feature is that large flat files like the ones SNPTEST outputs can be difficult to work with - in particular, rows are not indexed, and the large number of columns can make viewing particular fields awkward. The snptest.sqlite database above has indices which makes it easy to find data by position or rsid, and queries can be adjusted to select desired columns.
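
The same kind of lookup can be done from a script. The Python sketch below uses the standard sqlite3 module to select rows by rsid from a named view. The helper name fetch_by_rsid is hypothetical, and so that the example runs on its own it queries a small in-memory stand-in whose columns (other than rsid) are invented for the demonstration; against real output you would connect to the file given to -odb instead.

```python
import sqlite3


def fetch_by_rsid(connection, view_name, rsids):
    """Return rows of `view_name` whose rsid is in `rsids`, as dicts."""
    placeholders = ", ".join("?" for _ in rsids)
    # Identifiers cannot be bound as parameters, so quote the view name.
    quoted_view = '"' + view_name.replace('"', '""') + '"'
    query = "SELECT * FROM %s WHERE rsid IN (%s)" % (quoted_view, placeholders)
    cursor = connection.execute(query, list(rsids))
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Stand-in for a real results database; 'position' and 'example_value' are made up.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE TestAnalysisView (rsid TEXT, position INTEGER, example_value REAL)"
)
connection.executemany(
    "INSERT INTO TestAnalysisView VALUES (?, ?, ?)",
    [("RSID_34", 3400, 0.5), ("RSID_50", 5000, 0.1), ("RSID_99", 9900, 0.9)],
)
for row in fetch_by_rsid(connection, "TestAnalysisView", ["RSID_34", "RSID_99"]):
    print(row)
```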

A rough guide to the database schema produced by the above example command is as follows.

Table or view	Description
Variant	Stores a list of variants (SNPs and indels) used by the analysis. Variants are considered the same if they have the same chromosome, position and alleles. (Where a variant has several identifiers, these are stored in the VariantIdentifier table.)
TestAnalysis	This table contains the main analysis results and has one column for each variable SNPTEST computes.
TestAnalysisView	This is a convenience view which links the Variant and TestAnalysis tables. This view closely resembles the results of a traditional flat file output.
AnalysisView	A view which shows analyses that have been stored in the database.
EntityDataView	A view of metadata about analyses, analogous to the metadata example above.

There are a few things to bear in mind when outputting to a database.

In principle, it's possible to have several SNPTEST jobs writing to the same table in the same database file in parallel. However, due to limitations with sqlite, this might not be appropriate when there are many jobs or when jobs are run across a compute cluster.
By default, SNPTEST gives output tables a name of the form Analysis<n>, where n is a uniquely chosen integer. The -table_name option can be used to rename this table. However, if two analyses write to the same table, they must match in the sense that their column names agree.
To make the most of the database format, you will need some knowledge of the SQL language. The Wikipedia page is a good starting point. More information on the sqlite3 command-line client can be found here.

Data Summaries

The simplest use of SNPTEST is to calculate data summaries for each SNP, i.e. genotype counts, allele frequencies, SNP missing data proportions and odds ratios. This is specified using the -summary_stats_only option.

NOTE : within each command box below, most lines end with the '\' character. This is not actually part of the command -- it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split each example command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window (so, for example, you should be able to directly paste these commands into the terminal and hit 'enter' to make them run), but it would be equivalent to put all of the arguments on a single line, separated by spaces.

For example, the command

./snptest \
-summary_stats_only \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out

produces a file ./example/ex.out which contains the data summaries for all 200 SNPs across the two cohorts. Note how the cohorts are specified by placing the relevant genotype and sample files after the -data option in the command. For each cohort the name of the genotype file should be followed by its associated sample file. There is a limit of 18 cohorts that can be specified.
The -o option specifies the output file, i.e. ./example/ex.out. This file contains a line for each SNP and there is a header line which specifies the contents of each column.

Basic output columns

The following table gives a description of each of the entries in the output file.

id	SNP ID (taken from input files)
rsid	RS ID of the SNP (taken from input files)
chromosome	A 2-letter chromosome identifier (if SNPTEST can determine it) or the value NA. See the section on chromosomes.
pos	Base pair position of the SNP
allele_A allele_B	The two alleles at the SNP. allele_A is coded 0 and allele_B is coded 1.
average_maximum_posterior_call	The average maximum posterior probability across all individuals in the sample that are used for the test at each SNP. This is a measure of how much uncertainty there is at each SNP. Samples excluded will be (a) those excluded using the -exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the -method threshold option is used, (d) samples where the sum of the genotype probabilities is less than the value set by the option -total_prob_limit (default 0.1).
info	A measure of the observed statistical information for the estimate of allele frequency of the SNP using all individuals in the sample that are used for the test at each SNP. This measure has a maximum value of 1, which indicates perfect information. Samples excluded will be (a) those excluded using the -exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the -method threshold option is used, (d) samples where the sum of the genotype probabilities is less than the value set by the option -total_prob_limit (default 0.1).
cohort_1_AA cohort_1_AB cohort_1_BB cohort_1_NULL	Counts of AA, AB, BB and NULL genotypes in the 1st cohort. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
cohort_2_AA cohort_2_AB cohort_2_BB cohort_2_NULL	Counts of AA, AB, BB and NULL genotypes for the 2nd cohort (see details above). Subsequent cohorts will be included in a similar way. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
all_AA all_AB all_BB all_NULL all_total	Counts of AA, AB, BB and NULL thresholded genotypes, as well as the total number of samples considered, across all cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
all_maf	Minor allele frequency (MAF) combined across all cohorts.
missing_data_proportion	The proportion of missing data across all cohorts.
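
To illustrate the average_maximum_posterior_call entry, the Python sketch below averages each sample's largest genotype probability, skipping samples whose probabilities sum to less than a limit. It is a simplified reading of the description in the table, not SNPTEST's implementation; the function name is hypothetical and only exclusion rule (d) is modelled.

```python
def average_maximum_posterior_call(genotype_probs, total_prob_limit=0.1):
    """Mean of max(P(AA), P(AB), P(BB)) over samples passing the probability limit."""
    maxima = [
        max(probs)
        for probs in genotype_probs
        if sum(probs) >= total_prob_limit  # rule (d): skip near-empty genotypes
    ]
    if not maxima:
        return None  # nothing to average; would be reported as NA
    return sum(maxima) / len(maxima)


# Three samples from the GEN example plus one sample with no genotype data.
probs = [(0, 0, 1), (1, 0, 0), (0, 0.88, 0.12), (0, 0, 0)]
print(average_maximum_posterior_call(probs))
```

Values close to 1 indicate that most genotypes are called with near certainty, while lower values indicate more uncertain (typically imputed) genotypes.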

If a test for a binary phenotype is being carried out then the following additional fields are included:

controls_AA controls_AB controls_BB controls_NULL	Counts of AA, AB, BB and NULL genotypes across all control cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
cases_AA cases_AB cases_BB cases_NULL	Counts of AA, AB, BB and NULL genotypes across all case cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
cases_maf controls_maf	Minor allele frequencies (MAF) in the cases and controls across all cohorts.
het_OR het_OR_lower het_OR_upper	Estimated odds ratios and lower and upper 95% confidence limits for the heterozygote genotype AB versus the (baseline) AA genotype.
hom_OR hom_OR_lower hom_OR_upper	Estimated odds ratios and lower and upper 95% confidence limits for the homozygote genotype BB versus the (baseline) AA genotype.
all_OR all_OR_lower all_OR_upper	Estimated allelic odds ratios and lower and upper 95% confidence limits for the B allele versus the (baseline) A allele.

NOTE : Odds ratios and their confidence limits are set to NA if they cannot be calculated.
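
For readers who want to reproduce this kind of quantity by hand, the Python sketch below computes an odds ratio and an approximate 95% confidence interval from four counts. It uses the standard Woolf (log odds ratio) interval, which is an assumption: this page does not state which method SNPTEST uses, so results may differ from the output columns. The function name and the example counts are made up.

```python
import math


def odds_ratio_with_ci(case_exposed, case_baseline, control_exposed, control_baseline, z=1.96):
    """Odds ratio and Woolf confidence limits; None if it cannot be calculated."""
    counts = (case_exposed, case_baseline, control_exposed, control_baseline)
    if any(count <= 0 for count in counts):
        return None  # a zero cell makes the ratio or its standard error undefined
    odds_ratio = (case_exposed * control_baseline) / (case_baseline * control_exposed)
    standard_error = math.sqrt(sum(1.0 / count for count in counts))
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Heterozygote example: AB versus baseline AA, with invented counts.
print(odds_ratio_with_ci(case_exposed=30, case_baseline=60,
                         control_exposed=20, control_baseline=80))
```

The None return mirrors the NA convention above for cases where the ratio cannot be calculated.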

See the section on frequentist tests for association for further columns that are output when performing association tests.

How SNPTEST computes counts, frequencies, info measures and missing data proportions

SNPTEST tries to include the 'right' set of samples in computation of genotype counts, NULL call counts, allele frequencies and info measures. To avoid confusion the rules SNPTEST uses to determine samples to include are as follows:

SNPTEST ignores any sample that is present in an exclusion list - these samples are excluded before analysis and are never represented in any of the output fields.
If -method threshold is specified then thresholded genotype counts are used. In all other cases expected counts (the sum of the genotype probabilities for individuals in the sample) are given.
If -summary_stats_only option is given, the computation includes all non-excluded samples.
Otherwise, if an association test is carried out, the computation includes only those samples that a) have non-missing phenotype information and b) have non-missing covariate information (where covariates are specified).
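
The difference between expected and thresholded counts can be sketched as follows. This is a simplified illustration, not SNPTEST's code: it assumes that the samples passed in have already been selected by the rules above, that a genotype is called when its probability is at least the calling threshold, and that the expected NULL count is the leftover probability mass. The function name is hypothetical.

```python
def genotype_counts(probs, call_thresh=None):
    """Count genotypes from (p_AA, p_AB, p_BB) triples, one per sample.

    call_thresh=None -> expected counts (sums of probabilities).
    call_thresh=x    -> thresholded counts (assumed rule: call the most
                        probable genotype if its probability is >= x).
    """
    labels = ("AA", "AB", "BB")
    counts = {"AA": 0.0, "AB": 0.0, "BB": 0.0, "NULL": 0.0}
    for triple in probs:
        if call_thresh is None:
            for label, p in zip(labels, triple):
                counts[label] += p
            counts["NULL"] += max(0.0, 1.0 - sum(triple))
        else:
            best = max(range(3), key=lambda g: triple[g])
            if triple[best] >= call_thresh:
                counts[labels[best]] += 1
            else:
                counts["NULL"] += 1
    return counts

samples = [(1.0, 0.0, 0.0), (0.25, 0.75, 0.0), (0.0, 0.5, 0.5)]
expected = genotype_counts(samples)
called = genotype_counts(samples, call_thresh=0.9)
print(expected)
print(called)
```

With these three samples only the first is confidently called at a threshold of 0.9, whereas every sample contributes to the expected counts.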

NOTE (1): the behaviour of NULL call counts has changed in v2.5. In previous versions, NULL call counts would only reflect samples that had high enough genotype probability to be included in the association test (i.e. those passing the limit set by -total_prob_limit, default 0.1) but whose genotype call probabilities summed to less than one. In v2.5, NULL call counts include in addition all those samples that have non-missing phenotype (and, where relevant, non-missing covariates) but have missing genotypes or whose genotype probabilities are too low to be included in analysis.

NOTE (2): prior to v2.4, NULL call counts would in addition reflect samples whose phenotype and/or covariate information was missing.

Screen Output

You should notice that SNPTEST produces some screen output when run. Information about which data files were specified, the tests selected, the numbers of SNPs, the total number of cases and the total number of controls, information about the covariates and phenotypes in the sample files and information about individuals and SNPs selected for exclusion is all written to the screen. Also, information about the progress of the program is written to the screen. Warning and/or error messages may also be shown. Incorrect use of the options or input files with the wrong format may cause the program to terminate. The screen output can be used to identify any problems that lead to the termination. The flag -printids can be used to print the SNP IDs of each SNP as it is processed which can be useful to identify where problems occur.

For example, the command

snptest \
-data cohort1.gen cohort1.sample \
-pheno bin2 \
-frequentist 1 \
-method newml \
-o /tmp/snptest.out

produces this output:

Welcome to SNPTEST
© University of Oxford 2008-2013
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
Read LICENCE file for conditions of use.

==============

Data Files : 
 -gen files : cohort1.gen 
 -sample files : cohort1.sample 

Tests : 
 -frequentist : 1
 -method newml

reading sample exclusion lists

Inspecting data (this may take some time)...
Sample and exclusions summary :
 - Number of individuals in : (cohort 1) 
                              500        


Reading sample files :
Summary of covariates and phenotypes
 # discrete variables : 3
  cov1 : type = D (Discrete covariate)
  cov2 : type = D (Discrete covariate)
  sex : type = D (Discrete covariate)
 # continuous variables : 2
  cov3 : type = C (Continuous covariate)
  cov4 : type = C (Continuous covariate)
 # phenotypes : 4
  pheno1 : type = P (Continuous phenotype)
  pheno2 : type = P (Continuous phenotype)
  bin1 : type = B (Binary phenotype)
  bin2 : type = B (Binary phenotype)
Covariate summary :
  cov1    : missing  levels
            1        0(244) 1(255)
  cov2    : missing  levels
            1        0(10) 1(76) 2(150) 3(164) 4(76) 5(23)
  cov3    :                 missing  min      max      mean     variance
            (unnormalised): 1        -3.2702  3.8310   0.0703   1.0131  
              (normalised): 1        -3.3189  3.7364   0.0000   1.0000  
               (histogram): 
                                50-|              *                
                                   |             **                
                                   |             *****             
                                   |           * *****             
                                26-|         ********* **          
                                   |         ************          
                                   |         ************          
                                   |      ***************          
                                 3-|      ****************** *     
                                   +-------------------------------
                                    -3.43                      3.85
  cov4    :                 missing  min      max      mean     variance
            (unnormalised): 1        -2.8552  3.1769   0.0324   0.8858  
              (normalised): 1        -3.0681  3.3411   -0.0000  1.0000  
               (histogram): 
                                45-|             *                 
                                   |             ***               
                                   |            **** **            
                                   |            *******            
                                24-|          **********           
                                   |        *************          
                                   |       ***************  *      
                                   |      ***************** *      
                                 3-|   * ********************      
                                   +-------------------------------
                                    -3.17                      3.45
  sex     : missing  levels
            2        female(237) male(261)
Phenotype summary :
  pheno1  :                 missing  min      max      mean     variance
            (unnormalised): 1        -1.0766  5.2884   2.1386   1.4532  
              (normalised): 1        -2.6672  2.6129   -0.0000  1.0000  
               (histogram): 
                                45-|             *                 
                                   |             *                 
                                   |             *   *             
                                   |            ** ***             
                                24-|          **********           
                                   |         ************          
                                   |         *************** *     
                                   |      ********************     
                                 3-| ** *********************** ** 
                                   +-------------------------------
                                    -2.75                      2.70
  pheno2  :                 missing  min      max      mean     variance
            (unnormalised): 1        -2.5428  3.7000   -0.0028  1.0025  
              (normalised): 1        -2.5369  3.6982   -0.0000  1.0000  
               (histogram): 
                                46-|             **                
                                   |            ***                
                                   |          * ***                
                                   |          *******              
                                24-|        *********              
                                   |       **********  *           
                                   |       *************           
                                   |    * ************** *         
                                 3-|* **********************       
                                   +-------------------------------
                                    -2.64                      3.80
  bin1    : missing  levels
            1        1(499)
  bin2    : missing  levels
            1        0(236) 1(263)

Phenotype being used : bin2

Data Summaries : 
 -number of SNPs = (unknown)

Data with missing genotype data threshold and exclusion list applied :
 cohort1.gen : 500


Analyzing Data :
PerVariantComputationManager: using the following computations:
 --> NewMLSinglePhenotypeTest with regression design:
  phenotype   baseline genotype 
       0.00       1.00        ?
       1.00       1.00        ?
       0.00       1.00        ?
       0.00       1.00        ?
       0.00       1.00        ?
       0.00 ~     1.00        ?
       1.00       1.00        ?
         NA       1.00        ?
       0.00       1.00        ?
       0.00       1.00        ?

 --> GenotypeCountComputation( all )
 --> InfoMeasureComputation( all )
 --> GenotypeCountComputation( cases )
 --> InfoMeasureComputation( cases )
 --> GenotypeCountComputation( cohort_1 )
 --> InfoMeasureComputation( cohort_1 )
 --> GenotypeCountComputation( controls )
 --> InfoMeasureComputation( controls )
 scanning... read chunk [1 of (unknown)]... done.
 scanning... read chunk [2 of (unknown)]... done.
 scanning... read chunk [3 of (unknown)]... done.
 scanning... no more data.

finito

Frequentist Association Tests

Three options control Frequentist testing for association: -pheno, -frequentist and -method. The first two are described in the table below, together with two options that control how continuous phenotypes are scaled; -method is described in the next section.

-pheno <name>	This specifies which phenotype you wish to test. The <name> should match one of the phenotypes in the sample file. If the phenotype is binary (B) then a case-control test is carried out. If the phenotype is continuous (P) then a quantitative trait test (i.e. an F-test for a linear model) is carried out. See the file format webpage for more details about how to specify a phenotype in the sample file. If no phenotype is specified then the first phenotype in the sample file is used.
-frequentist <t1>...<tn>	This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the p-value for the test as well as estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1 and this defines the meaning of the betas and their standard errors. For example, when using the additive model the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. When a model cannot be fitted to the data the p-value is set to -1.
-quantile_normalise_phenotypes	(This option applies to continuous phenotypes only). Quantile normalize continuous phenotypes. This is done AFTER samples have been excluded.
-use_raw_phenotypes	(This option applies to continuous phenotypes only). By default continuous phenotypes are mean centered and scaled to have variance 1. This feature can be turned off with this option.

Dealing with genotype uncertainty (the -method option)

The -method option controls the way genotype uncertainty is taken into account when carrying out association tests. The options are listed in the table below.

Method	Phenotypes	Description
-method newml	Case-control or discrete	Use multiple Newton-Raphson (or since v2.5.5 modified Newton-Raphson with line search) iterations to estimate the parameters, summing over the uncertainty in imputed genotypes for each sample. The -max_iterations option controls the maximum number of iterations permitted.
-method ml	Case-control	Deprecated - use -method newml instead. Use multiple Newton-Raphson iterations to estimate the parameters in the missing data likelihood for the model.
-method threshold	Case-control or continuous	Use thresholded genotypes. The calling threshold is controlled by the flag -call_thresh. The default calling threshold is 0.9.
-method expected	Case-control or continuous	Use expected genotype counts (aka genotype dosages).
-method score	Case-control or continuous	Use a missing data likelihood score test. This is equivalent to the -proper option in previous versions, except that if the score test experiences problems at a SNP (usually due to a rare SNP and/or high uncertainty) then -method em is used for this SNP.
-method em	Continuous	Use an EM algorithm to estimate the parameters in the missing data likelihood for the model.

There are two other options that control how the imputed genotypes are treated.

-renorm

The methods described above to deal with genotype uncertainty were developed for use with imputed SNPs. This implies that the genotype probabilities will sum to 1. If probabilistic genotype calls from an algorithm like CHIAMO are used then the probabilities might sum to less than one, and any leftover probability is the probability of a NULL call. The -renorm option renormalizes the genotype probabilities to sum to 1. The default is not to renormalize the probabilities, unless the -method expected option is chosen, in which case renormalization is automatically turned on.

-total_prob_limit <x>

There is an internal lower limit set on the sum of genotype probabilities. The default is 0.1. If this threshold is not met then that genotype is not included in the test. This protects against SNPs with a high proportion of NULL genotypes.
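
A minimal sketch of how these two options might act on one sample's genotype probabilities is given below. It assumes that a genotype is dropped when its total probability is strictly below the limit; the exact comparison used by SNPTEST is not stated here, and the function name is hypothetical.

```python
def prepare_probs(triple, renorm=False, total_prob_limit=0.1):
    """Apply -total_prob_limit and, optionally, -renorm to (p_AA, p_AB, p_BB).

    Returns None when the genotype would be left out of the test.
    Assumption: the limit is applied as 'total < limit'.
    """
    total = sum(triple)
    if total < total_prob_limit:
        return None
    if renorm:
        return tuple(p / total for p in triple)
    return tuple(triple)

print(prepare_probs((0.02, 0.03, 0.0)))             # too little probability mass
print(prepare_probs((0.4, 0.4, 0.0)))               # kept as is, NULL mass 0.2
print(prepare_probs((0.4, 0.4, 0.0), renorm=True))  # rescaled to sum to 1
```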

The statistical details of the Frequentist tests implemented are given in this pdf.

Information measure

If score, ml or em is chosen as the method when using a frequentist test then a relative information measure will be calculated at each SNP. This will be reported in a column ending in _info. The statistical details of these information measures are given in this pdf.

Output column naming convention

From SNPTEST v2.5, the naming convention used for columns of the output file that contain results of statistical tests is

<test_type>_<genetic_model>_<summary_measure>

where the parts of the name are as in the table below. For example, the column containing p-values for a frequentist additive test would be named frequentist_add_pvalue.

Alternatively, the -use_long_column_naming_scheme option can be used to produce names similar to those output by SNPTEST v2.4 and below:

<phenotype_name(s)>_<test_type>_<genetic_model>_<covariate_name(s)>_<summary_measure>

<test_type>	frequentist or bayesian
<genetic_model>	add, dom, rec, gen or het
<summary_measure>	One of pvalue, info, beta_X, se_X or log10_bf depending on the column
<phenotype_name(s)>	The name (or names if -mpheno is used) of the phenotypes used in the test.
<covariate_name(s)>	The name (or names) of the covariates being conditioned upon in the test
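
The two naming schemes can be sketched as a small helper that assembles a column name from its parts. This is only an illustration of the templates above; the function name is hypothetical, and joining multiple phenotype or covariate names with underscores is an assumption.

```python
def column_name(test_type, model, measure,
                phenotypes=(), covariates=(), long_names=False):
    """Build an output column name following the templates above.

    Assumption: multiple phenotype/covariate names are joined with '_'.
    """
    parts = [test_type, model]
    if long_names:
        parts = list(phenotypes) + parts + list(covariates)
    parts.append(measure)
    return "_".join(parts)

short = column_name("frequentist", "add", "pvalue")
long = column_name("frequentist", "add", "pvalue",
                   phenotypes=["bin2"], covariates=["cov1"], long_names=True)
print(short)
print(long)
```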

Example 1 - Case-Control Test

The following example carries out a case-control test for the binary phenotype named bin1.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1

The p-values for the test are given in the column bin1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labeled bin1_frequentist_add_beta_1 and bin1_frequentist_add_se_1.

Example 2 - Quantitative Trait Test

The following example carries out a quantitative trait test for the continuous phenotype named pheno1.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-method score \
-frequentist 1 \
-pheno pheno1

The p-values for the test are given in the column pheno1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labeled pheno1_frequentist_add_beta_1 and pheno1_frequentist_add_se_1.

Bayesian Tests (Bayes Factors)

The Bayesian tests are specified by the -bayesian option, in a similar way to the use of the -frequentist option. The statistical details of the Bayesian tests implemented are given in this pdf.

-bayesian <t1>...<tn>

This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the log10 Bayes Factor for the test as well as posterior mean estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1 and this defines the meaning of the betas and their standard errors. For example, when using the additive model the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. A Bayes factor will always be calculated at a SNP.

The -method option is also used to control the way the Bayesian models are fit, but not all options are valid.

If the phenotype is binary then the only options that work are threshold, expected, score and ml. The score option uses a single Newton-Raphson iteration to estimate the mode of the posterior while the ml option uses multiple iterations.
If the phenotype is quantitative then the only options that work are threshold and expected.

Priors for Binary Trait models

The table below gives a description of the linear predictor of the logistic regression used, the form of the priors used on the model parameters, the default priors used in SNPTEST and the command line option that can be used to change the priors.

Model	Linear Predictor	Priors	Default	Coding	Command line option
Additive	log(p_i/(1-p_i)) = µ + ßG_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.2	G_i is the additive coding of the SNP i.e. AA -> 0, AB ->1, BB -> 2.	-prior_add a₀ a₁ b₀ b₁
Dominant	log(p_i/(1-p_i)) = µ + ßD_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	D_i is the dominant coding of the SNP i.e. AA -> 0, AB -> 1, BB -> 1.	-prior_dom a₀ a₁ b₀ b₁
Recessive	log(p_i/(1-p_i)) = µ + ßR_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	R_i is the recessive coding of the SNP i.e. AA -> 0, AB -> 0, BB -> 1.	-prior_rec a₀ a₁ b₀ b₁
General	log(p_i/(1-p_i)) = µ + ßG_i + qH_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²) q~N(c₀, c₁²)	a₀=0, a₁=1 b₀=0, b₁=0.2 c₀=0, c₁=0.5	G_i is the additive coding of the SNP i.e. AA -> 0, AB ->1, BB -> 2. H_i is the heterozygote coding of the SNP i.e. AA -> 0, AB ->1, BB -> 0.	-prior_gen a₀ a₁ b₀ b₁ c₀ c₁
Heterozygote	log(p_i/(1-p_i)) = µ + ßH_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	H_i is the heterozygote coding of the SNP i.e. AA -> 0, AB ->1, BB -> 0.	-prior_het a₀ a₁ b₀ b₁
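
The codings in the table can be written out as a short sketch. For a sample with genotype probabilities rather than a hard call, the expected value of each coding is used here, which is what one would feed in under -method expected; the function names and the example parameter values are hypothetical.

```python
def codings(p_aa, p_ab, p_bb):
    """Expected value of each genotype coding given (p_AA, p_AB, p_BB)."""
    return {
        "add": p_ab + 2.0 * p_bb,  # G_i: AA -> 0, AB -> 1, BB -> 2
        "dom": p_ab + p_bb,        # D_i: AA -> 0, AB -> 1, BB -> 1
        "rec": p_bb,               # R_i: AA -> 0, AB -> 0, BB -> 1
        "het": p_ab,               # H_i: AA -> 0, AB -> 1, BB -> 0
    }

def additive_log_odds(mu, beta, coding):
    """Linear predictor of the additive model: mu + beta * G_i."""
    return mu + beta * coding["add"]

hard_call = codings(0.0, 1.0, 0.0)      # a certain AB genotype
uncertain = codings(0.25, 0.5, 0.25)    # an imputed genotype
print(hard_call)
print(uncertain)
print(additive_log_odds(mu=-1.0, beta=0.2, coding=uncertain))
```

The general model uses both the "add" and "het" codings, each with its own effect parameter.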

t-distribution priors

In SNPTEST v2 there is a new option to specify the use of t-distribution priors on the genetic effects. The fatter tails of the t-distribution allow more flexibility in specifying beliefs about the size of the genetic effects. This feature is controlled by the following two options.

-t_prior

Specifies the use of t-distribution priors on the genetic effects. Effectively, this option modifies the priors described in the table above i.e. the mean and variance of the t-distributions are specified by the options given in the table above, but the normal distribution is replaced by the t-distribution. NOTE : a t-distribution is only used for the genetic effects i.e. the parameters ß and q in the models above. For example, -bayesian add -t_prior would specify the linear predictor log(p_i/(1-p_i)) = µ + ßG_i and the priors would be µ~N(a₀, a₁²) and ß~t(b₀, b₁², df = 3).

-t_df <x>

The degrees of freedom parameter of the t-distribution. The default value is 3. When this parameter is set very large the prior converges to the normal distribution prior.
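
To make the effect of the degrees of freedom concrete, the sketch below compares a normal prior density with a location-scale t density. It treats b₁ as the scale of the t-distribution, which is an assumption about the parameterisation, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def t_pdf(x, loc, scale, df):
    """Density of a location-scale Student t (scale is an assumed reading of b1)."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - 0.5 * (df + 1) * math.log1p(z * z / df)) / scale

# Prior density of a large effect (beta = 1.0) under the default additive prior scale.
print(normal_pdf(1.0, 0.0, 0.2))   # normal prior: essentially rules it out
print(t_pdf(1.0, 0.0, 0.2, 3))     # t prior with df = 3: far more mass in the tail
print(t_pdf(0.3, 0.0, 0.2, 1e6))   # very large df: close to the normal density
print(normal_pdf(0.3, 0.0, 0.2))
```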

Example - Bayesian Case-Control Test

The following example calculates a Bayesian additive model Bayes Factor for the binary phenotype named bin1 using the default priors.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-bayesian 1 \
-method score \
-pheno bin1

Bayesian Quantitative Trait models and priors

The Bayesian tests for quantitative traits are carried out using the conjugate prior formulation of the linear model using either thresholded genotypes (-method threshold) or the expected genotypes (-method expected). The model is most easily explained through an example. For an additive model the formulation is

y_i = ßG_i + e_i, e_i ~ N(0, σ²),

where
y_i = the residual phenotype for the ith individual. The residual phenotype is calculated by subtracting off a baseline term and estimates of any specified covariates.
G_i = an additive coding for the thresholded or expected genotype of the ith individual.
σ² = the error variance of the model.

This model is compared to the model y_i = e_i, e_i ~ N(0, σ²).

Prior Specification

We use a Normal Inverse Gamma (NIG) prior on the effects ß and σ². This prior has the form

σ² ~ IG(a,b) and ß ~ N(m_ß, V_ßσ²)

This makes it clear that the prior variance on ß is specified in terms of the fraction (V_ß) of the error variance.
It can be shown that the expected non-centrality parameter for the F-test when fitting the above linear model is approximately 2Np(1 − p)ß²/σ²,
where ß and σ² are the true values of the alternative model, p is the allele frequency of the SNP and 2N is the total sample size.
This can be usefully compared to the non-centrality parameter for the case-control test, which is approximately Np(1 − p)ß²
assuming N cases and N controls, where ß is the log-odds ratio parameter of a logistic regression model. So,
if we are happy to put a N(0, 0.2²) prior on ß for a binary trait we might reasonably put the same prior on √2ß/σ in the model above, i.e. ß ~ N(0, 0.02σ²).

In the context of the NIG prior used in SNPTEST v2 this would mean setting m_ß=0 and V_ß = 0.02.
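
The arithmetic behind V_ß = 0.02 can be checked with a couple of lines. If √2ß/σ is given a N(0, s²) prior then ß has prior variance s²σ²/2, so V_ß = s²/2. The function name below is hypothetical.

```python
def qt_prior_variance_fraction(binary_prior_sd):
    """V_beta such that sqrt(2)*beta/sigma ~ N(0, binary_prior_sd**2)."""
    return binary_prior_sd ** 2 / 2.0

v_beta = qt_prior_variance_fraction(0.2)
print(v_beta)  # approximately 0.02, the value passed to -prior_qt_V_b
```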

By default all quantitative phenotypes are centered and scaled to have zero mean and unit variance before analysis. This places all the quantitative phenotypes on a comparable scale. Since most genetic effects will be very small in GWAS it is reasonable to assume that the error variance σ² will be close to 1. Thus using an IG(3,2) prior for σ², which has mean 1 and variance 1, will produce reasonably robust results. The centering and scaling can be turned off with the -use_raw_phenotypes flag. In this case the prior on the error variance σ² should be specified to take this into account.
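
The stated mean and variance of the IG(3,2) prior follow from the usual moments of an inverse gamma distribution with shape a and scale b, namely mean b/(a − 1) and variance b²/((a − 1)²(a − 2)). The sketch below, which assumes that shape/scale parameterisation and uses a hypothetical function name, can also be used to check other choices of a and b when working with raw phenotypes.

```python
def inverse_gamma_mean_var(a, b):
    """Mean and variance of IG(shape=a, scale=b); the variance needs a > 2."""
    if a <= 2:
        raise ValueError("the variance is only finite for a > 2")
    mean = b / (a - 1.0)
    var = b * b / ((a - 1.0) ** 2 * (a - 2.0))
    return mean, var

moments = inverse_gamma_mean_var(3, 2)
print(moments)
```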

The following example uses this model to analyze the phenotype pheno1. This produces a log₁₀ Bayes Factor in the output file.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-bayesian 1 \
-method expected \
-pheno pheno1 \
-prior_qt_mean_b 0 \
-prior_qt_V_b 0.02 \
-prior_qt_a 3 \
-prior_qt_b 2

The 5 genetic models, their priors and how to specify them on the command line are set out in the following table.

NOTE : there are no default values for these parameters. You MUST specify them manually in order to use the Bayesian Quantitative Trait models.

Model name	Model	Priors	Command line options needed
Additive	y_i = ßG_i + e_i, e_i ~ N(0, σ²)	ß~N(b₀, V_ßσ²) σ² ~ IG(a,b)	-prior_qt_mean_b b₀ -prior_qt_V_b V_ß -prior_qt_a a -prior_qt_b b
Dominant	y_i = ßD_i + e_i, e_i ~ N(0, σ²)	ß~N(b₀, V_ßσ²) σ² ~ IG(a,b)	-prior_qt_mean_b b₀ -prior_qt_V_b V_ß -prior_qt_a a -prior_qt_b b
Recessive	y_i = ßR_i + e_i, e_i ~ N(0, σ²)	ß~N(b₀, V_ßσ²) σ² ~ IG(a,b)	-prior_qt_mean_b b₀ -prior_qt_V_b V_ß -prior_qt_a a -prior_qt_b b
General	y_i = ßG_i + qH_i + e_i, e_i ~ N(0, σ²)	ß~N(b₀, V_ßσ²) q~N(b₁, V_qσ²) σ² ~ IG(a,b)	-prior_qt_mean_b b₀ -prior_qt_V_b V_ß -prior_qt_mean_q b₁ -prior_qt_V_q V_q -prior_qt_a a -prior_qt_b b
Heterozygote	y_i = ßH_i + e_i, e_i ~ N(0, σ²)	ß~N(b₀, V_ßσ²) σ² ~ IG(a,b)	-prior_qt_mean_b b₀ -prior_qt_V_b V_ß -prior_qt_a a -prior_qt_b b

Model averaging option

The option -mean_bf is used to average over a set of Bayesian models. This can be used for both binary and quantitative phenotype tests. This option does not currently work with the -mpheno option.

-mean_bf <w1>...<wn>

Specifies that a log10 Bayes factor is calculated for a weighted average over the models specified by -bayesian, with weights given by <w1>...<wn>. For example, -bayesian 1 4 -mean_bf 9 1 would calculate a Bayes factor for a weighted average of the additive and general models where the additive model is given weight 9 and the general model is given weight 1. The log10 Bayes factor will be written in a column with the label mean_bf.
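
One natural reading of a weighted average over models is the average of the individual Bayes factors with the weights normalised to sum to 1. The sketch below implements that reading on the log10 scale; it is an assumption about what -mean_bf reports rather than a description of SNPTEST's code, and the function name is hypothetical.

```python
import math

def mean_log10_bf(log10_bfs, weights):
    """log10 of the weighted average Bayes factor (weights are normalised).

    The largest log10 BF is factored out so that very large Bayes
    factors do not overflow when exponentiated.
    """
    total = sum(weights)
    largest = max(log10_bfs)
    scaled = sum(w / total * 10.0 ** (l - largest)
                 for l, w in zip(log10_bfs, weights))
    return largest + math.log10(scaled)

# Additive model log10 BF = 1.0 (weight 9), general model log10 BF = 0.0 (weight 1).
averaged = mean_log10_bf([1.0, 0.0], [9, 1])
print(averaged)
```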

Bayesian Multiple Phenotype Test

A Bayesian test for association of a SNP with multiple quantitative phenotypes can be carried out with the -mpheno option.

The model we use is the Bayesian Multivariate Linear model which is specified by

(y_i1,....,y_iq)^T= G_i (ß₁,...,ß_q)^T + (e_i1,...,e_iq)^T where (e_i1,...,e_iq)^T ~ N_q(0, Σ)

where (y_i1,....,y_iq) is the vector of the q residual phenotypes measured on the ith individual. The residual phenotype is calculated by subtracting off a baseline term and estimates of any specified covariates. Further we assume that each of these phenotypes has been centered and scaled to have zero mean and unit variance. Also, G_i is the coded version of the SNP genotype for the ith individual.

We use the conjugate prior for this model. This is an inverse Wishart IW(c,Q) prior on the error covariance matrix Σ and a matrix normal (N) prior on the vector of parameters

(ß₁,...,ß_q) - M ~ N(V, Σ),

where M is a mean vector and V is a constant. For more details of the matrix normal distribution see

A. P. Dawid (1981) Some matrix-variate distribution theory : notational considerations and a bayesian application. Biometrika 68:265-274.

This distribution has the property that the covariance matrix of (ß₁,...,ß_q) - M is given by VΣ. By a similar argument to that used above when discussing how to set the priors for a single quantitative phenotype we recommend setting V=0.02 and M = (0,...,0). Since the phenotypes have been centered and scaled we also recommend placing a IW(6,4I_q) prior on Σ where I_q is the (qxq) Identity matrix. The centering and scaling can be turned off with the -use_raw_phenotypes flag.

The fit of the full model (M₁) in which (ß₁,...,ß_q) are unconstrained is compared to the fit of the null model (M₀) in which (ß₁,...,ß_q) = 0. The Bayes factor calculated then has the form

BF = P(Data | M₁) / P(Data | M₀).

The following example uses this model to analyze the phenotypes pheno1 and pheno2 jointly. This produces a log₁₀ Bayes Factor in the output file.

NOTE : the Inverse-Wishart prior is set with the options -prior_mqt_c <c> and -prior_mqt_Q <Q>. This specifies an IW(c,QI_q).

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-bayesian 1 \
-method expected \
-mpheno pheno1 pheno2 \
-prior_qt_mean_b 0 \
-prior_qt_V_b 0.02 \
-prior_mqt_c 6 \
-prior_mqt_Q 4

Multinomial phenotype test

SNPTEST v2.5.1 includes support for testing categorical traits using a multinomial logistic regression likelihood. This extends the logistic regression implemented for binary traits to multiple categories. This feature is currently considered experimental and this page provides initial documentation on its use.

To specify a multinomial trait you must:

Use -method newml
Specify -pheno <x> where x is a discrete column (type 'D') in the sample file.

You can also optionally use -baseline_phenotype to adjust which level of the phenotype SNPTEST treats as the baseline.

Understanding the parameters

Parameters in the multinomial model can be thought of as forming a matrix (β_ij), where β_ij is the effect size for predictor j (i.e. the jth column of the design matrix) on non-baseline outcome level i. SNPTEST internally renumbers these parameters as β_k, k = 0, ..., K-1, where K = (number of non-baseline outcome levels) × (number of predictors). To allow parameter identification, the output contains columns named in the following way:

frequentist_<model>_beta_<k>:<predictor>/<phenotype>=<level>

To avoid cluttering the output, corresponding standard errors and other columns are simply identified by number, e.g. the column containing standard errors for the kth parameter is named

frequentist_<model>_se_<k>

Example

For example, suppose the column 'bin3' contains a phenotype with levels control, case1 and case2. The command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist add \
-method newml \
-pheno bin3 \
-baseline_phenotype control

fits a multinomial logistic regression at each SNP with a single additive genetic effect parameter, using "control" as the baseline outcome. SNPTEST will output the following columns relevant to the parameters:

column	description
frequentist_add_beta_1:add/bin3=case1	Effect size parameter (β₁) for outcome case1 relative to control
frequentist_add_beta_3:add/bin3=case2	Effect size parameter (β₃) for outcome case2 relative to control.
frequentist_add_se_1	Standard error for β₁
frequentist_add_se_3	Standard error for β₃
frequentist_add_cov_1,3	Covariance between the two parameters.
frequentist_add_wald_pvalue_1	Wald test p-value for β₁ (based on the effect size and standard error).
frequentist_add_wald_pvalue_3	Wald test p-value for β₃

Important: the particular order or numbering of parameters may change in future.

Example (general model test)

Similarly, the command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist gen \
-method newml \
-pheno bin3 \
-baseline_phenotype control

will fit a model with both additive ('add') and heterozygote ('het') parameters, with effect size columns named as follows.

column	description
frequentist_add_beta_1:add/bin3=case1	Effect size parameter (β₁) for additive effect on outcome case1.
frequentist_add_beta_2:het/bin3=case1	Effect size parameter (β₂) for heterozygote effect on outcome case1.
frequentist_add_beta_4:add/bin3=case2	Effect size parameter (β₄) for additive effect on outcome case2.
frequentist_add_beta_5:het/bin3=case2	Effect size parameter (β₅) for heterozygote effect on outcome case2.

Similarly, numbered columns for the standard errors, covariances and Wald test p-values will be output.
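
The parameter numbers in the two examples are consistent with numbering the matrix (β_ij) row by row, with a baseline (intercept) column first in the design matrix, so that k = i × (number of predictors) + j. The sketch below reproduces the indices under that assumption. As noted above the numbering may change in future, so column names should be read from the output file rather than computed; the function name is hypothetical.

```python
def genetic_parameter_indices(predictors, levels, genetic=("add", "het")):
    """Map (level, predictor) -> k, assuming row-major numbering of beta_ij.

    predictors: design matrix columns in order, baseline first (assumed).
    levels: the non-baseline outcome levels in order.
    Only the genetic predictors are returned, as in the default output.
    """
    indices = {}
    for i, level in enumerate(levels):
        for j, predictor in enumerate(predictors):
            if predictor in genetic:
                indices[(level, predictor)] = i * len(predictors) + j
    return indices

additive = genetic_parameter_indices(["baseline", "add"], ["case1", "case2"])
general = genetic_parameter_indices(["baseline", "add", "het"], ["case1", "case2"])
print(additive)
print(general)
```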

Other options

By default, as above only the parameters corresponding to genetic effects are output. The -full_parameter_estimates option can be used to output parameter estimates for all columns of the design matrix.

Conditional Tests of Association

There are several options that control how covariates and/or SNPs can be conditioned upon in order to carry out a test of association. These options work with both the Frequentist and Bayesian association tests.

-cov_names <name_1> ... <name_n>	Condition upon the covariates in the sample files with names name_1,...., name_n.
-cov_all	Condition upon all the covariates in the sample files.
-cov_all_discrete	Condition upon all the discrete covariates (D) in the sample files.
-cov_all_continuous	Condition upon all the continuous covariates (C) in the sample files.
-condition_on <snp_1> <model_1> ... <snp_n> <model_n>	Condition upon a list of SNPs with IDs given by snp_1,...,snp_n. For each SNP a list of models can be supplied; the choices are add, dom, rec, het, or gen. Here "gen" is shorthand for "add het", i.e. condition on additive and heterozygote dosages. If no model is supplied, the default "add" is used. These covariates are internally added to the sample file as continuous (type C) covariates and appear in the covariate summary in the screen output.

Conditioning upon one (or more) covariate means that the test of association being carried out is testing for a genetic effect over and above that explained by the covariate(s). Discrete covariates are added into the model as factors i.e. a different baseline term for each level of the factor is fitted.
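
Fitting a discrete covariate as a factor can be pictured as replacing the single baseline column of the design matrix with one indicator column per level. The sketch below shows that expansion; it is a generic illustration that assumes no missing values, not SNPTEST's internal design matrix code, and the function name is hypothetical.

```python
def factor_columns(values):
    """Expand a discrete covariate into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1.0 if value == level else 0.0 for level in levels]
            for value in values]
    return levels, rows

levels, rows = factor_columns([0, 1, 1, 0])
print(levels)
for row in rows:
    print(row)
```

Each sample has exactly one indicator set to 1, so each level of the factor gets its own baseline term.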

Example 1 - Mantel-Haenszel Test

If a single Discrete (D) covariate is conditioned upon then this is equivalent to a Mantel-Haenszel test. This is a test for a common genetic effect where each group is allowed to have its own baseline effect. Here is an example of conditioning upon the binary covariate called cov1 in the sample files.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin2 \
-cov_names cov1

This produces an output file ./example/ex.out which contains a column with header bin2_frequentist_add_cov1_pvalue that contains the p-values for the test conditional on the covariate.
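
For intuition about a common effect with group-specific baselines, the classical Mantel-Haenszel estimate of a common odds ratio across strata is sketched below. This is the textbook estimator applied to hypothetical 2x2 allele count tables, one per level of the covariate; SNPTEST itself fits a regression model rather than using this formula.

```python
def mantel_haenszel_or(tables):
    """Common odds ratio from 2x2 tables (a, b, c, d), one per stratum.

    a = cases with allele B, b = cases with allele A,
    c = controls with allele B, d = controls with allele A.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Hypothetical counts for the two levels of a binary covariate.
# Both strata have odds ratio 2 but different case/control proportions.
strata = [(20, 10, 10, 10), (40, 20, 10, 10)]
common_or = mantel_haenszel_or(strata)
print(common_or)
```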

Example 2 - Conditioning on covariates that code for population structure

For association studies it has become popular to use eigenvectors from a principal component analysis (PCA) to code for unobserved population structure. This is carried out in SNPTEST by setting the eigenvectors as Continuous (C) covariates in the sample file and then conditioning upon these covariates. Here is an example of conditioning upon the two continuous covariates called cov3 and cov4 in the sample files.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-cov_names cov3 cov4

Example 3 - Conditioning on SNPs

In regions where an association has been found it is often desirable to carry out a test conditioning upon the most associated SNP to look for secondary signals of association, which may highlight allelic heterogeneity or possibly a haplotype effect in the region. This can be carried out in SNPTEST using the -condition_on option. A list of SNPs can be specified along with the coding to be applied to those SNPs. The following example carries out a test of association conditional upon the SNPs with IDs RSID_10 and RSID_20. The SNP RSID_10 is coded as an additive effect while SNP RSID_20 is coded as a general effect.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-condition_on RSID_10 add RSID_20 gen

The p-values from this command occur in a column labelled bin1_frequentist_add_RSID_10:additive_dosage_RSID_20:additive_dosage_RSID_20:heterozygote_dosage_score_pvalue.

A summary of the conditioned-on dosages appears in the main covariate summary in the screen output.
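To make the model codings concrete, the following is a minimal Python sketch of how add, dom, rec and het dosages could be derived from one sample's genotype probabilities. The function name, the AA/AB/BB ordering and the convention that dosages count the B allele are assumptions made for illustration; this is not SNPTEST's own code.

```python
def conditioning_dosages(p_aa, p_ab, p_bb, models=("add",)):
    """Return illustrative dosages for one sample at a conditioning SNP.

    Assumes probabilities are ordered AA, AB, BB and that dosages count
    the B allele (an assumption made for this sketch).
    """
    codings = {
        "add": p_ab + 2.0 * p_bb,  # expected number of B alleles
        "dom": p_ab + p_bb,        # probability of carrying at least one B
        "rec": p_bb,               # probability of carrying two B alleles
        "het": p_ab,               # probability of being heterozygous
    }
    expanded = []
    for model in models:
        # "gen" is shorthand for "add het"
        expanded.extend(["add", "het"] if model == "gen" else [model])
    return {name: codings[name] for name in expanded}
```

For a sample that is certainly heterozygous, conditioning_dosages(0.0, 1.0, 0.0, ("gen",)) gives an additive dosage of 1.0 and a heterozygote dosage of 1.0, which corresponds to the two covariates added for a SNP conditioned on with the gen model.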

In case of SNPs for which a useful ID is not present, the syntax -condition_on position=chr:xxxx (or -condition_on position=xxxx if chromosome information is missing) can be used, where chr:xxxx is the chromosome and position of the SNP to be conditioned on. ( position can be shortened to pos if desired.)

Testing for interactions

SNPTEST v2.5.4 and above support testing for interactions with covariates defined in the sample file. To test for interactions you must:

Use -method newml
Specify -interaction <x> <y>... where x, y, etc. are the names of columns in the sample file.
You must also include the same columns as covariates, e.g. using -cov_names <x> <y>....

Note: this feature is currently experimental. Please cross-check results with other methods.

Understanding the output columns

The -interaction x option tells SNPTEST to add a term corresponding to the interaction to the design matrix. This column is obtained by multiplying the genotype predictor by the value of the interaction column (or, for discrete covariates, by an indicator for each of the variable's possible values). The corresponding output columns are labelled with the predictor name according to the following naming scheme:

column	description
frequentist_<model>_beta_<n>:<mode>/<phenotype>=<value>	Main effect parameter (usually β₁).
frequentist_<model>_beta_<m>:<mode>x<predictor>/<phenotype>=<value>	Effect size parameter (β_m) for cases relative to controls.
frequentist_add_se_<n>	Standard error for the corresponding effect parameter.
frequentist_add_wald_pvalue_<n>	Wald test P-value for the corresponding effect parameter.
frequentist_add_cov_<n>,<m>	Covariance in the loglikelihood between parameters n and m.
frequentist_<model>_lrt_pvalue	Likelihood ratio test P-value for the full model including main genetic and interaction effects, versus the null model (which includes the covariates)

As elsewhere, predictors corresponding to covariates are named as the name of the sample file column (for continuous variables) or as <name>=<value> for discrete covariates, where value takes on any of the possible values of the covariate.
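The construction of interaction columns described above can be sketched in Python as follows. This is only an illustration of the multiplication step: the function name and the column labels are hypothetical, and whether SNPTEST drops a reference level for discrete covariates is not covered here.

```python
def interaction_columns(genotype, covariate, discrete):
    """Build interaction column(s) for an illustrative design matrix.

    genotype:  list of genotype predictor values, one per sample
    covariate: list of covariate values, one per sample
    discrete:  True if the covariate is discrete (type D)
    """
    if not discrete:
        # Continuous: genotype predictor times the covariate value.
        return {"addxcov": [g * c for g, c in zip(genotype, covariate)]}
    columns = {}
    for level in sorted(set(covariate)):
        # Discrete: genotype predictor times an indicator for each level.
        columns[f"addxcov={level}"] = [
            g * (1 if c == level else 0) for g, c in zip(genotype, covariate)
        ]
    return columns
```

With three samples carrying 0, 1 and 2 copies of the allele and a discrete covariate taking values 1, 2 and 1, the sketch produces one column per level, each equal to the genotype predictor in samples at that level and zero elsewhere.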

Example

For example, the command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method newml \
-pheno bin1 \
-cov_names cov1 cov2 \
-interaction cov2

fits a logistic regression model for the phenotype bin1 including cov1 and cov2 as covariates. It includes a main effect for the additive genotype add and an interaction effect add × cov2. The output columns reflecting parameter estimates will be

frequentist_add_beta_1:add/bin1=1
frequentist_add_beta_2:addxcov2=1/bin1=1
frequentist_add_beta_3:addxcov2=2/bin1=1
frequentist_add_beta_4:addxcov2=3/bin1=1
frequentist_add_beta_5:addxcov2=4/bin1=1
frequentist_add_beta_6:addxcov2=5/bin1=1

Note: SNPTEST tries to give useful screen output to help diagnose issues. In this case it will print out the design matrices for null and alternate models, which look like this:

 -- Model #2 ("frequentist_add"): normal-weighted:LogisticRegressionLogLikelihood( 1000 samples ):
        phenotype   baseline   add addxcov1=1 cov1=1 cov2=1 cov2=2 cov2=3 cov2=4 cov2=5  cov3 
    0           1          1     ?          ?     NA      0      1      0      0      0 -1.5881
    1           1          1     ?          ?      0     NA     NA     NA     NA     NA -0.771098
    2           1          1     ?          ?      1      0      0      0      0      1    NA
    3           1          1     ?          ?      1      0      0      1      0      0 2.75584
    4           1          1     ?          ?      0      1      0      0      0      0 -1.60543
    5           1          1     ?          ?      1      0      0      0      1      0 -0.655319
    6          NA ~        1     ?          ?      1      0      0      1      0      0 -0.117888
                .          .     .          .      .      .      .      .      .      .     .
                .          .     .          .      .      .      .      .      .      .     .
                .          .     .          .      .      .      .      .      .      .     .
  996           0          1     ?          ?      0      1      0      0      0      0 -1.08686
  997           0          1     ?          ?      0      0      1      0      0      0 -0.298678
  998           0          1     ?          ?      1      0      1      0      0      0 -1.93056
  999           0          1     ?          ?      1      1      0      0      0      0 0.183302

This shows the design matrix for the first and last few samples. Columns are: a column of 1s (corresponding to the baseline term), a column for the additive predictor (marked as ? because this varies between SNPs), a column for the interaction (also marked as ?), columns for each level of each of the discrete covariates, and a column for each continuous covariate.

Specifying which samples or SNPs to include

The following table lists options that can be used to adjust what data is included in an analysis. See below for a full description of each option.

option	description
-exclude_samples <file1> <file2>...	Exclude samples whose identifier is listed in the given file(s).
-include_samples <file1> <file2>...	Include only samples whose identifier is listed in the given file(s).
-exclude_samples_where <condition>	Exclude samples meeting the specified condition. If specified multiple times, the conditions are ANDed together, i.e. more samples will be excluded.
-include_samples_where <condition>	Include only samples meeting the specified condition. If specified multiple times, the conditions are ORed together, i.e. more samples will be included.
-miss_thresh <x>	Deprecated. Exclude samples based on a missingness threshold. Use -[ex|in]clude_samples_where instead.
-missing_code <a>,<b>...	Specify that values a, b etc. should be treated as missing values in the sample file.
-range <range>	Include only SNPs or variants within the given genomic interval.
-snpid <id1> <id2>...	Include only SNPs with the given ID(s).
-minimum_predictor_count <x>	Include only SNPs where the count of the minor allele (or for nonadditive tests, the count of the minor predictor, described below) is at least x.
-overlap	Operate on an overlapping set of positions between cohorts. (This only applies if multiple cohorts are included in the analysis.)

Specifying lists of individuals (-[ex|in]clude_samples)

The -exclude_samples option can be used to specify a file containing a list of individuals that should be excluded from the analysis. The IDs in the file should be the ID that appears in the first column of the sample files. For example, the file ./example/samples.list contains a list of the IDs for the first 10 individuals in the example data files. To exclude these individuals from the analysis we can use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-exclude_samples ./example/samples.list

You should notice that the screen output reports that it has read in 10 sample IDs and that these individuals were excluded.

If multiple lists are provided to this option, the lists are internally concatenated, i.e. samples in the union of the lists will be excluded.

Similarly, from SNPTEST v2.5.4 the option -include_samples can be used to include samples (i.e. to exclude samples not in the list.) If multiple lists are provided then the lists are internally concatenated, i.e. more samples will be included.
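The way multiple lists combine can be pictured with a short Python sketch. The function and argument names are hypothetical, and the behaviour when both kinds of list are supplied together is an assumption of this sketch rather than something stated above.

```python
def samples_to_analyse(all_ids, exclude_lists=(), include_lists=()):
    """Apply concatenated exclusion and inclusion lists to sample IDs."""
    # Lists given to the same option are concatenated, i.e. their union is used.
    excluded = set().union(*exclude_lists)
    # With no inclusion list, every sample is a candidate for inclusion.
    included = set().union(*include_lists) if include_lists else set(all_ids)
    # Assumption: a sample listed in both is excluded.
    return [s for s in all_ids if s in included and s not in excluded]
```

For example, excluding with the two lists ["a"] and ["b"] from samples a, b, c and d leaves c and d, while including with the lists ["a"] and ["c"] keeps a and c.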

Excluding individuals based on values in the sample file (-[in|ex]clude_samples_where)

In SNPTEST v2.5.4 two new options, -include_samples_where and -exclude_samples_where, were added to assist in specifying sets of samples to operate on. The full syntax of these options is:

-[in|ex]clude_samples_where <name> [=|==|!=] <value>

where name is the name of a column in the sample file(s) and value is a value (which may be a numerical value or a string.) Comparisons '=' and '==' are synonymous and identify samples where the specified column equals the given value. The comparison '!=' stands for 'not equals' and identifies samples where the value differs from the given value.

Note: Most UNIX shells, such as bash, split arguments on whitespace. To work around this you must generally enclose the condition in single quote marks. (An alternative that sometimes works is to write it without any whitespace, e.g. <name>=<value>.)

For example, the command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method newml \
-pheno bin2 \
-include_samples_where cov2=2 \
-include_samples_where cov2=3

tests for association with bin2 in individuals having cov2 equal to 2 or 3.
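The selection performed by repeated -include_samples_where conditions can be sketched in Python as below. The helper names are hypothetical, and values are compared as strings, which is a simplification assumed for this sketch (SNPTEST may compare numerical values differently).

```python
import re


def parse_condition(text):
    """Split '<name> [=|==|!=] <value>' into (name, operator, value)."""
    match = re.fullmatch(r"\s*(\w+)\s*(==|!=|=)\s*(.+?)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse condition: {text!r}")
    name, op, value = match.groups()
    # '=' and '==' are synonymous, so normalise both to '=='.
    return name, ("!=" if op == "!=" else "=="), value


def matches(sample, condition):
    """Return True if a sample (a dict of column values) meets the condition."""
    name, op, value = parse_condition(condition)
    equal = str(sample[name]) == value
    return equal if op == "==" else not equal


def include_where(samples, conditions):
    """Keep samples meeting any condition (conditions are ORed together)."""
    return [s for s in samples if any(matches(s, c) for c in conditions)]
```

Applied to samples with cov2 equal to 1, 2 and 3, the conditions cov2=2 and cov2=3 keep the second and third samples, mirroring the command above.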

Excluding individuals based on levels of missingness (-miss_thresh)

The -miss_thresh option can be used to exclude individuals whose proportion of missing data (as reflected in the missing column of the sample file) exceeds some level. The missing data proportion of each individual is specified in the 3rd column of the sample file. For example, to specify a maximum missing data proportion of 1%,

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-miss_thresh 0.01

You should notice that the screen output reports that the number of individuals included after the missing data threshold and any exclusion list have been applied is less than the original number of individuals in the raw files.

Note: this feature is deprecated, meaning that we will remove it in a future version, and we generally advise against using it. The -[ex|in]clude_samples_where options can be used instead.

Adjusting how values are treated as missing in the sample files (-missing_code)

When carrying out a statistical test that conditions on covariates or uses a quantitative phenotype, any individual with at least one missing value of a covariate or phenotype will be excluded from the test. The default code for missing covariates or phenotypes in the sample files is NA (see Input File Formats). The option -missing_code can be used to specify a list of comma-separated alphanumeric codes that will be interpreted as missing values. For example, the syntax -missing_code NA,-999 will treat any value equal to -999 or NA in the sample files as missing.
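The effect of the missing codes on which individuals enter a test can be sketched in Python. Both function names are hypothetical, and sample-file values are treated as plain strings for the purpose of this illustration.

```python
def parse_missing_codes(argument):
    """Split a comma-separated -missing_code value into individual codes."""
    return [code for code in argument.split(",") if code]


def complete_cases(samples, columns, missing_codes=("NA",)):
    """Keep samples with no missing value in any phenotype or covariate used."""
    codes = set(missing_codes)
    return [s for s in samples if all(s[c] not in codes for c in columns)]
```

With the codes NA and -999, an individual with a covariate of NA or a phenotype of -999 is dropped, and only individuals with complete data in the columns used by the test remain.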

Excluding or including SNPs

Specify a range of SNPs by base-pair position (-range)

The -range option can be used to analyze only those SNPs whose base-pair position lies within a given set of intervals. The following example only carries out tests on SNPs within the intervals [20000,30000] and [40000, 50000].

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-range 20000-30000 40000-50000

In a range specification the start or end of the range can be omitted. For example, the syntax -range 50000- will restrict to all SNPs with position 50000 or above.
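The interval logic can be illustrated with a small Python sketch. The parser below is hypothetical and handles only the start-end forms shown in this section; intervals are treated as closed, matching the [20000,30000] notation used above.

```python
def parse_range(spec):
    """Parse 'start-end', 'start-' or '-end' into (start, end); None means open."""
    start, sep, end = spec.partition("-")
    if not sep:
        raise ValueError(f"not a range specification: {spec!r}")
    return (int(start) if start else None, int(end) if end else None)


def in_ranges(position, ranges):
    """Return True if position lies in at least one (start, end) interval."""
    return any(
        (lo is None or position >= lo) and (hi is None or position <= hi)
        for lo, hi in ranges
    )
```

For instance, with the ranges 20000-30000 and 40000-50000 a SNP at position 30000 is kept and one at 35000 is skipped, while the open-ended range 50000- keeps any SNP at position 50000 or above.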

Specify a list of SNPs (-snpid)

The -snpid option can be used to specify a list of specific SNPs to analyze. The following example only carries out tests at SNPs with IDs RSID_4 and SNPID_7.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-snpid RSID_4 SNPID_7

Excluding specific SNPs (-exclude_snps)

The -exclude_snps option can be used to specify a file containing a list of SNPs that should be excluded from the analysis. The IDs in the file can be the SNP IDs (first column of the genotype file) or RS IDs (second column of the genotype file). For example, the file ./example/snps.list contains a list of the SNP IDs for the first 10 SNPs in the example data files. To exclude these SNPs from the analysis we can use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-exclude_snps ./example/snps.list

You should notice that the screen output reports that it has read in 10 SNP IDs and that the output file does not contain output for these SNPs.

Excluding SNPs that have low minor allele or predictor counts (-minimum_predictor_count)

The option -minimum_predictor_count <x> instructs SNPTEST to ignore all SNPs for which the expected count of the genetic predictor is less than <x>. In some situations this can speed up scans considerably (as rarer SNPs are often harder to fit and may be uninformative.) The minor predictor count is computed as follows:

For additive tests, it is the expected minor allele count.
For dominance tests, it is the expected count of the AB and BB genotypes, or of the AA genotype, whichever is smaller.
For recessive tests, it is the expected count of the BB genotype, or of the AA and AB genotypes, whichever is smaller.
For heterozygote tests, it is the expected count of the AB genotype, or of the AA and BB genotypes, whichever is smaller.
For general tests, it is the smallest of the expected minor allele count, the expected count of the AB genotype, and the expected count of the AA and BB genotypes.

A new column minor_predictor_count appears in the output reflecting the above count.

Variants not meeting this threshold will still be output, but the association test will be skipped and all fields reflecting association test results will be NA.
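The rules for the minor predictor count can be written out as a short Python sketch. The function name and argument order are hypothetical; the inputs are the expected genotype counts of AA, AB and BB among the samples analysed.

```python
def minor_predictor_count(n_aa, n_ab, n_bb, model="add"):
    """Compute the minor predictor count from expected genotype counts."""
    # Expected minor allele count: the smaller of the A and B allele counts.
    minor_allele = min(2 * n_aa + n_ab, n_ab + 2 * n_bb)
    if model == "add":
        return minor_allele
    if model == "dom":
        return min(n_ab + n_bb, n_aa)
    if model == "rec":
        return min(n_bb, n_aa + n_ab)
    if model == "het":
        return min(n_ab, n_aa + n_bb)
    if model == "gen":
        return min(minor_allele, n_ab, n_aa + n_bb)
    raise ValueError(f"unknown model: {model!r}")
```

As a worked example, with expected counts of 90 AA, 9 AB and 1 BB the count is 11 for an additive test, 10 for a dominance test, 1 for a recessive test, and 9 for heterozygote and general tests. A recessive test would therefore be skipped under -minimum_predictor_count 5 while the others would run.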

Combining data files with differing sets of SNPs (-overlap)

The -overlap option can be used when multiple .gen files with differing sets of SNPs are supplied with the -data option. This option will find the intersection of the SNPs in all the .gen files and test these SNPs. A restriction is that all .gen files must have SNPs ordered in position order. If this is not the case a warning will be given. In the following example the files cohort1.gen and cohort2_partial.gen, which have an overlap of 100 SNPs, are combined together.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2_partial.gen ./example/cohort2.sample \
-overlap \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1
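Conceptually, the intersection step can be sketched in Python as follows. The function name is hypothetical, and representing each SNP as a (chromosome, position) pair is an assumption of this sketch; it says nothing about how SNPTEST actually streams through the files.

```python
def overlapping_positions(cohorts):
    """Return the SNPs present in every cohort, in the order of the first.

    cohorts: list of lists of (chromosome, position) tuples, one per .gen file.
    """
    common = set(cohorts[0]).intersection(*cohorts[1:])
    return [snp for snp in cohorts[0] if snp in common]
```

Given one cohort with SNPs at positions 100, 200 and 300 and another with SNPs at 200, 300 and 400 on the same chromosome, only the SNPs at 200 and 300 would be tested.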

Testing on the X chromosome

SNPTEST v2.5 and above includes support for testing for association on the sex chromosomes. Both X and Y chromosomes are supported but we focus the discussion on the X chromosome here. There are a few complexities to bear in mind when testing on the X chromosome:

There is less data. Males have only one copy of the X, and in females only one copy is active at most loci, so there is effectively half as much data on the X chromosome relative to an autosomal locus (and consequently less power to detect modest effects.)
X inactivation. X inactivation in females occurs at an early stage of development so that the activated copy varies throughout the body (and probably within each tissue.) At most loci, inactivation is complete, but some loci show no inactivation or reduced inactivation.

Testing with -method newml

When using -method newml for case/control traits, SNPTEST ignores samples with missing sex and assumes a model of full X inactivation by default. The command

./snptest \
-data ./example/cohort1_0X.gen ./example/cohort1.sample ./example/cohort2_0X.gen ./example/cohort2.sample \
-o ./example/ex.out \
-method newml \
-frequentist 1 \
-pheno bin1

fits a logistic regression model assuming complete inactivation of one allele in females and equal effect size between males and females. In this model, male genotypes are encoded as 0 / 1 and females as 0 / ½ / 1. The estimated effect should be interpreted as the log-odds of case status for individuals carrying the 'B' allele in males, or carrying two 'B' alleles in females. In addition to association test statistics, SNPTEST will output expected genotype and allele counts for diploid and haploid samples. Computation of allele frequencies and info statistics also takes ploidy into account.

SNPTEST will ignore samples with unspecified sex as well as males that are coded wrongly. By default, sex information is taken from a column named "sex" in the sample file, and males are coded in the input file in the same way as homozygote females. The -sex_column and -haploid_genotype_coding options can be used to adjust this behaviour.
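The genotype coding under the full X inactivation model can be illustrated with a short Python sketch. The function name is hypothetical, and the rule used to detect a wrongly coded male (any probability on the heterozygote call) is an assumption made for this example.

```python
def x_predictor(sex, p_aa, p_ab, p_bb):
    """Return the genotype predictor for one sample on the X chromosome.

    Males are expected to be coded like homozygote females, so the
    heterozygote probability should be zero for them. Returns None for
    samples that would be ignored.
    """
    if sex in ("M", "MALE", "1"):
        if p_ab > 0:
            return None  # male coded as heterozygous: treated as miscoded
        return p_bb  # males are coded 0 / 1
    if sex in ("F", "FEMALE", "2"):
        return 0.5 * p_ab + p_bb  # females are coded 0 / 1/2 / 1
    return None  # unspecified sex: sample is ignored
```

Under this coding a male carrying the B allele and a female carrying two B alleles both receive a predictor of 1, while a heterozygous female receives 0.5.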

Testing with other methods

Testing with other methods is also possible, with the following caveats:

Males must be coded as homozygote females in the input file (for an X inactivation model).
Samples with unknown gender should be manually excluded using the -exclude_samples option.
Counts for haploid samples won't be provided.
Info measure computation is the same as on the autosomes and does not take ploidy into account.

The rest of the information on this page is specific to -method newml.

Specifying chromosomes

SNPTEST reads chromosome information from the input files and understands "X" or "0X" in the input data to be the non-pseudo-autosomal part of the X chromosome, "Y" or "0Y" to be the Y chromosome, and "XY" to be the pseudo-autosomal loci on the X and Y chromosomes. (The pseudo-autosomal regions are treated like autosomes.)

If chromosome data is not present in the input files, use the -assume_chromosome option to specify the chromosome.

Specifying gender

By default, gender information must be supplied in a column called 'sex' in the sample file. This can be adjusted using the -sex_column option. Currently, SNPTEST understands M or MALE to indicate a male sample and F or FEMALE to indicate a female sample. For compatibility with IMPUTE, SNPTEST also permits encoding males as 1 and females as 2.

Allowing for heterogeneity

To allow for heterogeneity between males and females, or to allow for incomplete inactivation in females, the -stratify_on option can be used. For example, the command

./snptest \
-data ./example/cohort1_0X.gen ./example/cohort1.sample ./example/cohort2_0X.gen ./example/cohort2.sample \
-o ./example/ex.out \
-method newml \
-frequentist 1 \
-pheno bin1 \
-stratify_on sex \
-cov_names sex

fits a logistic regression model with separate effects for males (coded as 0 / 1) and females (coded as 0 / ½ / 1) and separate baseline terms for males and females. (The same result can be achieved by running SNPTEST separately in males and females and meta-analysing the results.)

Note: when using -stratify_on, it is usually correct to specify the same variables to -cov_names to allow for a different baseline between strata.

Stratified testing

SNPTEST v2.5 includes a new option -stratify_on which performs an association test stratified over levels of a given discrete covariate - i.e. fitting a different effect parameter in each stratum. (Currently this option only applies when using -method newml.) Possible uses for this option might be

Allowing for heterogeneity between males and females when testing on the X chromosome.
Allowing for differences in effect between populations or ethnicities when testing in ethnically diverse sample sets.

For example, the command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-method newml \
-frequentist 1 \
-pheno bin1 \
-stratify_on cov1 \
-cov_names cov1

fits a logistic regression model with a separate effect parameter in each level of the covariate cov1. This is essentially equivalent to running association tests in each stratum separately and then meta-analysing results using an independent-effects meta-analysis. However, it may be quicker and more convenient to use -stratify_on.

Note: when using -stratify_on, you should (almost) always specify the same variables to -cov_names to allow for a different baseline between strata. In case-control settings it almost never makes sense to stratify effects but not baseline parameters.

When using -stratify_on, in addition to P-value and other columns, SNPTEST will output one effect size parameter and one standard error for each level of the covariate. For example, in the above command cov1 has two levels 0 and 1, and SNPTEST outputs variables with the following names:

Name	Value
bin2_cov1_frequentist_add_newml_beta_1:genotype/cov1=0	Effect size for the stratum with cov1 = 0
bin2_cov1_frequentist_add_newml_beta_1:genotype/cov1=1	Effect size for the stratum with cov1 = 1
bin2_cov1_frequentist_add_newml_se_1:genotype/cov1=0	Standard error of effect size for the stratum with cov1 = 0
bin2_cov1_frequentist_add_newml_se_1:genotype/cov1=1	Standard error of effect size for the stratum with cov1 = 1
bin2_cov1_frequentist_add_newml_degrees_of_freedom	Degrees of freedom in likelihood ratio test (here equal to 2)
bin2_cov1_frequentist_add_newml_pvalue	P-value from likelihood ratio test.

Sample size limits

By default, SNPTEST will refuse to test a variant if any stratum contains fewer than 100 individuals. This limit can be adjusted using the -lower_sample_limit option.

Info measures

SNPTEST v2.5 computes two types of info measure.

The IMPUTE info measure

The IMPUTE info measure reflects the information in imputed genotypes relative to the information if only the allele frequency were known. It can be written as

info = 1 - mean( variance in imputed genotype / variance if only allele frequency were known ).

The numerator of this expression is computed over the imputed genotype distribution for each sample. The denominator is computed using the estimated allele frequency

θ = ∑_i ( P(g_i=1) + 2 P(g_i=2) ) / ( 2 ∑_i ∑_g P(g_i=g) )

and the assumption of Hardy-Weinberg equilibrium.

The info measure takes the value 1 if all genotypes are completely certain, and the value 0 if the genotype probabilities for each sample are completely uncertain in Hardy-Weinberg proportions (i.e. they equal (1-θ)², 2θ(1-θ), θ²). It is also possible for info to drop below zero.

Info is usually computed as if assuming all samples are diploid and that the genotype probabilities for each sample sum to one. This is what IMPUTE computes, and also what SNPTEST computes when you use a method other than newml.
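As a worked illustration of the formula, the following Python sketch computes the info measure for diploid samples whose genotype probabilities sum to one. The function name is hypothetical and this is not SNPTEST's implementation; under the stated assumptions the denominator of θ reduces to twice the number of samples, and the variance under Hardy-Weinberg equilibrium is 2θ(1-θ).

```python
def impute_info(probabilities):
    """Compute the info measure from per-sample genotype probabilities.

    probabilities: list of (P(g=0), P(g=1), P(g=2)) tuples, one per sample,
    each assumed to sum to one. Returns None if the variant is monomorphic.
    """
    n = len(probabilities)
    # Estimated allele frequency: expected allele count over 2n chromosomes.
    theta = sum(p1 + 2.0 * p2 for _, p1, p2 in probabilities) / (2.0 * n)
    if theta <= 0.0 or theta >= 1.0:
        return None  # ratio is undefined when the expected variance is zero
    expected_variance = 2.0 * theta * (1.0 - theta)
    ratio_sum = 0.0
    for _, p1, p2 in probabilities:
        mean = p1 + 2.0 * p2
        variance = (p1 + 4.0 * p2) - mean * mean  # E[g^2] - E[g]^2
        ratio_sum += variance / expected_variance
    return 1.0 - ratio_sum / n
```

When every genotype is certain, each per-sample variance is zero and the sketch returns 1. When every sample has the Hardy-Weinberg probabilities for θ, the ratio is 1 for each sample and the result is 0, matching the two limiting cases described above.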

The IMPUTE info measure using -method newml

The assumptions of diploidy and that probabilities sum to one are generally applicable to imputed, autosomal SNPs. They may break down for typed SNPs (where missing probability data is possible) and for variants on the sex chromosomes. To deal with this, when using -method newml only, SNPTEST currently makes two modifications to the above. Firstly, missing probability data is filled in using the expected distribution given θ and the assumption of Hardy-Weinberg equilibrium. This modification implies that completely missing individuals contribute -1/n to the info measure, and in fact that

info ≤ 1 - (proportion of missing data)

affecting its interpretation at typed SNPs. Secondly, SNPTEST computes the denominator as 2 θ ( 1 - θ) for diploid samples and θ ( 1 - θ) for haploid samples, e.g. for males on the X chromosome.

When using -method newml, SNPTEST will also output columns named ..._impute_info which reflect the traditional computation outlined above.

The association test info measure

For some methods, SNPTEST also computes an association test info measure which reflects the relative information about the parameter of interest; see this pdf for details.

Other Options

Option and value(s)	Description
-hwe	This will produce an output file with columns that contain the p-values for an exact test of HWE in each cohort. If a test for a binary phenotype is carried out then HWE for all the case individuals and all the control individuals are also reported.
-chunk <x>	The program works by reading in, analyzing and writing output for chunks of the data at a time. This option is included to control the maximum amount of RAM used by the program at any one time. The default chunk size is 100 SNPs.
-log <filename>	Copy all screen output to the specified log file.
-printids	Print out each variant to the screen and/or log file before analysing it. (This is useful for debugging problems with data).
-lower_sample_limit <n>	By default, SNPTEST will refuse to run a regression if there are fewer than 100 samples in the design matrix (or, when using -stratify_on, if there are fewer than 100 samples in any stratum). This option can be used to alter this limit.

FAQ

Q : My sample file looks fine but SNPTEST says it is malformed - why?

Up to v2.5, SNPTEST would fail on files with Windows-style line endings (CR LF) when run on platforms with UNIX line endings (LF), or vice versa. The solution is to convert the line endings to LFs using either the dos2unix command or a text editor. From v2.5.1, SNPTEST should cope with files with either line ending convention.

Q : SNPTEST does not produce a p-value at my SNP.

SNPTEST sometimes fails to fit the association model at a variant. In this case it tries to produce an indication of the reason for failure in the comment column. Possible reasons are:

model_not_fit:number_of_samples_below_limit: The number of informative samples at this variant is less than the internal limit (by default 100). This limit can be adjusted using the -lower_sample_limit option. However, be aware that SNPTEST relies on asymptotic approximations which are only valid with large sample sizes.
model_not_fit:design_matrix_singular_value_below_limit: For some model fitting methods, SNPTEST will not test at a variant if the design matrix is not informative about parameters. This usually occurs if the allele frequency is very low (when there is no power to detect association) but could also happen if the variant is very strongly correlated with a covariate, or two covariates are highly correlated.

Q : I get the error "igamc underflow error" printed to the screen. What does this mean?

This error occurs at SNPs where a very small p-value from a chi-squared test needs to be calculated. SNPTEST uses the CPROB library to carry this out, and the library reports an underflow error when this occurs. In this case it returns a p-value of 0. This usually occurs when the signal of association is very strong and can sometimes indicate problems with the data. To identify which SNPs this occurs at you can use the -printids flag.

Contacting us

If you have a question about SNPTEST, please send a message to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.

What to include

If you are having a problem with the software, please try to include the following details in your e-mail (otherwise we may be unable to help):

The version of SNPTEST and the type of computer you are using to run it - e.g., "SNPTEST v2.5 on Mac OSX 10.6"
The full command-line used to run the program.
Any log files or screen output the program provides - in particular, please include the first few hundred lines of the log file which contain details about how SNPTEST interprets the data.

For difficult problems like memory access errors (e.g. "segmentation faults") we may further ask you to send data files that show the problem. These should generally be small and we can provide suggestions if you are not allowed to share your actual data.

Note: please do not send large files to the mailing list.

Version History

Version	Date	Details
2.4.1	03/07/2012	Bug fix release. Fix bug that meant that options specifying covariates (-cov_names, -cov_all, -cov_all_continuous, and -cov_all_discrete) were not respected if they appeared directly after the -condition_on option and its values.
2.4.0	13/04/2012	Minor release. There was a bug in the -overlap option which is now fixed. The -condition_on pos:NNNNN option was not working properly and this is now fixed. SNPTEST v2.4.0 can now read bgen format files that contain biallelic indels, i.e. alleles that are greater than 1 character long. QCTOOL has been updated so that gen files with indels can be converted to bgen files.
2.3.0	16/12/2011	This release can be found here. Bi-allelic indels and structural variants can now be handled. (Alleles at such loci can be more than one character long.) Quantile normalization of the phenotypes can now be carried out using the -quantile_normalise_phenotypes option. Missing values are now allowed in the missing column (3rd column) of sample files. Genotype counts, NULL call counts and missing data proportion columns in the output file, and all data summaries printed to screen, now take into account samples excluded due to having missing phenotype or covariate values as well as other exclusion criteria. Screen output has been improved to include text-based histograms of phenotypes and covariates. Support for VCF files has been added (see below). This feature is under development, so user feedback would be most welcome. We have added the -overlap option which allows multiple .gen files to be specified with differing sets of SNPs. This option will find the intersection of the SNPs (based on chromosome and basepair position) in all the .gen files and test these SNPs. A restriction is that all .gen files must have SNPs in strictly increasing order of position (after SNP exclusions). If this is not the case a warning will be given.
2.2.0	07/12/2010	This release can be found here. This is a substantial update on the previous version that implements a number of new features. A -condition_on option has been added to allow tests conditional upon other SNPs. This is useful when doing conditional analyses to look for secondary effects. A -range option has been added that allows analysis of only those SNPs whose base-pair position lies within a given set of intervals. A -summary_stats_only option has been added that produces just the summary statistics at each SNP. Continuous phenotypes are now mean-centred and scaled to have variance 1 by default. Use the -use_raw_phenotypes option to turn this off. A -mpheno option has been added that implements a Bayesian multiple phenotype test. The -snpid option can now take a list of SNP or RS IDs. The -missing_code option now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The -log option can be used to copy all screen output to a log file. Columns of type "D" (discrete covariate) in the sample file can now accept any string value (previously positive integers were required). Phenotypes and covariates can now appear in any order in the sample files. To avoid issues with incorrect file formatting, more extensive checks are now performed on the sample and gen files. SNPTEST can now process binary gen (BGEN) files; these can be produced using the QCTOOL program as described here. Support for chromosome information has been added; see the section on chromosomes. More detailed data summaries are produced in the screen output. Performance improvements.
2.1.1	01/04/2010	Minor update. This release can be found here.
2.1.0	19/03/2010	This is a major change to SNPTEST from previous versions. Please read the following carefully.
NEW FILE FORMAT. The file format used by this version has been modified. Type 1, 2 and 3 covariates have been changed to types D (discrete) and C (continuous) in the sample file. Binary phenotypes now need to be specified in the sample files using a column of 1s and 0s (1=case and 0=control). The column should be labelled B. Quantitative phenotypes should be labelled P. See the sample files example/*.sample for examples.
The -cases and -controls flags have been replaced by the -data option, i.e. all cohorts should be specified by this option. You can specify multiple gen and sample files, but you no longer divide them up into cases and controls.
There is no longer a -qt flag. To specify the phenotype, use -pheno <name>, where <name> matches the column you want to use from the sample file. SNPTEST runs logistic regression or linear regression depending on the type of phenotype you select.
There are some changes to the output and to the header line of the output file. They are straightforward: some of the column names have changed, and you get a few extra columns of output if you use a binary phenotype.
The -cov_names flag has been added so that you can specify covariates by name, e.g. -cov_names Gender will condition on the covariate named Gender. Multiple covariates can now be specified, e.g. -cov_names 1 3 will condition on covariates 1 and 3, and it does not matter if they are of different types.
There are now 3 flags that allow you to specify groups of covariates: (i) -cov_all_continuous - condition on all continuous covariates, (ii) -cov_all_discrete - condition on all discrete covariates, (iii) -cov_all - condition on all covariates.
If no association tests are specified, or -method threshold is specified, then thresholded genotype counts are reported. Otherwise, expected counts are given. The expected count for a genotype is the sum of the probabilities across all individuals in the sample. If individuals are explicitly excluded then they will not be included in the genotype counts in any way. When testing for association, if an individual has at least one missing phenotype or missing covariate that is needed for the test then their genotype will be called as NULL in the genotype counts. Samples where the sum of the genotype probabilities is less than 0.1 will also be counted as NULL at each SNP.
The -exp_counts flag has been removed.
There is a new option -method that is used to specify the method used to fit the chosen model. The new options give better results at SNPs that are rare and/or have high genotype uncertainty.
The Bayesian tests now account for genotype uncertainty and can allow covariates in the tests.
Bayesian binary trait tests now have an option to use a t-distribution prior on the genetic effect parameters. This allows more flexibility in specifying the prior beliefs about the genetic effect sizes. See options -t_prior and -t_df in the section on Bayesian Tests.
There are now Bayesian tests for quantitative traits.
There is now an option -mean_bf that calculates the weighted mean of the Bayes factors across the range of models specified. This 'model averaging' feature allows a range of models to be tested at the same time. See the section on Bayesian Tests.
There is now a Bayesian test for multiple quantitative phenotypes.
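The expected-count rule in the 2.1.0 notes (sum the genotype probabilities over individuals, and count a sample as NULL when its probabilities sum to less than 0.1) can be sketched in Python. The function name and the AA/AB/BB labels are hypothetical, the input is assumed to contain only samples that have not been explicitly excluded, and the sketch assumes that a sample counted as NULL contributes nothing to the expected counts, which the notes above do not state.

```python
NULL_THRESHOLD = 0.1  # threshold quoted in the 2.1.0 notes

def expected_genotype_counts(probabilities):
    """Sum per-sample genotype probabilities (AA, AB, BB) at one SNP.

    Samples whose probabilities sum to less than NULL_THRESHOLD are counted as NULL.
    Assumption: such samples are left out of the expected counts.
    """
    counts = {"AA": 0.0, "AB": 0.0, "BB": 0.0, "NULL": 0}
    for p_aa, p_ab, p_bb in probabilities:
        if p_aa + p_ab + p_bb < NULL_THRESHOLD:
            counts["NULL"] += 1
            continue
        counts["AA"] += p_aa
        counts["AB"] += p_ab
        counts["BB"] += p_bb
    return counts

snp_probabilities = [
    (1.0, 0.0, 0.0),    # confidently AA
    (0.5, 0.5, 0.0),    # uncertain between AA and AB
    (0.0, 0.0, 0.0),    # no call: counted as NULL
    (0.0, 0.25, 0.75),  # mostly BB
]
print(expected_genotype_counts(snp_probabilities))
```

Unlike thresholded counts, the expected counts need not be whole numbers, because each uncertain sample is split across genotypes in proportion to its probabilities.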
1.1.5	28/05/2008	This release can be found here

References

[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913.
[2] The Wellcome Trust Case Control Consortium (2007) Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678. PMID: 17554300. DOI: 10.1038/nature05911.
[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics.
