Getting started with HPTEST
HPTEST tests for association between two sets of genotypes - for example between genotypes of a host and of a pathogen infecting those hosts. To do this, it implements the binomial logistic regression model:
(1)
in which is a vector of parameters (the regression coefficients). For each predictor and outcome genetic variant, HPTEST finds the value of that maximises the corresponding likelihood, estimates its standard error (by inspecting the curvature of the likelihood near the maximum) and outputs these along with other summaries to the output file.
By default, HPTEST computes estimates under a weakly informative prior that helps to regularise the
estimates (this can be turned off with the -no-prior
option.)
For example, the following command:
$ hptest \
-outcome parasite.vcf \
-predictor host.vcf \
-s samples.sample \
-o output.csv
regresses each parasite variant (from parasite.vcf
) on each host variant (from host.vcf
).
The genotype counts, maximum posterior estimates and other quantities are output to the output file
(output.csv
). Additional options allow you to add covariates (-covariates
), to adjust the
prior being used (-prior
) or remove it entirely (-no-prior
), or to filter the list of
samples or variants included (e.g. -incl-samples
, -excl-samples-where
, or
-incl-outcome-range
) - see the list of
options for more.
Obtaining HPTEST
HPTEST is currently included as part of the QCTOOL package, which is currently hosted at code.enkre.net. To obtain it you will need to download and compile the source code (you need QCTOOL version >= 2.1.9 for this). See the download page for more details on obtaining and installing HPTEST.
Running HPTEST
In the most basic form HPTEST is run like this:
hptest -outcome outcome.vcf -predictor predictor.vcf -s samples.sample -o test.csv
where outcome.vcf
and predictor.vcf
contain genotype calls, and samples.sample
contains
information on the included samples. All three files must represent the same set of individuals and
they must be in the same order in each file. (The -reorder
option to QCTOOL can help to set this up).
For each predictor and outcome variant, HPTEST will do the following:
-
Both sets of genotypes are loaded. HPTEST can interpret both hard-called genotypes (e.g. GT field in a vcf file), and imputed genotype probabilities (e.g. GP field in a vcf file, or a BGEN format file). For outcome genotypes, imputed genotypes are thresholded to produce hard calls - by default any genotypes with probability < 0.9 are treated as missing. For the predictor genotype, HPTEST will sum over the imputed probability distribution as described in technical details.
-
Genotypes are internally tabulated. Since rare variants are likely not very useful, by default the rest of the analysis is skipped if the count of the minor allele is < 10 in either predictor or outcome genotypes.
-
Assuming the variants are common enough, HPTEST fits the above binomial logistic regression model to the two genotypes and any specified covariates. When the outcome is haploid, this is the same as a regular logistic regression. If the outcome is diploid or has higher ploidy, this is the same except that the outcome genotypes are treated as arising from multiple independent draws of a alleles with the modelled probability distribution (1).
-
HPTEST currently assumes predictor genotypes are diploid, and uses a 'general' model for the predictor by default, including both an additive and an overdominance term. Alternative models can be specified using the
-model
option. Multiple models can also be specified in the same run. -
Parameters are estimated under a weak regularising prior by default. Specifically, a prior is applied to the additive effect; this is a lot like a normal distribution with mean 0 and standard deviation around 1.87, but has somewhat fatter tails. The prior can be altered using the
-prior
option, or turned off entirely using the-no-prior
option. By default, both a Bayes factor and an approximate P-value (both computed under assumptions about the asymptotic normality of the likelihood) are output. -
If desired, covariates can be included in the analysis using the -covariates option. They are loaded from the sample file. They can be continuous (e.g. a principal component, with type
C
in the sample file) or discrete (e.g. a population label, with typeD
in the sample file.)
Interpreting HPTEST output
HPTEST prints a wealth of information to the output file (and/or to a log file specified with the -log option). A basic run looks like this:
$ hptest_v2.1.9 -predictor host.vcf -outcome parasite.vcf -s samples.sample -o test.tsv -covariates discrete1
Welcome to hptest
(version: 2.1.9, revision: 7b2ccee)
(C) 2009-2020 University of Oxford
Loaded data for 2000 samples.
Predictor data:
(not computed) "host.vcf"
(total 1 sources, number of snps not computed).
Number of samples: 2000
Outcome data:
(not computed) "parasite.vcf"
(total 1 sources, number of snps not computed).
Number of samples: 2000
Adding covariates...
++ Added covariate: "discrete1":
missing levels
0 0(995) 1(1005)
Models are:
- Model 1 ("null"): BinomialLogistic( 2000 of 2000 samples ): (outcome=1) ~ baseline/outcome=1 + discrete1=1/outcome=1
with priors:
discrete1=1/outcome=1 ~ logF( 0.0770407, 0.0770407 ).
- Model 2 ("gen"): BinomialLogistic( 2000 of 2000 samples ): (outcome=1) ~ baseline/outcome=1 + add/outcome=1 + overdominance/outcome=1 + discrete1=1/outcome=1
with priors:
add/outcome=1 ~ logF( 2, 2 ).
overdominance/outcome=1 ~ logF( 4, 4 ).
discrete1=1/outcome=1 ~ logF( 0.0770407, 0.0770407 ).
Model design for first test:
outcome baseline add overdominance discrete1=1
0 2 0 1 1 1 1
1 2 0 1 1 1 1
2 1 1 1 0 0 1
3 2 0 1 0 0 1
4 2 0 1 0 0 1
5 2 0 1 0 0 1
6 1 1 1 0 0 0
7 0 0 1 0 0 1
8 2 0 ~ 1 0 0 1
. . . . .
. . . . .
. . . . .
1994 2 0 1 0 0 1
1995 2 0 1 1 1 1
1996 2 0 1 1 1 1
1997 2 0 1 0 0 1
1998 0 0 1 0 0 0
1999 2 0 1 2 0 1
.
Testing : (10000/?,44.9s,222.7/s)
Thank you for using hptest.
In general it is always worth inspecting this output carefully to check that HPTEST has interpreted your data correctly. In particular, the above output shows that HPTEST has loaded data on 2,000 samples, has included a discrete covariate with no missing data, and is planning to fit a null model and a genotypic model alternative at each pair of variants. As a further diagnostic it also outputs the first and last few rows of the model design (the outcome and predictor levels) for the first test.
The output file looks like this:
$ cat test.tsv
# Analysis: "hptest analysis"
# started: 2021-03-28 12:48:00
#
# Analysis properties:
# $ hptest(version: 2.1.9, revision: 7b2ccee) (user-supplied)
# -covariates discrete1 (user-supplied)
# -o test.tsv (user-supplied)
# -outcome parasite.vcf (user-supplied)
# -predictor host.vcf (user-supplied)
# -s samples.sample (user-supplied)
#
predictor:alternate_ids predictor:rsid predictor:chromosome predictor:position predictor:alleleA predictor:alleleB outcome:alternate_ids outcome:rsid outcome:chromosome outcome:position outcome:alleleA outcome:alleleB N missing outcome=0 outcome=1 outcome=2 predictor=0 predictor=1 predictor=2 minimum_outcome_count minimum_predictor_count minimum_expected_predictor_allele_count minimum_expected_predictor_allele_count_genotype predictor_info null:converged null:iterations null:fit_time null:ll null:degrees_of_freedom gen:converged gen:iterations gen:fit_time gen:ll gen:degrees_of_freedom gen:beta_1:add/outcome=1 gen:beta_2:overdominance/outcome=1 gen:sd_1 gen:sd_2 gen:cov_1,2 gen:log10_bf gen:prior_mode_1 gen:se_1 gen:pvalue_1 gen:prior_mode_2 gen:se_2 gen:pvalue_2 comment
NA H1 H1 1 A G NA P1 P1 1 C T 1998 2 1805 188 5 1351 570 77 198 724 20 outcome=1/predictor=1 1 1 5 0.0000 -787.892 0 1 7 0.0000 -757.823 2 -1.60157 0.219057 0.581235 0.59473 -0.318002 12.9939 0 0.453919 0.00041819 0 0.449276 0.625848 NA
NA H1 H1 1 A G NA P2 P1 2 C T 2000 0 1501 464 35 1353 570 77 534 0 175 outcome=1/predictor=1 1 1 4 0.0100 -1571.85 0 1 4 0.0000 -1570.6 2 -0.163177 0.0487432 0.133177 0.159307 -0.0160542 -0.51645 0 0.131615 0.215046 0 0.156866 0.756005 NA
NA H1 H1 1 A G NA P3 P1 3 C T 1999 1 1347 595 57 1352 570 77 709 0 271 outcome=1/predictor=1 1 1 4 0.0000 -1868.36 0 1 4 0.0000 -1867.71 2 0.108302 -0.0544743 0.101935 0.127023 -0.00904929 -0.955588 0 0.101268 0.284863 0 0.125833 0.66508 NA
...
NA H100 H1 100 A G NA P99 P1 99 C T 2000 0 1959 40 1 1991 9 0 42 0 0 outcome=1/predictor=1 1 1 7 0.0000 -233.146 0 1 7 0.0000 -233.11 2 -0.258997 -0.128958 1.28596 0.954733 -0.187642 0.897732 0 0.523405 0.620719 0 0.258434 0.617781 NA
NA H100 H1 100 A G NA P100 P1 100 C T 2000 0 1523 451 26 1991 9 0 503 0 2 outcome=1/predictor=1 1 1 4 0.0000 -1512.89 0 1 4 0.0000 -1512.87 2 -0.0798289 -0.0398986 0.935746 0.848044 -0.56311 0.552519 0 0.419414 0.849047 0 0.209456 0.848928 NA
# Completed successfully at 2021-03-28 12:48:38
In the output file, HPTEST first outputs metadata sufficient to figure out what command generated this file, a (long!) row of column names, and then the results for each pair of predictor and outcome variants. The last line shows that HPTEST successfully completed the job.
See the page on interpreting HPTEST output for more information on interpreting the output.
Additional features
HPTEST includes a number of features to make it easy to run:
-
HPTEST has several filtering options that can be used to adjust the set of samples (e.g.
-incl-samples
or-excl-samples-where
) or variants (e.g.-incl-predictor-range
,-excl-outcome-rsids
) included in the analysis. See the page on filtering for details. -
Options to adjust or remove the priors (
-prior
,-no-prior
,-no-covariate-priors
). See the page on specifying priors for details. -
As illustrated above HPTEST can output to a variety of flat file output formats. It currently decides upon the format based on the file extension (
.csv
=> comma-separated;.txt
-> space separated;.tsv
-> tab-separated). Additionally if the output file ends with.gz
then it will be compressed with gzip. -
HPTEST can also output to the sqlite3 database format. This is the way I usually run HPTEST and is useful in several ways - see the page on sqlite output for details.