Getting started
LDBIRD computes LD metrics (i.e. the genotype correlation ) between all pairs of variants in two sets of genotypes specified on the command line, or between all pairs in one set of genotypes. It records a histogram of the LD values it sees and outputs specific pairs meeting specific thresholds. The two ways to run it are:
Run with one genotypes file:
ldbird -g1 <file.bgen> -s samples.sample -o ldbird.sqlite
This computes genotype counts and correlation () between every pair of variants in file.bgen. By default only
those with will be output, but if you specify -min-r2 0 then the number of output rows will be .
Run with two genotypes files:
ldbird -g1 <file1.bgen> -g2 <file2.bgen> -s samples.sample -o ldbird.sqlite
This computes genotype counts and correlation between every variant in file1.bgen and in file2.bgen. By default
only those with will be output, but if you specify -min-r2 0 then the number of output rows will be , where is the number of variants in file<i>.bgen.
Supported file formats
You don't necessarily need to use BGEN files - LDBIRD also supports VCF format and a subset of other formats supported by QCTOOL.
Samples must be supplied in a sample file of the type understood by QCTOOL - see the QCTOOL file formats page for details.
What you can put in
At the time of writing there are three types of genotype you can put in. First, the input files can contain haploid
genotypes. If so then the R table will contain a table of counts at both loci as well as the computed R value.
Second, diploid genotypes can be put in - in this case the table of counts is currently not output.
A third option is to input haploid genotypes that are encoded as if they were diploid. The -assume-haploid option
tells LDBIRD to convert these internally to haploid genotypes - it does this by treating heterozygous calls as missing
and converting each homozygous call to the corresponding haploid genotype.
Understanding the output
LDBIRD always outputs results to a sqlite database. The output file has the following features:
-
The analysis name, its start and stop times, and the command-line arguments are recorded in the
Analysis,AnalysisStatus, andAnalysisPropertytables. -
LDBIRD computes the frequency of each variant it sees. The variants are stored in the
Varianttable and the frequencies are stored in theFrequencytable. A convenient view,FrequencyViewis also created to link these tables together. -
The main results are stored in the
Rtable. A more convenient view,RViewis also created which links theR,Analysis, andVarianttables. -
A histogram of values is also stored in the
Histogramtable - again there is a more convenientHistogramViewview which joins this to theAnalysistable.
Using the R table
To save space in the output file, LDBIRD currently encodes correlations by the following formula:
This is stored in the encoded_r column of the R table. Because sqlite uses a variable-length integer encoding,
this value only takes up one or two bytes of space in the output file.
To convert encoded_r back to correlation, you can use the formula:
E.g. in sqlite:
$ sqlite3 ldbird.sqlite
> SELECT *, (encoded_r-1024.0)/1024.0 AS r FROM R
(NB. the form 1024.0 is needed here to force sqlite to treat the result as a floating-point number.)
The above means that LDBIRD stores a quantized version of the correlation - it is always encoded as an integer between 0 and 2048 (inclusive). Correlation is expressed with respect to the second allele of each variant - this values in the range 0-1023 represent -ve correlation between the second allele, and values in the range 1025-2048 represent positive correlation.
Controlling what comparisons are made
There are three additional options which control what variants LDBIRD computes LD for:
-
The
-min-distanceoption tells LDBIRD not to compute LD for any variants that are too close together on the same chromosome. -
The
-min-mafoption tells LDBIRD not to compute LD for any variants that have too low minor allele frequency -
The
-min-Noption tells LDBIRD to not compute LD for any pair of variants with fewer than this number of samples having non-missing genotypes (taken pairwise across the two variants). -
The
-min-N-propnoption tells LDBIRD to not compute LD for any pair of variants with fewer than this proportion of samples having of non-missing genotypes (taken pairwise across the two variants). -
Finally, the
-min-r2option tells LDBIRD not to output results in theRtable for any variants where the computed is less than the given threshold. (This is slightly different to the above options because LD is still computed and contributes to theHistogramtable).
Additionally a number of options are provided to filter the set of samples included in the analysis, or to filter the genomic ranges or IDs of variants that are included. See the page on filtering samples for more information.
Using the Histogram table
At the end of the run, LDBIRD stores a histogram of LD in the Histogram table. This is set up as follows: for each
value of encoded_r, the corresponding row of Histogram stores the number of variant pairs that were observed to
have that encoded_r value. The column encoded_r can be converted back to as shown above. The Histogram
table always has 2049 rows (or 2049 per analysis, if you run multiple analyses into the same output file) corresponding
to the 2049 possible values of encoded_r.
The Histogram table ignores any minimum value specified using -min-r2 - in cases of testing long-range LD
it will therefore likely contain a peak of LD values near (i.e. around ), while if you
test nearby variants you will get a spike near ($\text{encoded r}=2048). However, it respects the other filtering
options outlined above, i.e. only variants with sufficient frequency and levels of non-missingness will be represented.
Setting up convenience views
It is generally most useful to wrap all the above into a combined view of the frequencies and LD. We recommend setting this up by running the following SQL in the output file:
CREATE VIEW MyRView AS
SELECT A.name AS `analysis`,
V1.id AS g1_id, V1.rsid AS `g1_rsid`, V1.chromosome AS `g1_chromosome`, V1.position AS `g1_position`,
V2.id AS g2_id, V2.rsid AS `g2_rsid`, V2.chromosome AS `g2_chromosome`, V2.position AS `g2_position`,
R.*, (R.encoded_r-1024.0)/1024 AS r,
F1.frequency AS g1_frequency, F2.frequency AS g2_frequency
FROM R
INNER JOIN Variant V1 ON V1.id = g1_id
INNER JOIN Variant V2 ON V2.id = g2_id
CROSS JOIN Analysis A ON A.id = R.analysis_id
CROSS JOIN Frequency F1 ON F1.analysis_id == R.analysis_id AND F1.variant_id == R.g1_id
CROSS JOIN Frequency F2 ON F2.analysis_id == R.analysis_id AND F2.variant_id == R.g2_id;
Which can be used to produce output like this (here run on test data):
$ sqlite3 -column -header ldbird.sqlite "SELECT * FROM MyRView LIMIT 10"
analysis g1_id g1_rsid g1_chromosome g1_position g2_id g2_rsid g2_chromosome g2_position analysis_id g1_id:1 g2_id:1 N encoded_r r g1_frequency g2_frequency
--------------- ---------- ---------- ------------- ----------- ---------- ---------- ------------- ----------- ----------- ---------- ---------- ---------- ---------- ---------- ------------ ------------
ldbird analysis 1 H1 H1 1 1 H1 H1 1 1 1 1 4000 2048 1.0 0.181 0.181
ldbird analysis 1 H1 H1 1 2 H2 H1 2 1 1 2 4000 1041 0.01660156 0.181 0.14225
ldbird analysis 1 H1 H1 1 3 H3 H1 3 1 1 3 4000 1017 -0.0068359 0.181 0.19
ldbird analysis 1 H1 H1 1 4 H4 H1 4 1 1 4 4000 1051 0.02636718 0.181 0.01025
ldbird analysis 1 H1 H1 1 5 H5 H1 5 1 1 5 4000 1022 -0.0019531 0.181 0.31725
ldbird analysis 1 H1 H1 1 6 H6 H1 6 1 1 6 4000 982 -0.0410156 0.181 0.00525
ldbird analysis 1 H1 H1 1 7 H7 H1 7 1 1 7 4000 992 -0.03125 0.181 0.14725
ldbird analysis 1 H1 H1 1 8 H8 H1 8 1 1 8 4000 1022 -0.0019531 0.181 0.1465
ldbird analysis 1 H1 H1 1 9 H9 H1 9 1 1 9 4000 1026 0.00195312 0.181 0.039
ldbird analysis 1 H1 H1 1 10 H10 H1 10 1 1 10 4000 1002 -0.0214843 0.181 0.405
LDBIRD doesn't create this view by itself currently, but a future version might do so. A similar view of the Histogram table can also be made.