QCTOOL v2

Merging data

The -merge-in option can be used to merge variants in one dataset into another. For example:

$ qctool -g first.bgen -s first.sample -merge-in second.bgen second.sample -og merged.bgen

This command produces a dataset containing one record for each variant in first.bgen and one record for each variant in second.bgen; that is, it has L₁+L₂ variants, where L₁ and L₂ are the numbers of variants in the two datasets.

Data is output for the set of samples in the first dataset; any other samples in the merged-in dataset are ignored.

Controlling how samples are matched between datasets

By default, samples are matched by the first ID column in each dataset. The -match-sample-ids option can be used to change this. For example:

$ qctool -g first.bgen -s first.sample -merge-in second.bgen second.sample -og merged.bgen -match-sample-ids column1~column2

Here column1 and column2 are columns in first.sample and second.sample respectively, containing the fields to match on. We recommend that the sample file columns used for matching contain unique sample identifiers.
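The matching rule can be illustrated with a minimal Python sketch. This is not QCTOOL's implementation: the function name `match_samples`, the in-memory sample tables, and the column names `alt_id` and `other_id` are hypothetical. Two behaviours are assumptions made for illustration only: raising an error on a duplicate identifier (the text above only recommends uniqueness) and representing an unmatched sample as `None`.

```python
from typing import Dict, List, Optional


def match_samples(
    first: List[Dict[str, str]],
    second: List[Dict[str, str]],
    first_column: str,
    second_column: str,
) -> List[Optional[int]]:
    """For each sample in `first`, return the row index of its match in `second`.

    Samples of `first` with no match map to None (an assumption of this sketch).
    """
    # Index the second sample table by its matching column.
    index: Dict[str, int] = {}
    for row_number, row in enumerate(second):
        key = row[second_column]
        if key in index:
            # Assumption: treat duplicate identifiers as an error.
            raise ValueError(f"duplicate identifier {key!r} in column {second_column!r}")
        index[key] = row_number
    # The result follows the first dataset's sample order; samples that
    # appear only in `second` are never referenced, i.e. they are ignored.
    return [index.get(row[first_column]) for row in first]


first_samples = [
    {"ID_1": "s1", "alt_id": "A"},
    {"ID_1": "s2", "alt_id": "B"},
    {"ID_1": "s3", "alt_id": "C"},
]
second_samples = [
    {"ID_1": "x9", "other_id": "C"},
    {"ID_1": "x7", "other_id": "A"},
    {"ID_1": "x8", "other_id": "Z"},
]

matches = match_samples(first_samples, second_samples, "alt_id", "other_id")
print(matches)  # [1, None, 0]
```

In this toy example the sample with `other_id` equal to "Z" exists only in the second table, so it does not appear in the result, mirroring the rule that output is produced for the samples in the first dataset.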

Controlling what variants appear in the output

The -merge-strategy option controls what happens when the same variant appears in both datasets. Possible values are keep-all (the default) or drop-duplicates. For example:

$ qctool -g first.bgen -s first.sample -merge-in second.bgen second.sample -og merged.bgen -merge-strategy drop-duplicates

In this command, if the same variant appears in first.bgen and in second.bgen, only the first will be output. As when combining datasets, the -compare-variants-by option is used to control how variants are compared, and it is assumed that variants are sorted by these fields in each input dataset.
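The two strategies can be pictured as a merge of two sorted variant lists. The Python sketch below is illustrative only and is not QCTOOL's code: the function `merge_variants` and the (position, alleles) comparison key are hypothetical stand-ins for whatever -compare-variants-by selects, and it reads "only the first" as meaning the record from the first dataset.

```python
from typing import Iterator, List, Tuple

# Hypothetical comparison key: (position, alleles).
Variant = Tuple[int, str]


def merge_variants(
    first: List[Variant],
    second: List[Variant],
    strategy: str = "keep-all",
) -> Iterator[Tuple[str, Variant]]:
    """Merge two sorted variant lists, yielding (source, variant) pairs."""
    if strategy not in ("keep-all", "drop-duplicates"):
        raise ValueError(f"unknown merge strategy: {strategy!r}")
    i = j = 0
    while i < len(first) or j < len(second):
        take_first = j == len(second) or (i < len(first) and first[i] <= second[j])
        if take_first:
            variant = first[i]
            i += 1
            yield ("first", variant)
            if strategy == "drop-duplicates":
                # Skip records in the merged-in dataset that compare equal.
                while j < len(second) and second[j] == variant:
                    j += 1
        else:
            yield ("second", second[j])
            j += 1


first = [(100, "A/G"), (200, "C/T"), (300, "G/T")]
second = [(150, "A/C"), (200, "C/T")]

keep_all = list(merge_variants(first, second, "keep-all"))
deduplicated = list(merge_variants(first, second, "drop-duplicates"))
print(len(keep_all))      # 5
print(len(deduplicated))  # 4
```

With keep-all the output has 3 + 2 = 5 records. With drop-duplicates the variant at position 200 is emitted once, from the first list. The sketch relies on both inputs being sorted by the comparison key, which is the same assumption stated above for the input datasets.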

To further help disambiguate the source of data in the output file, the -merge-prefix option can also be used to add a prefix to the identifier of each merged-in variant, e.g.:

$ qctool -g first.bgen -s first.sample -merge-in second.bgen second.sample -og merged.bgen -merge-prefix "merged:"

Currently this only affects the 'alternate' identifier fields (e.g. the SNPID field of GEN or BGEN files).
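As a rough illustration of the effect, the Python sketch below applies a prefix to the alternate identifier only. The `VariantIds` class and `apply_merge_prefix` function are hypothetical names invented for this example, and the identifier values are made up. Leaving the rsid unchanged reflects the statement above that only the 'alternate' identifier fields are affected.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VariantIds:
    snpid: str  # 'alternate' identifier (e.g. the SNPID field of GEN or BGEN files)
    rsid: str   # the other identifier, left unchanged in this sketch


def apply_merge_prefix(ids: VariantIds, prefix: str) -> VariantIds:
    """Return a copy of `ids` with `prefix` prepended to the alternate identifier."""
    return replace(ids, snpid=prefix + ids.snpid)


merged_in = VariantIds(snpid="snp_0001", rsid="rs0001")
prefixed = apply_merge_prefix(merged_in, "merged:")
print(prefixed.snpid)  # merged:snp_0001
print(prefixed.rsid)   # rs0001
```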

Merging variants from one dataset into another