STITCH

Last update: January 4, 2017

Overview of STITCH

STITCH is an R program for reference-panel-free, read-aware genotype imputation from low-coverage sequencing data. STITCH takes a set of samples with sequencing reads in BAM format, together with a list of positions to genotype, and outputs imputed genotypes in VCF format.

STITCH works by modelling each chromosome in the set of samples as a mosaic of K unknown founder (ancestral) haplotypes. STITCH employs a hidden Markov model whose parameters are iteratively updated using expectation maximization. Both steps are read aware and use no external reference haplotype sets.
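To make the mosaic picture concrete, the following toy R simulation builds one chromosome by copying from K founder haplotypes and occasionally switching founder along the SNPs. It is only an illustration of the generative idea, not STITCH's code: STITCH infers unknown founders from reads, whereas here the founders are made up, and the names founders, switch_prob and simulate_mosaic are hypothetical.

```r
set.seed(1)

K <- 4              # number of founder haplotypes
n_snps <- 20        # number of SNP positions on the toy chromosome
switch_prob <- 0.1  # hypothetical per-SNP probability of redrawing the founder

# Made-up founder haplotypes: a K x n_snps matrix of 0/1 alleles.
founders <- matrix(rbinom(K * n_snps, 1, 0.5), nrow = K)

simulate_mosaic <- function(founders, switch_prob) {
  K <- nrow(founders)
  n_snps <- ncol(founders)
  state <- integer(n_snps)
  state[1] <- sample.int(K, 1)
  for (t in seq_len(n_snps)[-1]) {
    if (runif(1) < switch_prob) {
      # Redraw the founder uniformly (this may pick the same founder again).
      state[t] <- sample.int(K, 1)
    } else {
      # Otherwise keep copying from the same founder as at the previous SNP.
      state[t] <- state[t - 1]
    }
  }
  # The sampled haplotype carries, at each SNP, the allele of the founder it copies.
  alleles <- founders[cbind(state, seq_len(n_snps))]
  list(state = state, alleles = alleles)
}

hap <- simulate_mosaic(founders, switch_prob)
print(hap$state)    # which founder is copied at each SNP
print(hap$alleles)  # the resulting mosaic haplotype
```

In a hidden Markov model of this kind, the founder path (state) is the hidden sequence and the sequencing reads are the observations.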

STITCH has been tested on low coverage mouse and human data. STITCH has also been run on mouse genotyping-by-sequencing (GBS) data, although it may encounter issues when the number of reads becomes close to 0. For guidance on parameter options (like K), please see the supplementary note of Davies et al. (see Citation below).

License

STITCH is free for academic use. For commercial inquiries, please contact Robert Davies (robertwilliamdavies at gmail dot com) and Simon Myers (myers at stats dot ox dot ac dot uk).

Changelog

v1.2.4

Enable C++11 compilation
Fix bug where sample name from bam header was being grabbed from any line with @RG in it and not specifically lines starting with @RG
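The difference between the two matching rules in the @RG fix can be shown with a small R sketch. This is not STITCH's internal code; the header lines below are invented for illustration, including a @PG line whose command string happens to contain "@RG".

```r
# Invented BAM header lines (tab separated, as in SAM headers).
header <- c(
  "@HD\tVN:1.5\tSO:coordinate",
  "@RG\tID:rg1\tSM:sample1",
  "@PG\tID:bwa\tCL:bwa mem -R @RG\\tID:rg1\\tSM:sample1 ref.fa reads.fq"
)

# Loose rule: any line containing "@RG" (also matches the @PG line).
loose <- grepl("@RG", header, fixed = TRUE)

# Strict rule: only lines that start with "@RG".
strict <- grepl("^@RG", header)

# Pull the sample name (SM tag) from the strictly matched line.
sample_name <- sub(".*\tSM:([^\t]+).*", "\\1", header[strict])

print(loose)
print(strict)
print(sample_name)
```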

v1.2.3

Fix bug where the central SNP in a read was chosen at random from the SNPs in the read, rather than being the central SNP by position among the SNPs in the read
Fix bug where the final SNP from posfile wasn't being loaded from the sample BAMs and as a result was not being imputed
Fix bug where reads split into 3+ pieces were not being properly handled (e.g. long reads where sections map to multiple locations)
Faster internal handling of cigar string

v1.2.2

Reduce RAM footprint when using reference panels
Crash early if rsync is not in PATH
Added option to override default VCF output name. See vcf_output_name
Added unit tests under testthat framework

v1.2.1

Change internal system calls to reduce RAM usage
Fix bug passing through variable to subfunction

v1.2.0

Can use reference panels in IMPUTE2 format. See example script and reference_* variables
Can bundle together inputs to facilitate imputation of very large N. See inputBundleBlockSize
Example human data provided to showcase STITCH functionality. See examples script

v1.1.4

Can work off CRAM files or BAM files. To use CRAM files, see cramlist and reference variables, or see examples script
Changed GL as genotype likelihood to GP as genotype posterior probability in output VCF

v1.1.3

Changed R example script to work on provided example data
Changed default downsampleToCov to 50 to reduce likelihood of overflow at high coverage SNPs
Miscellaneous small fixes to BAM conversion script to better handle samples with very few reads
Changed default region where reads from BAM are loaded (chrStart and chrEnd) to NA, to be inferred from posfile and the region to be imputed, rather than to grab reads from the whole chromosome
Fixed ability to use high coverage validation samples (genfile) when using generateInputOnly and regenerateInput

v1.1.2

Fixed typo in header of outputted VCF

v1.1.1

Fixed bug where samples with no reads on a chromosome gave an old input format

v1.1.0

Changed default output to VCF (added option outputBlockSize to control how it is written)
Removed two package dependencies
Remove ability to write output to .gen format (remove outputGenFormat)
Add change log to README and miscellaneous other changes

v1.0.1

Added ability to use soft clipped bases

v1.0.0

Version used for paper

Downloads

Directory contents: Contains previous versions of STITCH
STITCH_1.2.4.tar.gz: STITCH program. For installation see README or example below
STITCH_README_1.2.4.txt: README file. Includes description, change log, installation information, included files, description of required or important program options and output
STITCH_1.2.4.pdf: PDF with command options available in R to run STITCH
STITCH_examples_1.2.4.R: Some examples of commands that can be used to run STITCH, illustrated on the example data
STITCH_example_2016_05_10.tgz: Example data to test installation of STITCH and demonstrate functionality
mm10_2016_10_02.fa.gz: Mouse reference fasta used for aligning example data. Useful to test CRAM functionality

Complete working example

Install R if not already installed. Install the R dependencies parallel, Rcpp and RcppArmadillo from CRAN (using the install.packages function within R), and Rsamtools from Bioconductor
Download STITCH from above. Install it by opening R and calling install.packages with the path to the downloaded STITCH tar.gz file
Download the example tar.gz from above and extract it using a command such as tar -xzvf
Run STITCH. Open R, change your working directory using setwd() to the directory where the example tar.gz was extracted, and then run STITCH(tempdir = tempdir(), chr = "chr19", bamlist = "bamlist.txt", posfile = "pos.txt", genfile = "gen.txt", outputdir = paste0(getwd(), "/"), K = 4, nGen = 100, nCores = 1). Once complete, a VCF should appear in the current working directory named stitch.chr19.vcf.gz
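The steps above can be collected into a short R sketch. It makes several assumptions: the downloaded files sit in the current working directory, BiocManager is used as the Bioconductor installer (the page itself does not say which installer to use), and the helper names install_stitch and run_stitch_example as well as the example_dir argument are hypothetical. The STITCH call itself is copied from the step above.

```r
# Steps 1 and 2: install the dependencies, then STITCH from the downloaded tarball.
install_stitch <- function(stitch_tarball = "STITCH_1.2.4.tar.gz") {
  # 'parallel' is bundled with R, so only Rcpp and RcppArmadillo
  # are requested from CRAN here.
  install.packages(c("Rcpp", "RcppArmadillo"))
  # Assumption: BiocManager is used to install Bioconductor packages.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("Rsamtools")
  # repos = NULL makes install.packages install from a local source file.
  install.packages(stitch_tarball, repos = NULL, type = "source")
}

# Step 4: run STITCH from inside the directory holding the extracted example data.
run_stitch_example <- function(example_dir) {
  old_wd <- setwd(example_dir)
  on.exit(setwd(old_wd), add = TRUE)  # restore the previous working directory
  # Assumes the STITCH package exports STITCH(), as the call above implies.
  STITCH::STITCH(
    tempdir = tempdir(), chr = "chr19",
    bamlist = "bamlist.txt", posfile = "pos.txt", genfile = "gen.txt",
    outputdir = paste0(getwd(), "/"), K = 4, nGen = 100, nCores = 1
  )
  # TRUE if the expected output VCF is now present.
  file.exists("stitch.chr19.vcf.gz")
}

# Typical use (not run here because it installs packages and needs the data):
#   install_stitch()
#   untar("STITCH_example_2016_05_10.tgz")  # step 3, equivalent to tar -xzvf
#   run_stitch_example("path/to/extracted/example")  # hypothetical path
```

Wrapping the run in a function that restores the working directory on exit keeps the rest of the R session unaffected by the setwd() call.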

Contact

If you have any problems running STITCH, please contact Robert Davies (robertwilliamdavies at gmail dot com).

Citation

If you use STITCH, please cite Davies, R. W., Flint, J., Myers, S. & Mott, R. Rapid genotype imputation from sequence without reference panels. Nat. Genet. 48, 965-969 (2016).