DESCRIPTION
===========


STITCH v1.1.4
Robert William Davies
October 2, 2016


CHANGE LOG
==========
v1.1.4 - Can work off CRAM files or BAM files. To use CRAM files, see the cramlist and reference variables, or see the example script
       - Changed GL as genotype likelihood to GP as genotype posterior probability in output VCF
v1.1.3 - Changed R example script to work on provided example data
       - Changed default downsampleToCov to 50 to reduce likelihood of overflow at high coverage SNPs
       - Miscellaneous small fixes to BAM conversion script to better handle samples with very few reads
       - Changed default region where reads from BAM are loaded (chrStart and chrEnd) to NA, to be inferred from posfile and the region to be imputed, rather than to grab reads from the whole chromosome
       - Fixed ability to use high coverage validation samples (genfile) when using generateInputOnly and regenerateInput
v1.1.2 - Fixed typo in header of outputted VCF
v1.1.1 - Fixed bug where samples with no reads on a chromosome gave an old input format
v1.1.0 - Changed default output to VCF (added option outputBlockSize to control how it is written)
       - Removed two package dependencies
       - Removed ability to write output to .gen format (removed outputGenFormat)
       - Added change log to README and made miscellaneous other changes
v1.0.1 - Added ability to use soft clipped bases
v1.0.0 - Version used for paper


INSTALLATION
============


STITCH depends on the R packages parallel, Rsamtools, Rcpp and RcppArmadillo. parallel ships with R, Rcpp and RcppArmadillo are available from CRAN, and Rsamtools is available from Bioconductor. All can be installed in the usual manner

To install STITCH once dependencies are installed, use the install.packages function in R as follows
install.packages("STITCH_1.1.4.tar.gz")
substituting in the appropriate path and version number as necessary
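
Before installing, it can help to check which of the dependencies listed above are already available. The following is a minimal sketch in base R; missing_packages is a hypothetical helper written for this README and is not part of STITCH.

```r
# Dependencies named in this README
stitch_deps <- c("parallel", "Rsamtools", "Rcpp", "RcppArmadillo")

# Hypothetical helper (not part of STITCH): return the subset of `pkgs`
# whose namespace cannot be loaded from the current R library
missing_packages <- function(pkgs) {
    available <- vapply(
        pkgs,
        function(p) requireNamespace(p, quietly = TRUE),
        logical(1)
    )
    pkgs[!available]
}

to_install <- missing_packages(stitch_deps)
if (length(to_install) > 0) {
    message("Install before STITCH: ", paste(to_install, collapse = ", "))
} else {
    message("All STITCH dependencies are available")
}
```

Any package reported here should be installed from CRAN or Bioconductor before running install.packages on the STITCH tarball.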


INCLUDED FILES
==============


STITCH_README_1.1.4.txt - This README
STITCH_1.1.4.tar.gz - Package to be installed
STITCH_1.1.4.pdf - PDF output of command line options
STITCH_1.1.4.R - Example commands for how to run STITCH in a few scenarios


PROGRAM OPTIONS
===============


Note: Only necessary and commonly used options are highlighted here. For a full list and description, please type ?STITCH in R once the library is loaded, or see the PDF

Required

chr -  What chromosome to run. Should match BAM header
posfile -  Where to find file with positions to run. File is tab separated with no header, one row per SNP, with col 1 = chromosome, col 2 = physical position (sorted from smallest to largest), col 3 = reference base, col 4 = alternate base. Bases are capitalized. STITCH only handles bi-allelic SNPs
K -  Integer, how many founder / mosaic haplotypes to use
outputdir -  What output directory to use / where output files go
tempdir - What directory to use as temporary directory. If possible, use ramdisk, like /dev/shm/
bamlist - Path to file with BAM file locations. File has one row per BAM, giving the path to that BAM. The index file for each BAM should exist in the same directory as the BAM, suffixed either .bam.bai or .bai
cramlist - Same as bamlist, but paths are to CRAM files. If used, the reference variable must also be set, giving the path to the reference fasta file
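
As an illustration of the posfile and bamlist formats described above, here is a minimal sketch in base R that writes both files and checks the posfile against the stated rules. The SNPs, the chromosome name chr19, the BAM paths and the helper check_posfile are all made up for this example and are not part of STITCH.

```r
# Toy SNPs (hypothetical): one row per SNP, positions sorted, bases capitalized
pos <- data.frame(
    chr = "chr19",
    position = c(1000L, 2500L, 7300L),
    ref = c("A", "C", "G"),
    alt = c("G", "T", "A"),
    stringsAsFactors = FALSE
)

# posfile: tab separated, no header
posfile <- file.path(tempdir(), "pos.txt")
write.table(pos, file = posfile, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# bamlist: one BAM path per row (placeholder paths)
bam_paths <- c("/path/to/sample1.bam", "/path/to/sample2.bam")
bamlist <- file.path(tempdir(), "bamlist.txt")
writeLines(bam_paths, bamlist)

# Hypothetical helper: TRUE if the file has four columns, positions sorted
# from smallest to largest, and single capitalized bases in columns 3 and 4.
# colClasses is set so that a base of "T" is not read as logical TRUE
check_posfile <- function(path) {
    x <- read.table(path, sep = "\t", header = FALSE,
                    colClasses = c("character", "integer",
                                   "character", "character"))
    bases <- c("A", "C", "G", "T")
    ncol(x) == 4 &&
        !is.unsorted(x[[2]]) &&
        all(x[[3]] %in% bases) &&
        all(x[[4]] %in% bases)
}

check_posfile(posfile)
```

Positions are stored as integers here so that write.table does not emit them in scientific notation.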

Optional

method - How to run imputation, either diploid or pseudoHaploid. The former has quadratic time complexity in K, the latter has linear time complexity in K
switchModelIteration - When set, the iteration at which to switch from pseudoHaploid to diploid. Note that one EM iteration is defined as first using the parameters to estimate hidden phase, and then using hidden phase to update the parameters. So a choice of 39 with iterations = 40 would mean 38 complete pseudo-haploid iterations, a 39th iteration of both estimating hidden phase and updating parameters, and a 40th iteration of estimating hidden phase and, from this, estimating dosages (parameter updates on the 40th iteration have no influence on dosages). Therefore, we say that a choice of 39 gives 38 pseudo-haploid iterations and 2 diploid iterations
genfile - Path to gen file with high coverage results. Leave empty for no genfile. File has a header row with a name for each sample, matching the sample names found in the BAM files. Each sample is then a tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating a missing genotype, with rows corresponding to rows of the posfile. Note that this file therefore has one more row than the posfile, which has no header
regionStart - When running imputation, where to start from
regionEnd - When running imputation, where to stop
buffer - Buffer around the region to perform imputation over. Imputation is run over bases from regionStart - buffer to regionEnd + buffer, inclusive, using 1-based positions. After imputation, the VCF is shrunk to only include positions from regionStart to regionEnd, inclusive
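chrStart - When loading from BAM, a start position before the first SNP. Default is NA, in which case it is inferred from posfile and the region to be imputed
chrEnd - When loading from BAM, an end position after the last SNP. Default is NA, in which case it is inferred from posfile and the region to be imputed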
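
The counting rule for switchModelIteration and the window implied by buffer can be written out as simple arithmetic. The sketch below is in base R; split_iterations and imputation_window are hypothetical helpers for illustration only and are not part of STITCH. The region and buffer values are made up.

```r
# Hypothetical helper: number of pseudo-haploid and diploid EM iterations,
# following the convention described for switchModelIteration
split_iterations <- function(iterations, switchModelIteration) {
    stopifnot(switchModelIteration >= 1, switchModelIteration <= iterations)
    c(
        pseudoHaploid = switchModelIteration - 1,
        diploid = iterations - switchModelIteration + 1
    )
}

# The example from the text: 38 pseudo-haploid and 2 diploid iterations
split_iterations(iterations = 40, switchModelIteration = 39)

# Hypothetical helper: 1-based, inclusive window that imputation is run over.
# Output is later trimmed back to regionStart..regionEnd
imputation_window <- function(regionStart, regionEnd, buffer) {
    stopifnot(regionStart <= regionEnd, buffer >= 0)
    c(start = regionStart - buffer, end = regionEnd + buffer)
}

imputation_window(regionStart = 1000000, regionEnd = 2000000, buffer = 50000)
```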


Output

VCF named <outputdir>stitch.<chr>.<regionStart>.<regionEnd>.vcf.gz
or, if regionStart and regionEnd are not given,
<outputdir>stitch.<chr>.vcf.gz
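
For scripts that need to locate the output afterwards, the naming pattern above can be reproduced as follows. This is a minimal sketch in base R; stitch_vcf_path is a hypothetical helper and not part of STITCH, and it assumes outputdir already ends with a path separator, since the pattern above joins outputdir and stitch directly.

```r
# Hypothetical helper: build the expected VCF path from the pattern above.
# Assumes `outputdir` ends with "/" (assumption, not confirmed by STITCH)
stitch_vcf_path <- function(outputdir, chr, regionStart = NA, regionEnd = NA) {
    if (is.na(regionStart) || is.na(regionEnd)) {
        return(paste0(outputdir, "stitch.", chr, ".vcf.gz"))
    }
    # format() avoids scientific notation such as 1e+06 in the file name
    start <- format(regionStart, scientific = FALSE, trim = TRUE)
    end <- format(regionEnd, scientific = FALSE, trim = TRUE)
    paste0(outputdir, "stitch.", chr, ".", start, ".", end, ".vcf.gz")
}

stitch_vcf_path("/data/out/", "chr19")
stitch_vcf_path("/data/out/", "chr19", regionStart = 1000000, regionEnd = 2000000)
```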