DESCRIPTION
===========


STITCH v1.2.4
Robert William Davies
January 4, 2017


CHANGE LOG
==========
v1.2.4 - Enable C++11 compilation
       - Fix bug where sample name from bam header was being grabbed from any line with @RG in it and not specifically lines starting with @RG
v1.2.3 - Fix bug where the central SNP in a read was random from SNPs in read and not the central SNP by position among SNPs in the read as it ought to have been
       - Fix bug where the final SNP from posfile wasn't being loaded from the sample BAMs and as a result not being imputed
       - Fix bug where reads split into 3+ pieces were not being properly handled (e.g. long reads where sections map to multiple locations)
       - Faster internal handling of cigar string
v1.2.2 - Reduce RAM footprint when loading reference haplotypes
       - Crash early if rsync is not in PATH
       - Added option to override default VCF output name. See vcf_output_name
       - Added unit tests under testthat framework
v1.2.1 - Change internal system calls to reduce RAM usage
       - Fix bug passing through variable to subfunction
v1.2.0 - Can use reference panels in IMPUTE2 format. See example script and reference_* variables
       - Can bundle together inputs to facilitate imputation of very large N. See inputBundleBlockSize
       - Example human data provided to showcase STITCH functionality. See examples script
v1.1.4 - Can work off CRAM files or BAM files. To use CRAM files, see cramlist and reference variables, or see examples script
       - Changed GL as genotype likelihood to GP as genotype posterior probability in output VCF
v1.1.3 - Changed R example script to work on provided example data
       - Changed default downsampleToCov to 50 to reduce likelihood of overflow at high coverage SNPs
       - Miscellaneous small fixes to BAM conversion script to better handle samples with very few reads
       - Changed default region where reads from BAM are loaded (chrStart and chrEnd) to NA, to be inferred from posfile and the region to be imputed, rather than to grab reads from the whole chromosome
       - Fixed ability to use high coverage validation samples (genfile) when using generateInputOnly and regenerateInput
v1.1.2 - Fixed typo in header of outputted VCF
v1.1.1 - Fixed bug where samples with no reads on a chromosome gave an old input format
v1.1.0 - Changed default output to VCF (added option outputBlockSize to control how it is written)
       - Removed two package dependencies
       - Remove ability to write output to .gen format (remove outputGenFormat)
       - Add change log to README and miscellaneous other changes
v1.0.1 - Added ability to use soft clipped bases
v1.0.0 - Version used for paper


INSTALLATION
============


STITCH depends on the R packages parallel, Rsamtools, Rcpp and RcppArmadillo. parallel ships with R, Rcpp and RcppArmadillo are available from CRAN, and Rsamtools is available from Bioconductor

To install STITCH once dependencies are installed, use the install.packages function in R as follows
install.packages("STITCH_1.2.4.tar.gz")
substituting in the appropriate path and version number as necessary
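
As a minimal sketch of the dependency step, the following R snippet reports which of the packages listed above are not yet installed. The helper name missing_packages is hypothetical and not part of STITCH. The commented install commands are assumptions about a current R setup, in which Bioconductor packages are installed through BiocManager

```r
# Hypothetical helper (not part of STITCH): return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

deps <- c("parallel", "Rsamtools", "Rcpp", "RcppArmadillo")
to_install <- missing_packages(deps)
print(to_install)

# Assumed install commands, to be run only for the packages reported above:
# install.packages(c("Rcpp", "RcppArmadillo"))    # CRAN
# BiocManager::install("Rsamtools")               # Bioconductor
# install.packages("STITCH_1.2.4.tar.gz", repos = NULL, type = "source")
```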


INCLUDED FILES
==============


STITCH_README_1.2.4.txt - This README
STITCH_1.2.4.tar.gz - Package to be installed
STITCH_1.2.4.pdf - PDF output of command line options
STITCH_examples_1.2.4.R - Example commands for how to run STITCH in a few scenarios


PROGRAM OPTIONS
===============


Note: Only necessary and commonly used options are highlighted here. For a full list and description, please type ?STITCH in R once the library is loaded, or see the PDF

Required

chr - Which chromosome to run. Should match the BAM header
posfile - Path to the file with positions to run. File is tab separated with no header, one row per SNP, with col 1 = chromosome, col 2 = physical position (sorted from smallest to largest), col 3 = reference base, col 4 = alternate base. Bases are capitalized. STITCH only handles bi-allelic SNPs
K - Integer, how many founder / mosaic haplotypes to use
outputdir - What output directory to use / where output files go
tempdir - What directory to use as temporary directory. If possible, use a ramdisk, like /dev/shm/
bamlist - Path to file with BAM file locations. File has one row per entry, each row being the path to a BAM file. Each BAM must have an index file in the same directory, suffixed either .bam.bai or .bai
cramlist - Same as bamlist, but the paths are to CRAM files. If used, requires the reference variable to be set to the path to the fasta file
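
The posfile and bamlist layouts described above can be sketched in R as follows. This is an illustration only: the chromosome name, positions, alleles and BAM paths are made up, check_pos is a hypothetical helper that is not part of STITCH, and the files are written to R's session temporary directory

```r
# Toy SNP table; all values are made up for illustration.
pos <- data.frame(
  chr = "chr19",
  pos = c(1001L, 1500L, 2750L),
  ref = c("A", "C", "G"),
  alt = c("G", "T", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical check (not part of STITCH) of the posfile rules described above:
# four columns, positions sorted, capitalized bases, bi-allelic SNPs.
check_pos <- function(pos) {
  stopifnot(
    ncol(pos) == 4,
    !is.unsorted(pos[[2]]),
    all(pos[[3]] %in% c("A", "C", "G", "T")),
    all(pos[[4]] %in% c("A", "C", "G", "T")),
    all(pos[[3]] != pos[[4]])
  )
  invisible(TRUE)
}
check_pos(pos)

work_dir <- tempdir()

# posfile: tab separated, no header, no quotes, one row per SNP.
posfile <- file.path(work_dir, "pos.txt")
write.table(pos, posfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# bamlist: one BAM path per row (placeholder paths).
bamlist <- file.path(work_dir, "bamlist.txt")
writeLines(c("/path/to/sample1.bam", "/path/to/sample2.bam"), bamlist)
```

The resulting paths in posfile and bamlist are what would be passed to the options of the same name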

Optional

method - How to run imputation, either diploid or pseudoHaploid, the former having quadratic time complexity in K, the latter having linear time complexity in K
switchModelIteration - When set, the iteration at which to switch from pseudoHaploid to diploid. Note that one EM iteration is defined as first using the parameters to estimate hidden phase, and then using hidden phase to update the parameters. So a choice of 39 with iterations = 40 would mean 38 complete pseudo-haploid iterations, a 39th iteration of both estimating hidden phase and updating parameters, and a 40th iteration of updating hidden phase and, from this, estimating dosages (parameter updates on the 40th iteration have no influence on dosages). Therefore, we say that a choice of 39 gives 38 pseudo-haploid iterations and 2 diploid iterations
genfile - Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the BAM file. Each subject is then a tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note that this file therefore has one more row than the posfile, which has no header
regionStart - When running imputation, where to start from
regionEnd - When running imputation, where to stop
buffer - Buffer around the region to perform imputation over. Imputation is run over positions regionStart - buffer to regionEnd + buffer, inclusive, using 1-based positions. After imputation, the VCF is shrunk to only include positions from regionStart to regionEnd, inclusive
chrStart - When loading reads from BAM, a start position before the first SNP. Default NA, in which case it is inferred from posfile and the region to be imputed
chrEnd - When loading reads from BAM, an end position after the last SNP. Default NA, in which case it is inferred from posfile and the region to be imputed
reference_haplotype_file - When initializing using a reference panel, path to haplotype files in IMPUTE2 format (see example R script for an example download)
reference_legend_file - When initializing using a reference panel, path to legend file in IMPUTE2 format
reference_sample_file - When initializing using a reference panel, path to sample file in IMPUTE2 format
reference_populations - When initializing using a reference panel, vector of populations from sample file POP column to use (see example R script for an example)
inputBundleBlockSize - How many sample input files to bundle together to reduce number of temporary files. Default NA or not used. Recommended to set to 100 or greater when using large sample sizes (> ~5000)
vcf_output_name - Override the default VCF output name with the given file name. Please note that this does not change the names of other inputs or outputs (e.g. RData, plots), so if outputdir is unchanged and multiple STITCH runs are processing the same region, they may overwrite each other's inputs and outputs
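
The arithmetic behind switchModelIteration and buffer can be sketched in R as below. Both helpers are hypothetical and not part of STITCH. iteration_split assumes that the worked example (39 out of 40 giving 38 pseudo-haploid and 2 diploid iterations) generalizes to other values, and the region coordinates are made up

```r
# Hypothetical helper: split EM iterations as described for switchModelIteration.
# Assumes the pattern of the worked example holds for other values.
iteration_split <- function(iterations, switchModelIteration) {
  stopifnot(switchModelIteration >= 1, switchModelIteration <= iterations)
  c(pseudoHaploid = switchModelIteration - 1,
    diploid = iterations - switchModelIteration + 1)
}
iteration_split(40, 39)
# pseudoHaploid = 38, diploid = 2

# Hypothetical helper: positions that are imputed versus positions kept in the VCF.
# All bounds are inclusive and 1-based, as described for buffer.
imputation_window <- function(regionStart, regionEnd, buffer) {
  list(imputed = c(regionStart - buffer, regionEnd + buffer),
       reported = c(regionStart, regionEnd))
}
imputation_window(1000000, 2000000, 50000)
```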
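
As an illustration of the genfile layout, the following R sketch writes a toy gen file for a posfile with three SNPs. The sample names and genotypes are made up, and in practice the header names would need to match the sample names in the BAM files

```r
sample_names <- c("sample1", "sample2")   # placeholders for BAM sample names
n_snps <- 3                               # number of rows in the posfile

# One column per sample, one row per SNP: 0 = hom ref, 1 = het, 2 = hom alt.
gen <- matrix(c(0L, 1L, 2L,     # sample1
                1L, NA, 0L),    # sample2, second SNP missing
              nrow = n_snps, dimnames = list(NULL, sample_names))

# Tab separated with a header row of sample names; missing genotypes written as NA.
genfile <- file.path(tempdir(), "gen.txt")
write.table(gen, genfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE, na = "NA")

# The gen file has one more row than the posfile because of its header.
stopifnot(length(readLines(genfile)) == n_snps + 1)
```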

Output

VCF named <outputdir>stitch.<chr>.<regionStart>.<regionEnd>.vcf.gz
or, if regionStart and regionEnd are not given,
<outputdir>stitch.<chr>.vcf.gz
unless vcf_output_name is used
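
A small R sketch of this naming pattern is given below. The helper expected_vcf_name is hypothetical and is not STITCH's own code. It takes the pattern literally, so outputdir is assumed to end with a path separator, and it does not cover the vcf_output_name override

```r
# Hypothetical helper mirroring the default naming pattern described above.
expected_vcf_name <- function(outputdir, chr, regionStart = NA, regionEnd = NA) {
  if (is.na(regionStart) || is.na(regionEnd)) {
    return(paste0(outputdir, "stitch.", chr, ".vcf.gz"))
  }
  # format() avoids scientific notation such as 1e+06 in the file name.
  start <- format(regionStart, scientific = FALSE, trim = TRUE)
  end <- format(regionEnd, scientific = FALSE, trim = TRUE)
  paste0(outputdir, "stitch.", chr, ".", start, ".", end, ".vcf.gz")
}

expected_vcf_name("/data/out/", "chr19", 1000000, 2000000)
# "/data/out/stitch.chr19.1000000.2000000.vcf.gz"
expected_vcf_name("/data/out/", "chr19")
# "/data/out/stitch.chr19.vcf.gz"
```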