DESCRIPTION
===========
STITCH v1.1.4
Robert William Davies
October 2, 2016


CHANGE LOG
==========
v1.1.4
- Can work off CRAM files or BAM files. To use CRAM files, see cramlist and reference variables, or see examples script
- Changed GL as genotype likelihood to GP as genotype posterior probability in output VCF

v1.1.3
- Changed R example script to work on provided example data
- Changed default downsampleToCov to 50 to reduce likelihood of overflow at high coverage SNPs
- Miscellaneous small fixes to BAM conversion script to better handle samples with very few reads
- Changed default region where reads from BAM are loaded (chrStart and chrEnd) to NA, to be inferred from posfile and the region to be imputed, rather than to grab reads from the whole chromosome
- Fixed ability to use high coverage validation samples (genfile) when using generateInputOnly and regenerateInput

v1.1.2
- Fixed typo in header of outputted VCF

v1.1.1
- Fixed bug where samples with no reads on a chromosome gave an old input format

v1.1.0
- Changed default output to VCF (added option outputBlockSize to control how it is written)
- Removed two package dependencies
- Remove ability to write output to .gen format (remove outputGenFormat)
- Add change log to README and miscellaneous other changes

v1.0.1
- Added ability to use soft clipped bases

v1.0.0
- Version used for paper


INSTALLATION
============
STITCH requires dependencies parallel, Rsamtools, Rcpp and RcppArmadillo, which can be installed in the usual manner from either CRAN or Bioconductor

To install STITCH once dependencies are installed, use the install.packages function in R as follows

    install.packages("STITCH_1.1.4.tar.gz")

substituting in the appropriate path and version number as necessary


INCLUDED FILES
==============
STITCH_README_1.1.4.txt - This README
STITCH_1.1.4.tar.gz - Package to be installed
STITCH_1.1.4.pdf - PDF output of command line options
STITCH_1.1.4.R - Example commands for how to run STITCH in a few scenarios

PROGRAM OPTIONS
===============
Note: Only necessary and commonly used options are highlighted here. For a full list and description, please type ?STITCH in R once the library is loaded, or see the PDF

Required
chr - What chromosome to run. Should match BAM header
posfile - Where to find file with positions to run. File is tab separated with no header, one row per SNP, with col 1 = chromosome, col 2 = physical position (sorted from smallest to largest), col 3 = reference base, col 4 = alternate base. Bases are capitalized. STITCH only handles bi-allelic SNPs
K - Integer, how many founder / mosaic haplotypes to use
outputdir - What output directory to use / where output files go
tempdir - What directory to use as temporary directory. If possible, use ramdisk, like /dev/shm/
bamlist - Path to file with BAM file locations. File has one row per entry, each row being the path to a BAM file. A BAM index file should exist in the same directory as each BAM, suffixed either .bam.bai or .bai
cramlist - Same as bamlist, but path is to CRAM locations. If used, requires reference variable to be set with path to fasta file

Optional
method - How to run imputation - either diploid or pseudoHaploid, the former having quadratic time complexity in K, the latter having linear time complexity in K
switchModelIteration - When selected, the iteration at which to switch from pseudoHaploid to diploid. Note that one EM iteration is defined as first using the parameters to estimate hidden phase, and secondly using hidden phase to update parameters. So a choice of 39 with iterations = 40 would mean 38 complete pseudo-haploid iterations, a 39th iteration of both estimating hidden phase and updating parameters, and a 40th iteration of updating hidden phase, and from this estimating dosages (parameter updates on the 40th iteration have no influence on dosages). Therefore, we say that a choice of 39 gives 38 pseudo-haploid iterations and 2 diploid iterations
genfile - Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the BAM file. Each subject is then a tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore that this file has one more row than the posfile, which has no header
regionStart - When running imputation, where to start from
regionEnd - When running imputation, where to stop
buffer - Buffer of region to perform imputation over. Imputation is run from regionStart - buffer to regionEnd + buffer, inclusive of both end positions, using 1-based positions. After imputation, the VCF is shrunk to only include positions from regionStart to regionEnd, inclusive
chrStart - When loading from BAM, some start position, before SNPs occur
chrEnd - When loading from BAM, some end position, far after last SNP

Output
VCF named stitch....vcf.gz or, if no regionStart and regionEnd are given, stitch..vcf.gz
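
Example: posfile and genfile layout

The posfile and genfile descriptions above can be illustrated with a small sketch in base R. It is not taken from the STITCH example script. The chromosome name, positions, sample names and file names are all made up for illustration, and the checks are only those implied by the option descriptions above.

```r
# Hypothetical example data; file names and values are illustrative only.
work_dir <- tempfile("stitch_example_")
dir.create(work_dir)

# posfile: tab separated, no header, one row per bi-allelic SNP,
# columns = chromosome, 1-based position (ascending), ref base, alt base.
pos <- data.frame(
  chr = "chr19",
  pos = c(1000L, 2500L, 4000L, 9000L),
  ref = c("A", "C", "G", "T"),
  alt = c("G", "T", "A", "C"),
  stringsAsFactors = FALSE
)
posfile <- file.path(work_dir, "pos.txt")
write.table(pos, posfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# genfile: tab separated, header row of sample names, one column per sample,
# values 0 / 1 / 2 / NA, rows in the same order as the posfile.
gen <- data.frame(
  sampleA = c(0L, 1L, 2L, NA),
  sampleB = c(1L, 1L, 0L, 2L)
)
genfile <- file.path(work_dir, "gen.txt")
write.table(gen, genfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE, na = "NA")

# Consistency checks taken from the option descriptions above.
pos_lines <- readLines(posfile)
gen_lines <- readLines(genfile)
stopifnot(
  !is.unsorted(pos$pos),                      # positions sorted ascending
  all(pos$ref %in% c("A", "C", "G", "T")),    # capitalized bases
  all(pos$alt %in% c("A", "C", "G", "T")),
  length(gen_lines) == length(pos_lines) + 1  # genfile has one extra header row
)
```

In real use the sample names in the genfile header would need to match the sample names found in the BAM files.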
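
Example: regionStart, regionEnd and buffer

The relationship between regionStart, regionEnd and buffer described above can be sketched as follows. The helper region_window is hypothetical and not part of STITCH, and the SNP positions are made up. The sketch only reproduces the arithmetic: imputation covers regionStart - buffer to regionEnd + buffer, and the output keeps regionStart to regionEnd, both inclusive.

```r
# Hypothetical helper, not part of STITCH: the buffered window that
# imputation is run over, with 1-based inclusive positions.
region_window <- function(regionStart, regionEnd, buffer) {
  c(start = regionStart - buffer, end = regionEnd + buffer)
}

# Illustrative SNP positions and region settings.
snp_pos <- c(500L, 1000L, 2500L, 4000L, 9000L)
regionStart <- 2000L
regionEnd <- 5000L
buffer <- 1500L

win <- region_window(regionStart, regionEnd, buffer)

# SNPs that take part in imputation (inside the buffered window).
imputed <- snp_pos[snp_pos >= win["start"] & snp_pos <= win["end"]]

# SNPs that remain in the VCF after it is shrunk to the core region.
reported <- snp_pos[snp_pos >= regionStart & snp_pos <= regionEnd]
```

With these values the buffered window is 500 to 6500, so four SNPs are imputed but only the two at 2500 and 4000 are kept in the output.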
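
Example: counting iterations for switchModelIteration

The iteration counting described under switchModelIteration can be checked with a short sketch. The helper count_iterations is hypothetical and not part of STITCH. It only encodes the convention stated above, under which a switch at iteration s out of n gives s - 1 pseudo-haploid iterations and n - s + 1 diploid iterations.

```r
# Hypothetical helper, not part of STITCH: number of pseudo-haploid and
# diploid iterations implied by switchModelIteration, using the counting
# convention from the option description above.
count_iterations <- function(switchModelIteration, iterations) {
  stopifnot(switchModelIteration >= 1, switchModelIteration <= iterations)
  c(pseudoHaploid = switchModelIteration - 1,
    diploid = iterations - switchModelIteration + 1)
}

# The case worked through in the description: 38 pseudo-haploid, 2 diploid.
counts <- count_iterations(39, 40)
```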