DESCRIPTION
===========
STITCH v1.2.4
Robert William Davies
January 4, 2017

CHANGE LOG
==========
v1.2.4
- Enable C++11 compilation
- Fix bug where sample name from bam header was being grabbed from any line with @RG in it and not specifically lines starting with @RG

v1.2.3
- Fix bug where the central SNP in a read was random from SNPs in read and not the central SNP by position among SNPs in the read as it ought to have been
- Fix bug where the final SNP from posfile wasn't being loaded from the sample BAMs and as a result not being imputed
- Fix bug where reads split into 3+ pieces were not being properly handled (e.g. long reads where sections map to multiple locations)
- Faster internal handling of cigar string

v1.2.2
- Reduce RAM footprint when loading reference haplotypes
- Crash early if rsync is not in PATH
- Added option to override default VCF output name. See vcf_output_name
- Added unit tests under testthat framework

v1.2.1
- Change internal system calls to reduce RAM usage
- Fix bug passing through variable to subfunction

v1.2.0
- Can use reference panels in IMPUTE2 format. See example script and reference_* variables
- Can bundle together inputs to facilitate imputation of very large N. See inputBundleBlockSize
- Example human data provided to showcase STITCH functionality. See examples script

v1.1.4
- Can work off CRAM files or BAM files.
  To use CRAM files, see cramlist and reference variables, or see examples script
- Changed GL as genotype likelihood to GP as genotype posterior probability in output VCF

v1.1.3
- Changed R example script to work on provided example data
- Changed default downsampleToCov to 50 to reduce likelihood of overflow at high coverage SNPs
- Miscellaneous small fixes to BAM conversion script to better handle samples with very few reads
- Changed default region where reads from BAM are loaded (chrStart and chrEnd) to NA, to be inferred from posfile and the region to be imputed, rather than to grab reads from the whole chromosome
- Fixed ability to use high coverage validation samples (genfile) when using generateInputOnly and regenerateInput

v1.1.2
- Fixed typo in header of outputted VCF

v1.1.1
- Fixed bug where samples with no reads on a chromosome gave an old input format

v1.1.0
- Changed default output to VCF (added option outputBlockSize to control how it is written)
- Removed two package dependencies
- Remove ability to write output to .gen format (remove outputGenFormat)
- Add change log to README and miscellaneous other changes

v1.0.1
- Added ability to use soft clipped bases

v1.0.0
- Version used for paper

INSTALLATION
============
STITCH requires dependencies parallel, Rsamtools, Rcpp and RcppArmadillo, which can be installed in the usual manner from either CRAN or Bioconductor

To install STITCH once dependencies are installed, use the install.packages function in R as follows
install.packages("STITCH_1.2.4.tar.gz")
substituting in the appropriate path and version number as necessary

INCLUDED FILES
==============
STITCH_README_1.2.4.txt - This README
STITCH_1.2.4.tar.gz - Package to be installed
STITCH_1.2.4.pdf - PDF output of command line options
STITCH_examples_1.2.4.R - Example commands for how to run STITCH in a few scenarios

PROGRAM OPTIONS
===============
Note: Only necessary and commonly used options are highlighted here.
For a full list and description, please type ?STITCH in R once the library is loaded, or see the PDF

Required
chr - What chromosome to run. Should match BAM header
posfile - Where to find file with positions to run. File is tab separated with no header, one row per SNP, with col 1 = chromosome, col 2 = physical position (sorted from smallest to largest), col 3 = reference base, col 4 = alternate base. Bases are capitalized. STITCH only handles bi-allelic SNPs
K - Integer, how many founder / mosaic haplotypes to use
outputdir - What output directory to use / where output files go
tempdir - What directory to use as temporary directory. If possible, use ramdisk, like /dev/shm/
bamlist - Path to file with BAM file locations. File has one row per entry, each row being the path to a BAM file. A BAM index file should exist in the same directory as each BAM, suffixed either .bam.bai or .bai
cramlist - Same as bamlist, but path is to CRAM locations. If used, requires reference variable to be set with path to fasta file

Optional
method - How to run imputation, either diploid or pseudoHaploid, the former having quadratic time complexity in K, the latter having linear time complexity in K
switchModelIteration - When selected, the iteration to switch from pseudoHaploid to diploid. Note that one EM iteration is defined as first using the parameters to estimate hidden phase, and second using hidden phase to update parameters. So a choice of 39 with iterations = 40 would mean 38 complete pseudo-haploid iterations, a 39th iteration of both estimating hidden phase and updating parameters, and a 40th iteration of updating hidden phase, and from this estimating dosages (parameter updates on the 40th iteration have no influence on dosages). Therefore, we say that a choice of 39 gives 38 pseudo-haploid iterations and 2 diploid iterations
genfile - Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the bam file.
  Each subject is then a tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore this file has one more row than posfile, which has no header
regionStart - When running imputation, where to start from
regionEnd - When running imputation, where to stop
buffer - Buffer of region to perform imputation over. Imputation is run from regionStart - buffer to regionEnd + buffer, inclusive of both end bases, with 1-based positions. After imputation, the VCF is shrunk to only include positions from regionStart to regionEnd, inclusive
chrStart - When loading from BAM, some start position, before SNPs occur
chrEnd - When loading from BAM, some end position, far after last SNP
reference_haplotype_file - When initializing using a reference panel, path to haplotype files in IMPUTE2 format (see example R script for an example download)
reference_legend_file - When initializing using a reference panel, path to legend file in IMPUTE2 format
reference_sample_file - When initializing using a reference panel, path to sample file in IMPUTE2 format
reference_populations - When initializing using a reference panel, vector of populations from sample file POP column to use (see example R script for an example)
inputBundleBlockSize - How many sample input files to bundle together to reduce number of temporary files. Default NA or not used. Recommended to set to 100 or greater when using large sample sizes (> ~5000)
vcf_output_name - Override the default VCF output name with this given file name. Please note that this does not change the names of inputs or outputs (e.g. RData, plots), so if outputdir is unchanged and multiple STITCH runs are processing the same region then they may over-write each other's inputs and outputs

Output
VCF named stitch....vcf.gz or if no regionStart and regionEnd is given stitch..vcf.gz unless vcf_output_name is used
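
Illustration of regionStart, regionEnd and buffer
The interaction between regionStart, regionEnd and buffer described under Optional can be sketched with a little arithmetic. The snippet below is only an illustration, written in Python rather than R for brevity. The function names imputation_window and split_positions are hypothetical and are not part of STITCH; the code assumes nothing beyond the README text (1-based positions, inclusive bounds) and does not reproduce how STITCH itself is implemented.

```python
def imputation_window(region_start, region_end, buffer):
    """Return the 1-based, inclusive (start, end) span that is imputed.

    Hypothetical helper: mirrors 'regionStart - buffer to regionEnd + buffer'.
    """
    if region_start > region_end:
        raise ValueError("region_start must not exceed region_end")
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    return region_start - buffer, region_end + buffer


def split_positions(positions, region_start, region_end, buffer):
    """Split SNP positions into those imputed and those kept in the final VCF.

    'imputed' covers the buffered window; 'reported' is the subset lying in
    regionStart..regionEnd, which is what remains after the VCF is shrunk.
    """
    lo, hi = imputation_window(region_start, region_end, buffer)
    imputed = [p for p in positions if lo <= p <= hi]
    reported = [p for p in imputed if region_start <= p <= region_end]
    return imputed, reported


if __name__ == "__main__":
    # Made-up SNP positions, standing in for column 2 of a posfile.
    snps = [900, 1000, 1500, 2000, 2100, 2600]
    imputed, reported = split_positions(snps, 1000, 2000, 100)
    print(imputed)   # [900, 1000, 1500, 2000, 2100]
    print(reported)  # [1000, 1500, 2000]
```

In this made-up example the SNPs at 900 and 2100 fall in the buffer, so they contribute to imputation but are dropped from the final VCF, while the SNP at 2600 lies outside the buffered window and is not used.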