Notes for Stampy v1.0.13   --   Gerton Lunter, June 2011
--------------------------------------------------------




1. Summary 
==========

Stampy has the following features:

- Maps single-end, paired-end and mate-pair Illumina reads to a reference
- Fast: about 10 (with BWA) or 15 hours (without) per Gbase
- Low memory footprint: 2.7 Gb shared memory for a 3Gbase genome 
- High sensitivity for indels and divergent reads, up to 10-15% 
- Low mapping bias for reads with SNPs or indels
- Well calibrated mapping quality scores 
- Input: Fastq and Fasta; gzipped or plain; SAM and BAM
- Output: SAM, Maq's map file 
- Optionally calculates per-base alignment posteriors 
- Optionally processes part of the input 
- Handles reads up to 4500 bases

At the moment SOLiD reads are not supported.  Although Stampy was
designed for Illumina reads, mapping 454 reads should be no problem.

Bug reports are highly appreciated.  If you can, please design a 
small test file that reproduces the bug, and email the precise 
command line and details of the system you ran the program on.  
However, please do read this documentation before submitting a bug 
report.



2. Building 
===========

Just type "make"

Currently the linux x86_64 platform is supported, and support for 
Mac OS-X 10.6 (x86_64 only) is experimental.  Let me know if you 
need to run Stampy on another platform.

Stampy needs Python version 2.6 or 2.7.  Both 2-byte and 4-byte
Unicode encodings are supported.  If your default python is not
2.6 or 2.7, but say python2.6 is installed, use

    make python=python2.6

For Mac, only Python 2.6 with 2-byte Unicode is supported.

Most errors that occur at this stage are related to the Python 
installation.  Here is a checklist in case something goes wrong:

- Check that python version 2.6 or 2.7 is installed on your system
  (type "python" in a shell).
- Check that the executables python2.6/python2.7 and the related 
  python2.6-config or python2.7-config can be found in your path, and 
  that they live in the same directory as the Python executable you use.
- If you get linking errors, check that the -L<path> bit of the output 
  of "python-config --ldflags" points to a directory that contains 
  libpython.  If not, your python installation is faulty; setting 
  paths properly or re-installing Python might resolve the problem.
- Make sure the installation and run-time versions of python agree.
- For Mac users, check that you're using the standard Python ("which
  python"), rather than one from a third-party distribution such as
  Fink or Darwin.  Some versions of these distributions have 
  incomplete Python installations causing problems in the linking 
  stage.



3. Quick-start guide
====================


Building a genome (.stidx) file:

     ./stampy.py --species=human --assembly=hg18_ncbi36 \
                 -G hg18 /data/genomes/hg18/*.fa.gz

Building a hash (.sthash) file:

     ./stampy.py -g hg18 -H hg18

Single-end mapping:

     ./stampy.py -g hg18 -h hg18 -M illuminareads.fastq.gz

Paired-end mapping:

    ./stampy.py -g hg18 -h hg18 -M solexareads_1.fastq \
                solexareads_2.fastq

Use BWA to speed up mapping (recommended; v0.5.6 and v0.5.7 have
been tested and work well with Stampy):

     ./stampy.py -g hg18 -h hg18 \
                 --bwaoptions="-q10 bwa-hg18-reference" \
                 -M illuminareads.fastq.gz

Set divergence for mapping to foreign reference:

     ./stampy.py -g hg18 -h hg18 --substitutionrate=0.05 \
                 -M illuminareads.fastq.gz

Set the initial insert size distribution -- only inserts within 
4sd of mean are considered for training the actual size distribution:

    ./stampy.py -g hg18 -h hg18 --insertsize=400 --insertsd=75 \
                -M solexareads_1.fastq solexareads_2.fastq

Use (post v1.3) Solexa quality scores; default is Sanger qualities:

     ./stampy.py -g hg18 -h hg18 --solexa -M illuminareads.fastq.gz

Process the 3rd of each set of 8 reads or read pairs:

     ./stampy.py -g hg18 -h hg18 --processpart=3/8 \
                 -M illuminareads.fastq.gz
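
To split one input file over several Stampy processes, one command per
part is needed.  The following Python sketch builds such command lines
from the options shown above (-g, -h, --processpart, -o and -M).  It is
an illustration only, not part of Stampy: the function name and the
part<N>.sam output names are made up, and the commands are returned as
lists, not executed.

```python
def processpart_commands(genome, hash_file, reads, parts):
    """Build one Stampy command (as an argument list) per part.

    `reads` is a list of one (single-end) or two (paired-end) files.
    The output names part1.sam, part2.sam, ... are hypothetical.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    commands = []
    for part in range(1, parts + 1):
        commands.append([
            "./stampy.py", "-g", genome, "-h", hash_file,
            "--processpart=%d/%d" % (part, parts),
            "-o", "part%d.sam" % part,
            # Input files are joined with a comma, which Stampy accepts
            # regardless of where -M appears on the command line.
            "-M", ",".join(reads),
        ])
    return commands


if __name__ == "__main__":
    for command in processpart_commands(
            "hg18", "hg18", ["illuminareads.fastq.gz"], 8):
        print(" ".join(command))
```

Each list can be passed to a job scheduler or to subprocess.call();
keep the memory requirements per running copy in mind.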



4. Running Stampy 
=================

First you have to build a genome file:

      ./stampy.py -G hg18 /data/genomes/hg18/*.fa.gz

You may provide .fa or gzipped .fa files, and the .fa files may
contain more than one chromosome or contig.  As chromosome or contig
identifier, the first word on the Fasta header line is taken.  If the
header follows NCBI formatting, the "ref" field is taken.  The genome
file has extension .stidx; building this takes a few minutes for large
genomes.

After building the genome file, you need to build a hash table:

      ./stampy.py -g hg18 -H hg18

This takes about 10 minutes for large genomes, and produces a file
hg18.sthash.  Finally, you're ready to map some reads:

     ./stampy.py -g hg18 -h hg18 -M illuminareads.fastq.gz

Note that the hash file may be large (up to 2 Gb); to improve startup
speed you may want to keep it on a local drive.  Note also that the
file system must support memory mapped files for Stampy to work; NFS
does, but e.g. GlusterFS does not.  Some file systems support memory
mapped files but become exceedingly slow; if this happens try moving
both files to a local drive.
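
Whether a directory supports memory-mapped files at all can be checked
before placing the .stidx and .sthash files there.  The sketch below
is not part of Stampy and does not reproduce how Stampy maps its
files; it only tries to memory-map a small scratch file with Python's
standard mmap module.  The function name is made up for illustration.

```python
import mmap
import tempfile


def supports_mmap(directory):
    """Return True if a scratch file in `directory` can be memory-mapped."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"stampy")
            handle.flush()
            # Length 0 maps the whole file; read-only access suffices.
            mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return mapped[:6] == b"stampy"
            finally:
                mapped.close()
    except (EnvironmentError, ValueError):
        # Missing directory, no permission, or mapping not supported.
        return False


if __name__ == "__main__":
    print(supports_mmap(tempfile.gettempdir()))
```

A True result says nothing about speed; a file system may pass this
check and still be very slow, in which case a local drive remains the
better choice.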

You need not unzip the input fastq files.  You can use .fasta files
too, using the option --inputformat=fasta.  By default Stampy assumes
that the quality scores use Sanger encoding (score 0 is '!', ASCII
33), and use the Phred (-10 log10 p) scale, not the logit
(-10 log10 p/(1-p)) scale.  If you don't entirely trust the quality
scores in the fastq
file, you can recalibrate them:

     ./stampy.py -g hg18 -h hg18 -R solexareads.fastq.gz

By default this maps 1 percent of the fastq file onto the reference,
collects statistics, and writes a *.recaldata file.  It does not write
out the new recalibrated input file, but you can ask Stampy to do
this.  If you now start the mapper again with

    ./stampy.py -g hg18 -h hg18 -M solexareads.fastq.gz

Stampy will use the .recaldata file and apply the recalibration before
mapping, which should improve the mapping quality statistics.  (Note
that if the original file is not ascii-33 based, or uses logit scores,
you need to provide the relevant options both times.)
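
The two conventions mentioned here (the ASCII offset, and the Phred
versus logit scale) can be made concrete with a few lines of Python.
This is a sketch using the standard definitions of both scales, not
code taken from Stampy; the function names are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a fastq quality string to integer scores.

    offset=33 is the Sanger convention, offset=64 the Illumina/Solexa one.
    """
    return [ord(character) - offset for character in quality_string]


def phred_to_error_prob(score):
    """Phred scale: score = -10 * log10(p)."""
    return 10.0 ** (-score / 10.0)


def logit_to_error_prob(score):
    """Logit scale: score = -10 * log10(p / (1 - p))."""
    odds = 10.0 ** (-score / 10.0)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    print(decode_qualities("!5I"))          # Sanger encoding
    print(decode_qualities("@Th", 64))      # base-64 encoding
    print(phred_to_error_prob(20), logit_to_error_prob(20))
```

The two scales nearly coincide for high scores and differ mainly for
low-quality bases; a score of 0 means p=1 on the Phred scale but p=0.5
on the logit scale.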

Paired-end mapping is done by supplying two fastq files, one for each
mate:

    ./stampy.py -g hg18 -h hg18 -M solexareads_1.fastq \
                solexareads_2.fastq

(If -M is not the last option on the command line, the input files
need to be separated by a comma rather than a space.)  It is not
required to explicitly set parameters for the insert size
distribution; Stampy automatically determines these within the first
few hundred mappings, and adapts the scoring accordingly.  The default
settings are an average separation of 250 and standard deviation 60,
which is broad enough to capture all currently used libraries within
the default 4 standard deviations.  If a library with an insert size
distribution outside this range is used, you need to change these
defaults for autocalibration to work.
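
To judge whether a library falls inside the default range, the window
of mean +/- 4 standard deviations can be computed directly.  The
sketch below only restates that arithmetic; the function names are
illustrative and not part of Stampy.

```python
def insert_size_window(mean=250.0, sd=60.0, num_sd=4.0):
    """Return the (low, high) insert sizes within num_sd deviations."""
    return (mean - num_sd * sd, mean + num_sd * sd)


def in_window(insert_size, mean=250.0, sd=60.0, num_sd=4.0):
    """True if insert_size lies inside the training window."""
    low, high = insert_size_window(mean, sd, num_sd)
    return low <= insert_size <= high


if __name__ == "__main__":
    print(insert_size_window())             # default settings
    print(in_window(600))                   # outside the default window
    print(in_window(600, mean=400, sd=75))  # inside after adjusting
```

With the defaults the window runs from 10 to 490, so a library with a
typical insert size of 600 would need --insertsize and --insertsd to be
set explicitly.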

For large genomes, it is highly recommended to use BWA to speed up 
mapping.  First, create a BWA index for the same reference genome as 
used by Stampy.  Then:

    ./stampy.py --bwaoptions="-q10 BWAindex/hg18.fa" -g hg18 \
                -h hg18 -M solexareads_1.fastq solexareads_2.fastq

See the next section for more details.



5.  Faster mapping with BWA 
============================

Stampy's speed depends on the genome size.  For a 90 Mb genome it maps
about 1000 reads per second, but for the human genome this reduces to
about 150 per second.  To speed up mapping in this case, you can use
the much faster BWA as pre-mapper.  To do this, first create an index
of the reference genome for BWA in the usual way, and specify the
location of this index using the --bwaoptions option:

    ./stampy.py --bwaoptions="-q10 BWAindex/hg18.fa" -g hg18 \
                -h hg18 -M solexareads_1.fastq,solexareads_2.fastq

This first maps all reads using BWA, and those that map with a small
(read-length-dependent) number of mismatches are output without
further processing.  All others, and those that BWA does not map, are
re-mapped using Stampy.  Because of BWA's speed, reasonably good
sensitivity, and good mapping quality calibration, it is recommended
to use BWA in almost all cases.  For very small genomes the benefit is
reduced, while for divergent references (3% or more) BWA cannot map
the majority of reads, and hybrid mapping is switched off
automatically.

Note that options to BWA must be enclosed in quotes, and must precede
the index prefix.  BWA version 0.5.7 or higher, and the -q10 option, 
are recommended.

If you're mapping Illumina fastq files with base-64 quality values,
use the --solexa option, but do not use bwa's -I option -- Stampy
will convert Q scores on the fly.

The achieved speed-up depends on the quality of the input reads; if
BWA fails to map a large proportion of reads, the speed-up will be
comparatively low.  Note that BWA on a mammalian-size genome requires
about 3Gb to run, and this memory is not shared.  The combination will
run on a 4 Gb machine, but slowly -- 8 Gb is recommended.  For multiple
processes on a single node, add 3 Gb (not: 6 Gb) per running copy.
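
For planning how many copies fit on a node, the rule of thumb given in
the FAQ (4+3n Gb of free memory for n copies mapping to a human-size
genome) can be written out as below.  The function names are
illustrative, and the numbers are the rough figures from this
document, not measurements.

```python
def recommended_memory_gb(copies):
    """Free memory (in Gb) suggested for `copies` Stampy+BWA processes."""
    if copies < 1:
        raise ValueError("need at least one running copy")
    return 4 + 3 * copies


def max_copies(free_memory_gb):
    """Largest number of copies that fits, or 0 if even one does not."""
    return max(0, int((free_memory_gb - 4) // 3))


if __name__ == "__main__":
    for copies in (1, 2, 4, 8):
        print(copies, recommended_memory_gb(copies))
    print(max_copies(32))
```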

On some filesystems, Stampy is particularly slow.  In order to conserve 
main memory, Stampy shares its two large tables across multiple 
instances running on the same shared-memory node.  This is achieved by 
memory-mapping a shared file.  On standard filesystems (NFS, Linux 
ext3/4) this works fine, but on certain higher-end distributed file 
systems this causes excessive slow-down.  The solution is to copy the 
.stidx and .sthash files to a local drive, e.g. /tmp, and share those 
copies locally rather than through the globally shared file system.




6.  Mapping divergent reads 
===========================

To calculate correct mapping qualities, Stampy needs to know the
expected divergence from the reference.  This is set with the
--substitutionrate= option.  The default is 0.001 substitutions per
site.
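
To get a feeling for what a given substitution rate means for a single
read, the sketch below computes the expected number of substitutions
and the probability of seeing exactly k of them.  It assumes that
substitutions occur independently per site and uses the Poisson
approximation; this is an illustration, not Stampy's scoring code, and
the function names are made up.

```python
import math


def expected_substitutions(read_length, substitution_rate=0.001):
    """Expected number of substitutions in a read of the given length."""
    return read_length * substitution_rate


def prob_k_substitutions(k, read_length, substitution_rate=0.001):
    """Poisson probability of exactly k substitutions in the read."""
    mean = expected_substitutions(read_length, substitution_rate)
    return math.exp(-mean) * mean ** k / math.factorial(k)


if __name__ == "__main__":
    # A 72bp read at 5% divergence from the reference.
    print(expected_substitutions(72, 0.05))
    print(prob_k_substitutions(0, 72, 0.05))
```

At the default rate most short reads carry no substitution at all,
whereas at 5% divergence a 72bp read without any substitution is rare.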

Increasing the read length, and using paired-end reads, helps mapping
divergent reads.  The following table gives an indication of the
divergence at which a reasonable proportion of reads can be correctly
mapped.  These numbers were obtained by simulation, using the human
genome as reference, and should be taken as an indication only; they
are dependent on error rates, the repetitiveness of the genome, the
insert size distribution, and local variations in divergence; in
addition no indel mutations were included.

                 | 36bp   36bp   72bp   72bp
      divergence | single paired single paired
      -----------+----------------------------
      0%         | 82%    95%    87%    96%
      3%         | 73%    91%    80%    94%
      6%         | 60%    83%    72%    92%
      9%         | 41%    56%    56%    88%
      12%        | 28%    51%    48%    80%



7.  Mapping mate pair libraries
===============================

Mate pair libraries contain a mixture of ordinary (--> <--) paired-end
reads, and mate pairs (<-- -->).  To map these, Stampy trains separate
insert size distributions for the two types of pairs, and attempts 
realignment within both regions.

To allow Stampy to accurately train the two distributions, both
distributions need to be seeded properly.  This is best done by mapping 
a small fraction of the data, starting with a very wide distribution:

  ./stampy.py -g ref -h ref --numrecords=5000 \
              --insertsize2=-2000 --insertsd2=750 -M read1.fq read2.fq

Stampy trains a distribution with reads that fall within 3 standard 
deviations of the mean, using the current estimates of each.  Stampy 
reports the estimated distributions when it's done:

  stampy: # Paired-end insert size: 266.4 +/- 91.0  (903 pairs)
  stampy: # Mate pair insert size: -1519.1 +/- 164.5  (3374 pairs)

The values -1519 and 165 can be used as new initial parameters, and
the procedure repeated to obtain more accurate approximations.
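
Repeating this procedure by hand is error-prone, so the two report
lines can be turned into options for the next round automatically.
The sketch below assumes the report format shown above, and assumes
that --insertsize/--insertsd seed the paired-end distribution and
--insertsize2/--insertsd2 the mate pair distribution.  The function
name is illustrative.

```python
import math
import re

REPORT_PATTERN = re.compile(
    r"#\s*(Paired-end|Mate pair) insert size:\s*"
    r"(-?\d+(?:\.\d+)?)\s*\+/-\s*(\d+(?:\.\d+)?)"
)


def round_half_up(value):
    """Round to the nearest integer, with halves rounded upwards."""
    return int(math.floor(value + 0.5))


def next_round_options(report_text):
    """Translate Stampy's insert size report into command line options."""
    options = []
    for kind, mean, sd in REPORT_PATTERN.findall(report_text):
        suffix = "2" if kind == "Mate pair" else ""
        options.append(
            "--insertsize%s=%d" % (suffix, round_half_up(float(mean))))
        options.append(
            "--insertsd%s=%d" % (suffix, round_half_up(float(sd))))
    return options


if __name__ == "__main__":
    report = (
        "stampy: # Paired-end insert size: 266.4 +/- 91.0  (903 pairs)\n"
        "stampy: # Mate pair insert size: -1519.1 +/- 164.5  (3374 pairs)\n"
    )
    print(" ".join(next_round_options(report)))
```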

Note that the mean mate-pair insert size is negative, corresponding to
the fact that the read mapping to the reverse strand maps to the left
of the forward-mapping read.  Protocols other than the currently
standard Illumina mate-pair protocol may need positive insert sizes.

Note also that with the current protocol, the standard deviation of
the mate pair insert size distribution is always larger than that of the
paired-end distribution, and that wider distributions slow down the
realignment process, so that mate pair mapping can be slow.

It is not recommended to use BWA for pre-mapping when mapping mate
pair reads, as BWA does not consider the alternative positions, causing
incorrect mappings in certain cases.


8.  Reporting alternative mapping positions
===========================================

Stampy can output alternative mapping locations for reads and read pairs,
using the XA tag.  By default this is switched off; it can be enabled with
the options

  --xa-max=3 --xa-max-discordant=10

to report at most 3 alternative placements for single reads and concordant
read pairs, and at most 10 for discordant read pairs (these are the BWA
defaults).

The XA tag reports, for each additional hit, the chromosome name, position 
and strand, CIGAR string, and the number of base mismatches plus the total 
length of any insertions or deletions.
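
For downstream scripts, the XA tag can be split into its parts as
sketched below.  The sketch assumes that Stampy follows BWA's layout
for this tag, XA:Z:chr,[+-]pos,CIGAR,NM; repeated once per hit, which
matches the fields listed above but is not spelled out in this
document.  The AltHit name is made up for illustration.

```python
from collections import namedtuple

AltHit = namedtuple("AltHit", "chrom strand pos cigar edit_distance")


def parse_xa(tag):
    """Parse an XA:Z: tag into a list of AltHit records."""
    if not tag.startswith("XA:Z:"):
        raise ValueError("not an XA tag: %r" % tag)
    hits = []
    for entry in tag[5:].split(";"):
        if not entry:
            continue  # the tag ends with a trailing semicolon
        # Split from the right, in case the chromosome name has commas.
        chrom, signed_pos, cigar, distance = entry.rsplit(",", 3)
        strand = "-" if signed_pos.startswith("-") else "+"
        hits.append(
            AltHit(chrom, strand, abs(int(signed_pos)), cigar, int(distance)))
    return hits


if __name__ == "__main__":
    for hit in parse_xa("XA:Z:chr1,+12345,36M,1;chr2,-678,30M2I4M,3;"):
        print(hit)
```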

Note that when using BWA for pre-mapping, there is no need to set BWA's -n 
or -N options; Stampy automatically sets these to the correct values.


9.  Testing Stampy
==================


Stampy includes test code that generates reads under an empirical read
error model and introduces SNPs and indel variants.  From the output,
stampy estimates the sensitivity for mapping reads back to their
correct location, and it tabulates results by mapping posterior score
to see if these are well calibrated.

To run these tests, you need to create a genome file and corresponding
hash file, and provide one or two .fastq files with short reads and
quality scores.  Stampy will use the length and qualities of these
reads, but generate sequences from random locations in the genome
provided.

You run a test by using the -T command:

    ./stampy.py -g hg18 -h hg18 -T solexareads_1.fastq,solexareads_2.fastq

This uses paired-end reads; single-end reads are used if just one
.fastq file is provided.  The following options are useful in test
mode:

   --substitutionrate=S           Introduce an expected fraction S of
                                  Poisson-distributed substitutions
                                  (default: 0.001)
   --insertsize=N                 Set the mean insert size for
                                  paired-end reads (default: 250)
   --insertsd=N                   Set the standard deviation of the
                                  insert size distribution (default: 60)
   --numrecords=N                 Only map the first N reads, or read
                                  pairs (default: all)
   --simulate-minindellen=N       Set the lower bound for simulated indel
                                  lengths (default: 0)
   --simulate-maxindellen=N       Set the upper bound for simulated indel
                                  lengths (default: 0)
   --simulate-duplications        Introduce duplications rather than
                                  insertions of random sequence (default)
   --simulate-numsubstitutions=N  Introduce N substitutions in each read,
                                  rather than a Poisson-distributed number

The settings of the first three options are used in simulations, and
again when computing mapping statistics from the mapped reads.  These
options are therefore also meaningful outside of testing.

The default for --insertsd is generous, to train the model on a wide
range of input data.  For simulations this needs to be adjusted.

Simulated indels are generated by drawing the indel length from a
uniform distribution.  This is not meant to be close to a real
distribution, but is useful for testing the behaviour of stampy
conditional on indels being present.  Negative indel lengths are
deletions, positive ones are insertions.  When single-end reads are
used, each read will contain a simulated indel; for paired-end reads,
one of the two mates will contain an indel.
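
The sign convention and the uniform draw can be illustrated as
follows.  This sketch is not Stampy's simulation code; in particular
it assumes that both bounds are inclusive, which this document does
not state.  The function names are illustrative.

```python
import random


def draw_indel_length(min_len, max_len, rng=random):
    """Draw an indel length uniformly from [min_len, max_len]."""
    if min_len > max_len:
        raise ValueError("lower bound exceeds upper bound")
    return rng.randint(min_len, max_len)


def describe_indel(length):
    """Negative lengths are deletions, positive ones insertions."""
    if length < 0:
        return "deletion of %d bp" % -length
    if length > 0:
        return "insertion of %d bp" % length
    return "no indel"


if __name__ == "__main__":
    generator = random.Random(2011)  # fixed seed for repeatable output
    for _ in range(5):
        print(describe_indel(draw_indel_length(-10, 10, generator)))
```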



10.  Mapping output 
===================


Output is written to stdout by default.  An output file can be chosen
with the -o or --output= option.  A number of formats can be chosen
using the -f (or --format=) option:

 -f sam            :  SAM format (default)
 -f maqtxt         :  Maq's text output format (produced by 'maq mapview')
 -f maqmap         :  Maq's binary .map format, new version (long reads)
 -f maqmapShort    :  Maq's binary .map format, old version (short reads)
 -f maqmapShortN   :  Maq's binary .map format, old version, including 
		      variant positions (produced with 'maq map -N')

By default all reads are represented in the output.  The default
output format is the SAM format.  See below for details on the various
formats.

The SAM format is the most comprehensive format, and is recommended.
The Samtools program (samtools.sourceforge.net) is recommended for
dealing with .SAM files.

Several tools have been developed to use Maq's .map format as well,
including those that are included with the Maq package, and therefore
the .map format was included as a convenience.  However, Maq's .map
format cannot represent all useful information.  In particular indels
are better represented in the SAM format.



10.1.  A note on likelihoods and posteriors

The single most often used statistic to judge the trustworthiness of a
read map location is its "mapping quality".  This is an approximation
of the probability that a read is mapped to the wrong location
(represented as a Phred score).

The probabilistic model used by Stampy is a hybrid of three models:

  (1) a Bayesian model, which considers all candidate locations
      weighted by their likelihood.
  (2) an error model, which considers the possibility that read errors
      cause reads to be incorrectly mapped
  (3) a random model, which predicts how well a random sequence would
      match to the genome

The first model deals with errors due to repetitive and
nearly-repetitive sequence, and assumes that the correct mapping
location was considered among the candidates.  The second model
estimates the probability that the correct candidate was missed
because of (single-nucleotide) read errors.  The third model acts as a
post-hoc filter, and assesses whether a candidate location looks
better than a random best match.
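
The idea behind the first (Bayesian) model can be sketched in a few
lines.  The code below is a simplified illustration and not Stampy's
implementation: it ignores the error model and the random model, takes
as input one Phred-scaled likelihood score per candidate location
(lower is better), and caps the result at 60, which is an arbitrary
choice made for this example.

```python
import math


def mapping_quality(candidate_scores, cap=60):
    """Phred-scaled probability that the best candidate is wrong.

    candidate_scores: Phred-scaled likelihoods, one per candidate
    location.  Assumes the correct location is among the candidates.
    """
    if not candidate_scores:
        raise ValueError("need at least one candidate")
    best = min(candidate_scores)
    # Likelihoods relative to the best candidate, which gets weight 1.
    weights = [10.0 ** (-(score - best) / 10.0) for score in candidate_scores]
    total = sum(weights)
    prob_wrong = (total - 1.0) / total
    if prob_wrong <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(prob_wrong))))


if __name__ == "__main__":
    print(mapping_quality([10]))          # unique candidate
    print(mapping_quality([10, 10]))      # perfect two-copy repeat
    print(mapping_quality([10, 40]))      # second-best is 30 units worse
```

A perfect repeat gives a mapping quality of about 3 (a 50% chance of
being wrong), while a second-best candidate that is 30 Phred units
worse gives a mapping quality of about 30.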

Together these models capture most of the error modes of read mapping.
The most obvious exception is that the error model does not consider
indel errors or mutations; these often lead to the correct candidate
location being missed, particularly for short single-end reads.  The
"best" map that results is often caught by the third model; however if
the sequence is mildly repetitive, it will also pass this filter.

As a result, mapping quality is well calibrated for almost all cases,
except short single-end maps of reads that contain indel mutations, in
which case the mapping quality is overly optimistic by about an order
of magnitude.  Consequently, indels that are supported by single-end
maps only should be treated with caution.

Stampy computes posteriors and likelihoods both for pairs of reads,
and for reads considered by themselves.  The following table
summarizes the SAM tags for these statistics for paired reads:


	  Read                     | Posterior   | Likelihood |
	  -------------------------+-------------+------------+
	  Pair                     | MAPQ column | PQ:i:      |
	  This read as single read | SM:i:       | UQ:i:      |
	  Mate as single read      | MQ:i:       | XQ:i:      |

For single reads, only the MAPQ column and SM:i: tag are present.
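
The statistics in this table can be collected from a SAM record with a
short script.  The sketch below relies only on the SAM layout (MAPQ is
the fifth column, optional fields start at the twelfth); the function
name and the example record are made up for illustration.

```python
QUALITY_TAGS = ("SM", "MQ", "PQ", "UQ", "XQ")


def read_quality_tags(sam_line):
    """Return MAPQ and any SM/MQ/PQ/UQ/XQ tags of one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    result = {"MAPQ": int(fields[4])}
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if tag in QUALITY_TAGS and value_type == "i":
            result[tag] = int(value)
    return result


if __name__ == "__main__":
    record = "\t".join([
        "read1", "99", "chr1", "100", "60", "36M", "=", "300", "236",
        "ACGT", "IIII", "PQ:i:12", "SM:i:40", "UQ:i:5", "MQ:i:38",
        "XQ:i:7", "NM:i:1",
    ])
    print(read_quality_tags(record))
```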



10.2.  SAM output format

This tab-delimited output format is described in detail on
samtools.sourceforge.net.  A few things to note:

- The MAPQ field is the phred-scaled estimated probability that the
  read was mapped to the wrong location.  See 10.1 for
  details.

- In addition to the mapping quality (the probability that a read
  was mapped incorrectly), Stampy also reports the read likelihood: 
  the likelihood that a read was produced from the reference, 
  conditional on the mapping location being correct.  This score is 
  the sum of phred qualities on mismatching sites, and includes 
  probabilities for indels and read separation as well.  The single-
  read, paired-read and mate likelihoods are reported in the
  optional "UQ", "PQ" and "XQ" fields.  See 10.1 for details.

- The SAM format requires that paired reads share identifiers; if a
  trailer like "/1" is present, it will be removed from the identifier
  to conform to the standard.

- If an identifier contains spaces, only the first word will be used
  (and paired-end /1, /2 trailers silently added if necessary).  The
  option --keeplabel changes this behaviour, and instead replaces spaces
  by underscores.  It is not possible to keep spaces, since BWA also 
  only uses the first word.

- The "proper pair" flag bit (value 2) is set if two reads are
  correctly oriented, and their separation is within 5 standard
  deviations from the mean

- The "mate unmapped" bit (value 8) is never set by itself; pairs are
  either both mapped or both unmapped

- The "UQ" optional field (single read likelihood) is always present.

- The "PQ" and "XQ" fields are always present for paired-end reads,
  and represent the paired likelihood (which includes terms for
  mismatches, indels and read separation), and the likelihood of the
  mate considered as a single read, respectively.

- The optional "SM" and "MQ" fields (mapping quality of the read or
  its mate, considered as a single read) are always present for
  paired-end reads
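
The flag bits mentioned in these notes can be decoded as sketched
below.  The bit values are those of the SAM format; the separation
check merely restates the 5 standard deviation rule given above and is
not Stampy's own code.  The function names are illustrative.

```python
PROPER_PAIR = 0x2
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def describe_flag(flag):
    """Decode the pairing-related bits of a SAM flag value."""
    return {
        "proper_pair": bool(flag & PROPER_PAIR),
        "read_unmapped": bool(flag & READ_UNMAPPED),
        "mate_unmapped": bool(flag & MATE_UNMAPPED),
    }


def is_proper_separation(separation, mean, sd, num_sd=5.0):
    """True if the separation is within num_sd deviations of the mean."""
    return abs(separation - mean) <= num_sd * sd


if __name__ == "__main__":
    print(describe_flag(99))    # paired, proper pair, both mapped
    print(describe_flag(77))    # paired, both reads unmapped
    print(is_proper_separation(500, mean=250, sd=60))
```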



10.3.  Maq output format

This format is described here:
http://maq.sourceforge.net/maq-manpage.shtml (see under mapview).

- The mapping quality field (column 7) includes terms for mismatches,
  indels, and (for pairs) read separation.

- The single-end and alternative quality fields (column 8 and 9)
  include terms for mismatches and indels
 
- Column 11, "sum of qualities of mismatched bases of the best hit",
  is the single-read likelihood, and also includes terms for indels.

- Column 12 and 13, "number of 0-mismatch hits of the first 24bp" and
  "number of 1-mismatch hits of the first 24bp on the reference", are
  given dummy values, chosen such that filtering based on these values
  would give roughly the same results as with Maq.  However it is highly
  recommended to only use posteriors for filtering on uniqueness.

- When a read contains an indel, the flag column is set to 130 (like
  Maq does) when the read is part of a pair, and 128 (unlike Maq) when
  it is a singleton.  In these cases columns 7 and 8 contain the
  position and length of the indel, just like Maq would do.  In case of
  paired reads, the mate pair does not have to have a flag byte 18.

- Unlike maq, unmapped reads are included in the output, and map to
  the first chromosome, position 0.  In a large genome it is unlikely
  that any read remains unmapped though; by default no hard filtering is
  applied. Both the text and binary formats can be produced, and the
  latter can optionally include information on the location of variants
  (-f maqmapShortN).  Take care to redirect standard output when choosing the
  binary format.



11. Hints and features
======================


11.1 Ordering of chromosomes in SAM output

  The SAM/BAM format does not specify the order in which the reference
  sequences appear in the SAM header.  By default Stampy orders them
  alphabetically.

  Some downstream tools require the references to appear in the same
  order as in the original .fa input file.  To enable this behaviour,
  use the --keepreforder option while mapping.  

  Note that .stidx files created before Stampy v1.0.4 will not work
  correctly; you need to re-create the .stidx file.

11.2 Parsing NCBI fasta files

  NCBI fasta files use a >gi|nnn|ref|xxx identifier.  By default
  Stampy parses this and uses only the "xxx" part.  Use --noparseNCBI
  to switch off this behaviour and use the full NCBI identifier.
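
  The effect of this parsing on a header line can be sketched as
  follows.  This is an approximation written for illustration, not
  Stampy's parser: it takes the first word of the header and, if
  requested, returns the field following "ref".  The function name is
  made up.

```python
def contig_identifier(header_line, parse_ncbi=True):
    """Derive a contig identifier from a Fasta header line."""
    words = header_line.lstrip(">").split()
    if not words:
        raise ValueError("empty Fasta header")
    first_word = words[0]
    if parse_ncbi:
        parts = first_word.split("|")
        # Look for a "ref" field that is followed by a non-empty value.
        if "ref" in parts[:-1]:
            value = parts[parts.index("ref") + 1]
            if value:
                return value
    return first_word


if __name__ == "__main__":
    header = ">gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1"
    print(contig_identifier(header))
    print(contig_identifier(header, parse_ncbi=False))
    print(contig_identifier(">chr1 some description"))
```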

11.3 Base-level alignment qualities

  The placement of an indel into an alignment is uncertain, because of
  actual ambiguity, polymorphisms, and read errors.  A probabilistic
  alignment considers all possibilities (under a suitable model) and
  this information can be used to compute a posterior score.  To 
  have Stampy compute this, use the --alignquals option.

  Note that this increases the output file size, and also increases 
  the runtime, since the required forward and backward iterations 
  take time.


12. License
===========

This is a release version.  Permission is granted for the normal 
use of the program and its output in an academic setting, including
in publications.  If the program is used to generate data for a 
publication, you must cite the following paper:

  G. Lunter and M. Goodson.  Stampy: A statistical algorithm for 
  sensitive and fast mapping of Illumina sequence reads.  Genome
  Res. 2011 21:936-939.

The program itself may not be modified in any way, and may not be
reverse-engineered.

This license does not allow the use of this program for any 
commercial purpose.  If you wish to use this program for commercial 
purposes, please contact the author.

No guarantees are given as to the program's correctness, or the 
accuracy or completeness of its output.  The author accepts no 
liability for damage or otherwise following from using and 
interpreting the output of this program.  The software is supplied 
"as is", without obligation by the author to provide any services 
or support.



13. Revision history
====================


1.0.5, rev 786 (1 October 2010)  
       - first release

1.0.6, rev 803 (20 October 2010)
       - added NM (number of mismatches) tag
       - added support for sorted BAM/SAM input
       - bugfix: unmapped reads were sometimes reversed

1.0.7, rev 827 (3 November 2010)
       - added FAQ section to README
       - bugfix: BWA support was broken in 1.0.6

1.0.8, rev 828 (4 November 2010)
       - bugfix: occasional undefined variable reference

1.0.9, rev 852 (22 November 2010)
       - added --ignore-improper-pairs option, useful when
         remapping broken BAM files
       - bugfix: memory leak when mapping with BWA
       - bugfix: improper read pairing when remapping part 
         of a paired BAM file 
       - bugfix: v1.0.7/8 occasionally printed debug output

1.0.10, rev 854 (23 November 2010)
       - added --bwatmpdir option
       - better adherence to SAM v1.3 format specification;
         added option --referenceuri
       - bugfix: BWA paired-end mapping broken in 1.0.9

1.0.11, rev 880 (22 December 2010)
       - Faster, particularly for longer reads
       - Improved reporting of errors
       - Base Alignment Quality scores (BAQ, --baq)
       - Improved scoring of low-quality inserted bases
       - Improved indexing for references with many contigs
       - Fixed a bug requiring index and hash to be writeable
       - Fixed a bug throwing occasional segmentation faults
         for references with many contigs

1.0.12, rev 1010 (20 March 2011)
       - Improved specificity in low complexity regions
       - More informative error messages
       - Bugfix: underflow could cause negative BAQ values
       - Bugfix: all-N reads sometimes caused segfaults
       - Bugfix: -P sometimes failed to parse read label
       - Bugfix: Cigars sometimes started with I..D..

1.0.13, rev 1157 (16 June 2011)
       - Added support for mate pairs
       - Added XA (alternative mappings) output tag
       - Added optional tagging of BWA-mapped reads
       - Bugfix: simulation of duplications did not work
       - Bugfix: estimation of insert sizes improved



14. FAQ
=======

1. Q: Stampy complains that there are no input files, if I don't 
      use a comma between the file names
   A: Use -M as the last option on the command line to avoid this.


2. Q: Stampy starts but doesn't produce data for hours - is this
      normal?
   A: No. Stampy loads its index files in memory, and uses memory
      mapping to allow the memory to be shared between instances
      of Stampy.  However, this does not work on all file systems
      (notably, the Sanger file system).  To avoid the problem,
      copy the .stidx and .sthash files to a local drive, e.g. /tmp


3. Q: Using Stampy with BWA seems to be slow, and it's supposed
      to be fast?
   A1: Make sure you run BWA (use --bwaoptions="...")
   A2: You might be running out of memory.  BWA and Stampy each 
      need 3 Gb of memory.  To run n copies side-by-side, and 
      mapping to a human-size genome, 4+3n Gb of free memory is 
      recommended.
   A3: Note that multithreading (BWA's -t option) has negligible
      effect.  Stampy cannot be multithreaded at this moment.


4. Q: OK, Stampy can't be multithreaded at the moment.  What do I
      do to make it faster?
   A: You can run multiple copies side by side (but see the answer 
      to question 3), and have each copy process part of the file,
      using the option --processpart=N/M  (1 <= N <= M).  The
      resulting SAM files can be merged using e.g. Samtools.


5. Q: Stampy exits with "Fatal Python error: PyImport_GetModuleDict:
      no module dictionary".  Should I worry?
   A: This is an obscure Python bug that occurs rarely at shutdown.  
      As long as Stampy says "Stampy: Done", it finished OK.


6. Q: Stampy cannot find "python*-config.py" during installation;
      how can I solve this?
   A: This happens on Ubuntu installations, where python*-config is
      not always installed.  Try installing the "python-dev" 
      package.  If that fails, try installing python from source 
      (be sure to do "make install", not just "make")


7. Q: I'm getting "Suspiciously high Q scores" warnings, but I'm
      sure the fastq data is fine.
   A: Your fastq data is in Solexa format (base-64), while Stampy
      defaults to Sanger format (base-33).  Use the option --solexa.
