eXpress

 

Streaming quantification for high-throughput sequencing

Manual

Documentation

Complete documentation for the source code is available in both HTML and PDF formats.

Usage

Prerequisites

eXpress runs on Intel-based computers running Linux, Mac OS X, or Windows. You can install pre-compiled binaries or build eXpress from the source code. If you wish to build eXpress yourself, you must have a C++ compiler installed (for example, Xcode for Mac OS X or Visual C++ Express for Windows 7) as well as CMake, BamTools, and the Boost C++ libraries. See the Installation section on the Getting Started page for detailed instructions.

Running eXpress

Run eXpress from the command line as follows:

    $ express [options]* <target_seqs.fasta> <aligned_reads.(sam/bam)>

The following is a detailed description of the options used to control eXpress:

Arguments:
<target_seqs.fasta> A file of target sequences in multi-FASTA format. See Input Files for more details.
<lib_1.sam,lib_2.sam,...,lib_N.sam> A comma-separated list of filenames for reads aligned to the target sequences in SAM format. See Input Files for more details.
Standard Options:
-h/--help Prints the help message and exits
-o/--output-dir <string> Sets the name of the directory in which eXpress will write all of its output. The default is "./".
-B/--additional-batch <int> Specifies the number of additional batch EM rounds to perform on the data using the initial results from the online EM as a seed. Can improve accuracy at the cost of time.
-O/--additional-online <int> Specifies the number of additional online EM rounds to perform on the data after the initial online round. Can improve accuracy at the cost of time.
-m/--frag-len-mean <int> Specifies the mean fragment length. While the empirical distribution is estimated from paired-end reads on-the-fly, this value parameterizes the prior distribution. If only single-end reads are available, this prior distribution is also used to determine the effective length. Default is 200.
-s/--frag-len-stddev <int> Specifies the fragment length standard deviation. While the empirical distribution is estimated from paired-end reads on-the-fly, this value parameterizes the prior distribution. If only single-end reads are available, this prior distribution is also used to determine the effective length. Default is 60.
-H/--haplotype-file <string> Specifies the location of a comma-separated file of sets of target IDs (one set per line) specifying which targets represent multiple haplotypes of a single feature (i.e., transcript). Useful for allele-specific expression.
--output-align-prob With this option, eXpress outputs an additional file called hits.prob.(sam/bam) containing identical copies of all input alignments with an additional XP tag that contains the estimated probability that each alignment of the read (pair) is the "correct" one. The XP values for all alignments of the same read (pair) will sum to 1.
--output-align-samp With this option, eXpress outputs an additional file called hits.samp.(sam/bam) containing a single alignment for each fragment sampled at random based on the alignment likelihoods calculated by eXpress.
--fr-stranded With this option, eXpress only accepts alignments (single-end or paired) where the first (or only) read is aligned to the forward target sequence and the second read is aligned to the reverse-complemented target sequence. In directional sequencing, this is equivalent to second-strand only. If all reads are single-end, --f-stranded should be used instead. Disabled by default.
--rf-stranded With this option, eXpress only accepts alignments (single-end or paired) where the first (or only) read is aligned to the reverse-complemented target sequence and the second read is aligned to the forward target sequence. In directional sequencing, this is equivalent to first-strand only. If all reads are single-end, --r-stranded should be used instead. Disabled by default.
--f-stranded With this option, eXpress only accepts single-end alignments to the forward target sequence. In directional sequencing, this is equivalent to second-strand only. Disabled by default.
--r-stranded With this option, eXpress only accepts single-end alignments to the reverse-complemented target sequence. In directional sequencing, this is equivalent to first-strand only. Disabled by default.
--no-update-check With this option, eXpress will not ping our server to see if a newer version is available.
Advanced Options:
-f/--forget-param <float> A parameter specifying the rate at which the prior is "forgotten" by increasing the mass of fragments during online processing. Larger numbers (maximum of 1) mean a slower rate, which slows convergence but improves stability. Smaller numbers (minimum of 0.5) increase the rate, which may lead to faster convergence but can also lead to instability.
--library-size <int> Specifies the number of fragments in the library to be used in the FPKM calculation. If left unspecified, this number will be computed from the input.
--max-indel-size <int> A parameter specifying the maximum allowed size of a single indel. Alignments with larger indels will be ignored. A geometric prior for indel length is fit so that all but 10e-6 of the probability mass lies within the allowed region. The default is 10.
--calc-covar With this option, eXpress calculates the covariance between targets and outputs them for use in differential expression analysis. This calculation requires slightly more time and memory.
--expr-alpha <float> A parameter specifying the weight of the uniform target abundance prior, in pseudo-counts per bp. The default is 0.01.
--stop-at <int> A parameter specifying the number of fragments to process before quitting.
--burn-out <int> A parameter specifying the number of fragments after which to stop learning the auxiliary parameters (fragment length, bias, error).
--no-bias-correct With this option, eXpress will not measure and account for sequence-specific biases. Will lead to a slight initial increase in speed at the expense of accuracy.
--no-error-model With this option, eXpress will not measure and account for errors in alignments. Will lead to an increase in speed, but may greatly decrease accuracy.
--aux-param-file <string> Specifies an auxiliary parameter file output by a different run of eXpress to be used as the auxiliary parameters for this round. Greatly improves speed and should be used when a subset of the targets or fragments are being used in a second estimation.
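
As a rough illustration of how the arguments and options above fit together, the following Python sketch assembles an argument list for the command line. The helper name `build_express_command` and its parameters are hypothetical (they are not part of eXpress); only the flags themselves (`-o`, `-m`, `-s`, and the strand options) come from the list above, and only a few of the options are covered.

```python
from typing import List, Optional, Sequence

# Strand options from the list above; at most one is passed here.
STRAND_FLAGS = ("--fr-stranded", "--rf-stranded", "--f-stranded", "--r-stranded")


def build_express_command(
    target_fasta: str,
    alignment_files: Sequence[str],
    output_dir: Optional[str] = None,
    frag_len_mean: Optional[int] = None,
    frag_len_stddev: Optional[int] = None,
    strand_flag: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for express (hypothetical helper, not part of eXpress)."""
    if not alignment_files:
        raise ValueError("at least one alignment file is required")
    if strand_flag is not None and strand_flag not in STRAND_FLAGS:
        raise ValueError(f"unknown strand option: {strand_flag}")

    cmd = ["express"]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if frag_len_mean is not None:
        cmd += ["-m", str(frag_len_mean)]
    if frag_len_stddev is not None:
        cmd += ["-s", str(frag_len_stddev)]
    if strand_flag is not None:
        cmd.append(strand_flag)

    # Positional arguments: the target FASTA file, then the alignment files
    # joined into a single comma-separated list with no spaces.
    cmd.append(target_fasta)
    cmd.append(",".join(alignment_files))
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `express` binary is on the PATH. Building a list rather than a single string avoids shell quoting problems with file names.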

Input Files

Target Sequences (FASTA)

eXpress requires a multi-FASTA file of target sequences for which the abundances will be measured. In the case of RNA-Seq, these are the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome, you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.

Read Alignments (SAM/BAM)

eXpress also requires a file, multiple files, or a piped stream of SAM or binary SAM (BAM) alignments as input. The SAM alignments should be generated by mapping your sequencing reads to the target sequences specified in the multi-FASTA input file described above. For more details on the SAM format, see the specification. Many short read mappers including Bowtie, Bowtie2, BWA, and MAQ can produce output in this format. It is important that you allow many multi-mappings (preferably unlimited) in order to allow eXpress to select the correct alignment instead of the mapper. See Getting Started for an example using Bowtie in both streaming and file input modes.

If using paired-end reads, the read names must match for each pair, excluding '/1' and '/2' suffix identifiers. Also, the SAM file supplied to eXpress should be grouped by read id. If you aligned your reads with Bowtie, your alignments will be properly ordered already. If you used another tool, you should ensure that they are properly sorted. You can sort your SAM using the following command:

    sort -k 1 hits.sam > hits.sam.sorted

You can sort your BAM using this command:

    samtools sort -n hits.bam hits.sorted
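
Before running eXpress on alignments from another tool, it can help to confirm that the SAM file really is grouped by read id. The sketch below is one way to check this in Python, assuming the read name is the first tab-separated field of each alignment line and that header lines start with '@'. The function name `is_grouped_by_read` is made up for this example.

```python
import re
from typing import Iterable

# Matches the '/1' or '/2' mate suffix at the end of a read name.
_MATE_SUFFIX = re.compile(r"/[12]$")


def is_grouped_by_read(sam_lines: Iterable[str]) -> bool:
    """Return True if every read name occupies one contiguous run of lines."""
    seen = set()
    previous = None
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        # The read name is the first tab-separated field; drop any mate suffix.
        name = _MATE_SUFFIX.sub("", line.split("\t", 1)[0])
        if name != previous:
            if name in seen:
                return False  # this read already appeared in an earlier run
            seen.add(name)
            previous = name
    return True
```

For a file on disk, pass the open file handle, for example `is_grouped_by_read(open("hits.sam"))`. The check keeps every read name in memory, so it is intended for spot checks rather than very large files.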

If multiple libraries were prepared for the same sample or multiple read lengths were used in different sequencing runs, the alignments for each should be grouped in separate SAM files so that auxiliary parameters can be estimated independently. The filenames can then be input into eXpress as a comma-separated (with no spaces) list of SAM files. See above for an example. When this feature is used, separate parameter estimates will be output for each library, but only a single abundance file will be produced.

Output Files

Target Abundances (results.xprs)

This file is always output and contains the target abundances and other values calculated from the input sequences and read alignments. The file has 15 tab-delimited columns, sorted by the bundle_id (column 1). The columns are defined as follows:

| # | Column Name | Example | Description |
|---|---|---|---|
| 1 | bundle_id | 10 | ID of the bundle the target belongs to. A bundle is defined as the transitive closure of targets that share multi-mapping reads. |
| 2 | target_id | NM_016467 | The ID given to the target in the input multi-FASTA file. |
| 3 | length | 2182 | The number of base pairs in the target sequence given in the input multi-FASTA file. |
| 4 | eff_length | 783.136288 | The length of the target adjusted for fragment biases (length, sequence-specificity, and relative position). Fragment counts are normalized by this number, not the true length, to calculate FPKM. |
| 5 | tot_counts | 99 | The number of fragments mapping (uniquely or ambiguously) to this target. |
| 6 | uniq_counts | 7 | The number of fragments uniquely mapping to this target. |
| 7 | est_counts | 26.702456 | The estimated number of fragments generated from this target in the sequencing experiment. |
| 8 | eff_counts | 74.399258 | The estimated number of fragments generated from this target in the sequencing experiment, adjusted for fragment and length biases. In other words, this is the expected number of reads from the experiment if these biases did not exist. This is the value recommended for input to count-based differential expression tools. |
| 9 | ambig_distr_alpha | 3.154652 | The alpha parameter for the posterior beta-binomial distribution fit to the ambiguous reads. |
| 10 | ambig_distr_beta | 2.293653 | The beta parameter for the posterior beta-binomial distribution fit to the ambiguous reads. |
| 11 | fpkm | 3.514176 | The estimated relative abundance of this target in the sample in units of fragments per kilobase per million mapped. This value is proportional to est_counts divided by eff_length. |
| 12 | fpkm_conf_low | 2.119151 | The lower bound of the 95% confidence interval for the FPKM. |
| 13 | fpkm_conf_high | 4.909200 | The upper bound of the 95% confidence interval for the FPKM. |
| 14 | solvable | T | A binary (T/F) value indicating whether the likelihood function has a unique maximum. If false (F), the reported posterior distribution is uniform. |
| 15 | tpm | 2.347222e+05 | Transcripts per million. See description. |

See the Methods for more details on how these values are calculated.
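
The relationship between fpkm, est_counts, and eff_length can be explored with a short Python sketch. It assumes that results.xprs begins with a header row naming the columns as listed above (an assumption; check your own file), and the function name `relative_abundances` is hypothetical. Because FPKM is proportional to est_counts divided by eff_length, each target's share of that ratio equals its share of the total FPKM.

```python
import csv
import io
from typing import Dict


def relative_abundances(results_text: str) -> Dict[str, float]:
    """Map each target_id to its share of est_counts / eff_length over all targets."""
    densities = {}
    # Assumes a tab-delimited file whose first row holds the column names.
    for row in csv.DictReader(io.StringIO(results_text), delimiter="\t"):
        eff_length = float(row["eff_length"])
        est_counts = float(row["est_counts"])
        # Targets without a positive effective length contribute nothing.
        densities[row["target_id"]] = est_counts / eff_length if eff_length > 0 else 0.0
    total = sum(densities.values())
    if total == 0:
        raise ValueError("no target has a positive est_counts / eff_length")
    return {target: value / total for target, value in densities.items()}
```

To use it on a real run, read the file into a string first, for example `relative_abundances(open("results.xprs").read())`. The returned fractions sum to 1 and are meant only for quick sanity checks; the fpkm column already written by eXpress remains the reported abundance.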

Parameter Estimates (params.xprs)

This file contains the values of the other parameters (besides abundances and counts) estimated by eXpress. The file is separated into sections for each parameter type, beginning with a '>' symbol. Following this symbol is the section header containing a name for the parameter type followed by the values on subsequent lines. All values belong to this parameter field until the next '>' or the end of the file. The following parameter types are output to this file:

| # | Parameter Type | Description | Output Format |
|---|---|---|---|
| 1 | Fragment Length Distribution | The empirical distribution on fragment lengths. | The fragment length range is listed next to the section header in parentheses (0-800 by default). The next line contains a tab-delimited list of probabilities for these lengths in order. |
| 2 | First Read Mismatch | The first-order Markov model for mismatches between the reference and observed nucleotides for the first read sequenced in a pair. | Each line begins with the 0-indexed nucleotide position in the read, followed by a colon. The column header denotes which "substitution" the probability is for. For example, a value in the column labeled "CG->*T" in the row labeled 10 is the conditional probability that a read has a 'T' at the 11th position given it is mapped to a reference having a 'C' in the 10th position and a 'G' in the 11th. Note that since this is a conditional probability, CG->*A, CG->*C, CG->*G, and CG->*T will sum to 1. |
| 3 | Second Read Mismatch | The first-order Markov model for mismatches between the reference and observed nucleotides for the second read sequenced in a pair. | Same as above. |
| 4 | 5' Sequence-Specific Bias | Parameters relating to the likelihood of the sequence surrounding the 5' end of a fragment in transcript coordinates. See Roberts, et al. (2010a) for more details. | This section is divided into 3 subsections. The first is a matrix of the empirical nucleotide distribution for observed fragments ("Observed Marginal Distribution") at each position in a window surrounding the 5' end of the fragment. The column headers give the 0-indexed position number, with negatives being upstream in the target sequence. Each row gives the probability for a different nucleotide, which is specified in the first column followed by a colon. Note that since this is a probability distribution, each column will sum to 1. The second subsection contains the "Observed Conditional Probabilities". These are the conditional probabilities for the 3rd-order Markov model, with the columns specifying the conditional event in the observed fragments and the rows specifying the window position. The third matrix is the "Expected Conditional Probabilities". This matrix is similar to the previous one, except that the probabilities are calculated assuming target sampling based only on fragment length and relative abundance, and fragment sampling within a target dependent only on length (no sequence biases). Bias weights in eXpress are calculated by taking the ratio of observed to expected probability. |
| 5 | 3' Sequence-Specific Bias | Parameters relating to the likelihood of the sequence surrounding the 3' end of a fragment in transcript coordinates. See Roberts, et al. (2010a) for more details. | Same as above, except for the 3' fragment end. |

If multiple alignment files were provided, a separate parameter file will be output for each, with a unique index identifying its position in the command-line argument given by the user (i.e., the parameters for the second SAM file in the argument list will be written to 'params.2.xprs').
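
The '>'-delimited section layout described above lends itself to a simple line-by-line reader. The Python sketch below splits the text of a params.xprs file into sections keyed by their header line and keeps the value lines as raw strings, since each parameter type has its own format. The function name `parse_params` is hypothetical and the code relies only on the layout described in this section.

```python
from typing import Dict, List


def parse_params(text: str) -> Dict[str, List[str]]:
    """Split params.xprs text into {section header: [value lines]}."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # A '>' opens a new section; the rest of the line is its header.
            current = line[1:].strip()
            sections[current] = []
        elif current is not None and line.strip():
            # All non-empty lines belong to the most recent section.
            sections[current].append(line)
    return sections
```

Each value line can then be split on tabs as needed, for example to turn the single line of the Fragment Length Distribution section into a list of probabilities with `[float(p) for p in line.split("\t")]`.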

Count Variance-Covariance (varcov.xprs)

This file is produced only when the --calc-covar option is used, as described above. The file contains the estimated variances and covariances on the counts between pairs of targets that share multi-mapped reads, primarily for use in differential expression analysis. Since the covariance between targets in different bundles is always 0, the full sparse matrix is broken up into smaller tab-delimited matrices for each bundle. An example of this output for the sample dataset used in the Getting Started tutorial is shown below:

    >1: NM_014212
    0.000000e+00
    >2: NM_001168316, NM_174914, NR_031764
    3.234847e+02	-2.570762e+02	-6.640854e+01
    -2.570762e+02	4.082292e+02	-0.000000e+00
    -6.640854e+01	-0.000000e+00	2.175616e+02
    >3: NM_022658
    0.000000e+00
    >4: NM_173860
    0.000000e+00
    >5: NM_014620, NM_153693, NR_003084, NM_153633, NM_018953, NM_004503
    2.067753e+02	-0.000000e+00	-0.000000e+00	-0.000000e+00	-0.000000e+00	-0.000000e+00
    -0.000000e+00	6.035824e+01	-0.000000e+00	-0.000000e+00	-0.000000e+00	-0.000000e+00
    -0.000000e+00	-0.000000e+00	1.731434e+01	-0.000000e+00	-0.000000e+00	-0.000000e+00
    -0.000000e+00	-0.000000e+00	-0.000000e+00	1.879961e+02	-0.000000e+00	-2.499948e-01
    -0.000000e+00	-0.000000e+00	-0.000000e+00	-0.000000e+00	1.149211e+01	-0.000000e+00
    -0.000000e+00	-0.000000e+00	-0.000000e+00	-2.499948e-01	-0.000000e+00	4.581855e+01
    >6: NM_017409
    0.000000e+00
    >7: NM_017410
    0.000000e+00
    >8: NM_006897
    0.000000e+00

Each bundle's matrix is headed by an identifier line that begins with a greater-than symbol (>) followed by the bundle id, a colon, and a comma-separated list of the targets in the bundle. The ordering of this list provides the indices for the matrix that follows. For example, in bundle 5 of the output above, the sixth value in the fourth row (-2.499948e-01) is the covariance between NM_153633 and NM_004503. Notice that an identical value also appears in the fourth column of the sixth row, as the variance-covariance matrix is always symmetric.
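
The bundle layout described above can be read back with a few lines of Python. This is a minimal sketch under the assumption that every header line has the form `>id: target, target, ...` and that matrix rows are whitespace-delimited numbers. The names `parse_varcov` and `covariance` are made up for this example.

```python
from typing import Dict, List, Tuple

# (target names in matrix order, matrix rows) for one bundle
Bundle = Tuple[List[str], List[List[float]]]


def parse_varcov(text: str) -> Dict[int, Bundle]:
    """Parse varcov.xprs text into {bundle id: (target names, matrix rows)}."""
    bundles: Dict[int, Bundle] = {}
    current_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: '>' + bundle id + ':' + comma-separated targets.
            bundle_id, _, targets = line[1:].partition(":")
            current_id = int(bundle_id)
            names = [t.strip() for t in targets.split(",") if t.strip()]
            bundles[current_id] = (names, [])
        elif current_id is not None:
            # Matrix row for the most recent bundle.
            bundles[current_id][1].append([float(v) for v in line.split()])
    return bundles


def covariance(bundles: Dict[int, Bundle], bundle_id: int, a: str, b: str) -> float:
    """Look up the covariance between targets a and b within one bundle."""
    names, matrix = bundles[bundle_id]
    return matrix[names.index(a)][names.index(b)]
```

Looking up a pair by name rather than by position avoids off-by-one mistakes when reading larger bundles, and swapping the two target names returns the same value because the matrix is symmetric.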

See the Methods for more details on how these values are calculated.
