eXpress

 

Streaming quantification for high-throughput sequencing

Getting Started

Installation

Installing a pre-compiled binary release

To make eXpress easy to install, we provide binary packages that save you the trouble of compiling from source. To use a binary package for OS X/Linux (Windows), download the appropriate one for your machine, untar (unzip) it, and make sure the express(.exe) binary is in a directory listed in your PATH environment variable.

If using the Windows binary, you will also need to install the Visual C++ 2010 Runtime Library.
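
For the OS X/Linux packages, the PATH step might look like the sketch below. The helper name prepend_to_path and the example directory are assumptions made for illustration; substitute the directory where you actually unpacked the release.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;              # already present, nothing to do
        *) PATH="$dir:$PATH" ;;     # otherwise put it at the front
    esac
    export PATH
}

# Example usage (the directory is an assumption, not a fixed location):
# prepend_to_path "$HOME/express"
```

To make the change permanent, the same call would typically go in your shell start-up file.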

Installing from source

Note: Installing from source is often an unnecessary hassle. If a binary is available for your system, it is highly recommended that you use it. If one is not available, please send us a request to ask.xprs@gmail.com.

The instructions below include commands that are compatible with Unix/Linux/Mac. If compiling on a Windows machine, use the appropriate DOS commands instead. Assuming you are using the Visual C++ compiler, you must type your commands from the Visual Studio Command Prompt.

• Install C++ Compiler (if not already available)

While most Linux systems will already have GCC installed, Mac OS X and Windows do not have a C++ compiler installed by default. Free options include XCode for Mac OS X and Visual C++ Express for Windows.

• Download and extract the source

First you must download source code from the Download menu at the top of this page. Untar the file using:

    $ tar -xf express-<EXPRESS_VERSION>-src.tgz

From now on we refer to the path to the directory that is created as <YOUR_EXPRESS_DIR>.

• Install CMake

Fortunately, the developers of CMake provide simple installation packages for most architectures. Find the right package for your system at their website.

• Install BamTools

With CMake installed, BamTools installation is straightforward.

  1. Download the source from here and untar into <YOUR_EXPRESS_DIR>.
  2. Navigate to <YOUR_EXPRESS_DIR>/bamtools.
    $ cd <YOUR_EXPRESS_DIR>/bamtools
  3. Make a new directory called build and navigate to it.
    $ mkdir build
    $ cd build
  4. Have CMake generate the makefile.
    $ cmake ..
  5. Build the BamTools libraries by calling make.
    $ make

• Install BOOST

If you have a package manager such as yum (Linux) or MacPorts (OS X) installed, it will probably be the easiest way to install BOOST. Be sure to install boost-devel so that the headers are included. If you are using Windows, you can use the installer found here. Otherwise, follow the instructions below.

  1. Download Boost and the bjam build engine.
  2. Unpack bjam and add it to your PATH.
  3. Unpack the Boost tarball and cd to the Boost source directory. This directory is called the BOOST_ROOT in some Boost installation instructions.
  4. Build Boost. Note that you can specify where to put Boost with the --prefix option. The default Boost installation directory is /usr/local.
    • If you are on Mac OS X, type:
      $ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=darwin architecture=x86 address_model=32_64 link=static runtime-link=static --layout=versioned stage install
    • If you are on a 32-bit Linux system, type:
      $ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=32 link=static runtime-link=static stage install
    • If you are on a 64-bit Linux system, type:
      $ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=64 link=static runtime-link=static stage install

• Install eXpress

We are now ready to build and install eXpress!

  1. Navigate to <YOUR_EXPRESS_DIR>.
    $ cd <YOUR_EXPRESS_DIR>
  2. Make a new directory called build and navigate to it.
    $ mkdir build
    $ cd build
  3. Have CMake generate the makefile.
    $ cmake ..
  4. Build the eXpress binary by calling make.
    $ make
  5. Copy the binary to /usr/lib/bin (or alternatively to another directory in your PATH).
    $ sudo make install

You should now be able to type express from any directory and see a print-out of the usage and options. If you do not see this and there were no errors in the compilation, double check to see that the binary was copied into a directory in your PATH.
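
If the usage message does not appear, a quick check along the following lines can show whether the shell can find the binary at all. This is only a sketch; check_on_path is a hypothetical helper name, not something shipped with eXpress.

```shell
# Hypothetical helper: report whether a command can be found on the PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found; check that its directory is listed in PATH" >&2
        return 1
    fi
}

# Example usage:
# check_on_path express
```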

General Use Case: RNA-Seq abundances

Required input

eXpress requires two input files:

  1. A multi-FASTA file containing the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.
  2. Read alignments to the multi-FASTA file in SAM or BAM format. These can either be stored in a file or streamed directly from an aligner. It is important that you allow as many multi-mappings as possible. You can also allow many mismatches during mapping, since eXpress builds an error model to probabilistically assign the reads, although this will increase mapping time. If you are combining reads from several library preparations or from sequencing runs using different read lengths, please see the Manual for important details on how the alignments should be input.

An example

In the following two sub-sections, you will run eXpress on a sample RNA-Seq dataset with simulated reads from UGT3A2 and the HOXC cluster using the human genome build hg18. Both the transcript sequences (transcripts.fasta) and raw reads (reads_1.fastq, reads_2.fastq) can be found in the <YOUR_EXPRESS_DIR>/sample_data directory. For this example to work, you will need to have both Bowtie and SAMtools installed, but in general any aligner will work and the conversion to BAM is not necessary unless you have insufficient disk space to store the uncompressed SAM.

Before you begin, you must prepare your Bowtie index. Since you wish to allow many multi-mappings, it is useful to build the index with a small offrate (in this case 1). The smaller the offrate, the larger the index and the faster the mapping. If you have disk space to spare, always use an offrate of 1. Build the index with the following commands.

    $ cd <YOUR_EXPRESS_DIR>/sample_data
    $ bowtie-build --offrate 1 transcripts.fasta transcripts

This command will populate your directory with several index files that allow Bowtie to more easily align reads to the transcripts.

You can now map the reads to the transcript sequences using the following Bowtie command, which outputs in SAM (-S), allows for unlimited multi-mappings (-a), and sets a maximum insert size of 800 bp for the paired-end reads (-X 800). These three options (-a, -S, -X) are highly recommended for best results. You should also allow for many mismatches, since eXpress models sequencing errors. Furthermore, you will want to take advantage of multiple processors when mapping large files by using the -p option. See the Bowtie Manual for more details on various parameters and options.

The SAM output from Bowtie is piped into SAMtools in order to compress it to BAM format. This conversion is optional, but will greatly reduce the size of the alignment file.

    $ bowtie -aS -X 800 --offrate 1 transcripts -1 reads_1.fastq -2 reads_2.fastq | samtools view -Sb - > hits.bam
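
For larger datasets, the same mapping command can be given Bowtie's -p option to use several processors. The wrapper below is a minimal sketch: the function name map_reads, its two arguments, and the default of 4 threads are assumptions chosen for illustration, while the Bowtie and SAMtools options are the ones used above plus -p.

```shell
# Hypothetical wrapper around the mapping command above; the function name and
# its arguments are illustrative and not part of Bowtie, SAMtools, or eXpress.
map_reads() {
    threads="${1:-4}"       # number of Bowtie threads, passed to -p
    out="${2:-hits.bam}"    # compressed alignment file to write
    bowtie -aS -X 800 -p "$threads" --offrate 1 transcripts \
        -1 reads_1.fastq -2 reads_2.fastq \
        | samtools view -Sb - > "$out"
}

# Example usage, run from the sample_data directory after building the index:
# map_reads 8 hits.bam
```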

Input from SAM/BAM file

Once you have aligned your reads to the transcriptome and stored them in a SAM or BAM file, you can run eXpress in default mode with the command:

    $ express transcripts.fasta hits.bam

The default settings will be sufficient for most users, but please see the Manual for full descriptions of all available options.

Streaming input from aligner

If you do not wish to store an intermediate SAM/BAM file, you can pipe the Bowtie output directly into eXpress with the command:

    $ bowtie -aS -X 800 --offrate 1 transcripts -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta

This will give you exactly the same output as the previous command, while avoiding the need for potentially large amounts of disk space to store the mapped reads.

Understanding the output

The output for eXpress is saved in a file called results.xprs in an easy-to-parse tab-delimited format. Since an output directory was not specified, it is simply placed in the working directory (<YOUR_EXPRESS_DIR>/sample_data). You can view the output for the run using the command:

    $ less results.xprs

Your results should look like this:

    bundle_id	target_id	length	eff_length	tot_counts	uniq_counts	est_counts	eff_counts	ambig_distr_alpha	ambig_distr_beta	fpkm	fpkm_conf_low	fpkm_conf_high	solvable
    1	NM_014620	2300	2123.247776	958	77	520.554767	563.888952	39.404036	38.861284	24516.910965	21256.860893	27776.961037	T
    1	NM_153693	2072	1895.275039	481	5	119.916333	131.097934	19.284056	60.593282	6327.120373	4587.418700	8066.822046	T
    1	NR_003084	1640	1463.326696	266	0	17.689103	19.824779	6.175215	86.684623	1208.828007	348.657645	2068.998369	T
    1	NM_153633	1666	1489.323587	762	10	416.151511	465.518994	64.183691	54.654069	27942.316563	23818.690311	32065.942816	T
    1	NM_018953	1612	1435.330044	228	91	212.893398	239.097731	3.148257	0.390173	14832.365459	11791.549012	17873.181907	T
    1	NM_004503	1681	1504.321793	384	37	297.794888	332.770029	16.332212	5.398573	19795.956525	16293.650427	23298.262622	T
    2	NM_014212	2037	1860.279224	55	55	55.000000	60.224830	0.000000	0.000000	2956.545409	1737.816128	4175.274690	T
    3	NM_173860	849	672.421281	962	962	962.000000	1214.622475	0.000000	0.000000	143065.073604	129374.738511	156755.408697	T
    4	NM_022658	2288	2111.249211	4881	4881	4881.000000	5289.630397	0.000000	0.000000	231190.139724	221343.835827	241036.443620	T
    5	NM_017410	2396	2219.236296	42	42	42.000000	45.345329	0.000000	0.000000	1892.542947	953.138236	2831.947658	T
    6	NM_006897	1541	1364.338534	664	664	664.000000	749.978084	0.000000	0.000000	48668.272828	42912.224535	54424.321120	T
    7	NM_017409	1959	1782.288551	47	47	47.000000	51.659985	0.000000	0.000000	2637.058964	1581.064918	3693.053009	T
    8	NM_001168316	2283	2106.249809	1552	12	443.212928	480.406032	86.571551	222.603290	21042.752189	18083.116378	24002.388000	T
    8	NM_174914	2385	2208.237612	1745	38	1049.949880	1133.995024	74.786096	51.366264	47546.961175	43110.530924	51983.391426	T
    8	NR_031764	1853	1676.301226	1243	7	270.837192	299.386118	100.878291	371.706966	16156.833161	13233.910476	19079.755847	T
    

While it may be difficult to read in your terminal, opening the file with R or Excel should help you to visualize the columns. An important column is fpkm, which reports the estimated abundance in expected Fragments Per Kilobase per Million mapped fragments (FPKM). Other fields include the estimated counts (est_counts) and parameters for the posterior count distribution (ambig_distr_alpha/beta), as well as the "effective" estimated counts (eff_counts) after correction for bias. Transcripts are sorted by their bundle_id, which denotes the multi-mapping group each transcript belongs to and can help identify isoforms and gene families. More details on the output, including full descriptions of all columns, can be found in the Manual.
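
Because the file is tab-delimited, individual columns can also be pulled out on the command line. The following sketch lists each transcript with its FPKM, highest first; the function name top_fpkm is a hypothetical helper, and the columns are looked up by the header names shown above rather than by position.

```shell
# Hypothetical helper: print target_id and fpkm from an eXpress results file,
# sorted so that the most abundant transcripts come first.
top_fpkm() {
    results="${1:-results.xprs}"
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
        { print $(col["target_id"]) "\t" $(col["fpkm"]) }
    ' "$results" | sort -t "$(printf '\t')" -k2,2nr
}

# Example usage:
# top_fpkm results.xprs | head -n 5
```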

Calculating differential expression

If you have multiple samples and replicates, you may want to discover if there is a significant change in abundance of any genes or isoforms under different conditions. Unfortunately, a differential expression tool does not yet exist to take advantage of the full distribution on estimated counts that we output, but we are working on one that will be available soon. For now, we recommend inputting the rounded effective counts for your samples into a count-based differential expression tool such as DEGSeq or edgeR.
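
As a rough sketch of the rounding step, the effective counts can be extracted from each sample's results file with awk. The function name rounded_eff_counts and the output file name in the usage comment are assumptions for illustration; how the resulting table is loaded depends on the differential expression tool you choose.

```shell
# Hypothetical helper: print target_id and eff_counts rounded to the nearest
# integer, one transcript per line, tab-separated.
rounded_eff_counts() {
    results="${1:-results.xprs}"
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
        { printf "%s\t%d\n", $(col["target_id"]), int($(col["eff_counts"]) + 0.5) }
    ' "$results"
}

# Example usage (the output file name is an assumption):
# rounded_eff_counts results.xprs > sample1_counts.txt
```

Running this once per sample gives one count column per sample, which can then be joined on target_id before being passed to the count-based tool.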

Additional Resources

Harold Pimentel has made an excellent walkthrough for today's *Seq I Meeting, which is available here. New users who need help getting started with eXpress should have a look!
