Installation
Installing a pre-compiled binary release
In order to make it easy to install eXpress, we provide a few binary packages that save you the trouble of having to compile from source. To use the binary packages for OSX/Linux (Windows), simply download the appropriate one for your machine, untar (unzip) it, and make sure the express(.exe) binary is in a directory in your PATH environment variable.
If using the Windows binary, you will also need to install the Visual C++ 2010 Runtime Library.
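For the OSX/Linux case, putting the binary's directory on your PATH can be sketched as follows. The helper name add_to_path and the directory /opt/express are hypothetical, chosen only for illustration; substitute the directory where you actually untarred the package.

```shell
# Prepend a directory to PATH unless it is already listed.
# (add_to_path is an illustrative helper, not part of eXpress.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;     # otherwise put it at the front
    esac
    export PATH
}

# Hypothetical location of the untarred binary package; adjust to your system.
EXPRESS_BIN_DIR="/opt/express"
add_to_path "$EXPRESS_BIN_DIR"
```

Placing these lines in your shell start-up file (for example ~/.bashrc) keeps the setting across sessions.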
Installing from source
Note: Installing from source is often an unnecessary hassle. If a binary is available for your system, it is highly recommended that you use it. If one is not available, please send us a request to ask.xprs@gmail.com.
The instructions below include commands that are compatible with Unix/Linux/Mac. If compiling on a Windows machine, use the appropriate DOS commands instead. Assuming you are using the Visual C++ compiler, you must type your commands from the Visual Studio Command Prompt.
• Install C++ Compiler (if not already available)
While most Linux systems will already have GCC installed, Mac OS X and Windows do not have a C++ compiler installed by default. Free options include XCode for Mac OS X and Visual C++ Express for Windows.
• Download and extract the source
First you must download source code from the Download menu at the top of this page. Untar the file using:
$ tar -xf express-<EXPRESS_VERSION>-src.tgz
From now on we refer to the path to the directory that is created as <YOUR_EXPRESS_DIR>.
• Install CMake
Fortunately, the developers of CMake provide simple installation packages for most architectures. Find the right package for your system at their website.
• Install BamTools
With CMake installed, BamTools installation is straightforward.
- Download the source from here and untar into <YOUR_EXPRESS_DIR>.
- Navigate to <YOUR_EXPRESS_DIR>/bamtools.
$ cd <YOUR_EXPRESS_DIR>/bamtools
- Make a new directory called build and navigate to it.
$ mkdir build
$ cd build
- Have CMake generate the makefile.
$ cmake ..
- Build the BamTools libraries by calling make.
$ make
• Install BOOST
If you have a package manager such as yum (Linux) or MacPorts (OS X) installed, it will probably be the easiest way to install BOOST. Be sure to install boost-devel so that the headers are included. If you are using Windows, you can use the installer found here. Otherwise, follow the instructions below.
- Download Boost and the bjam build engine.
- Unpack bjam and add it to your PATH.
- Unpack the Boost tarball and cd to the Boost source directory. This directory is called the BOOST_ROOT in some Boost installation instructions.
- Build Boost. Note that you can specify where to put Boost with the --prefix option. The default Boost installation directory is /usr/local.
- If you are on Mac OS X, type:
$ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=darwin architecture=x86 address_model=32_64 link=static runtime-link=static --layout=versioned stage install
- If you are on a 32-bit Linux system, type:
$ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=32 link=static runtime-link=static stage install
- If you are on a 64-bit Linux system, type:
$ bjam --prefix=<YOUR_BOOST_INSTALL_DIRECTORY> --toolset=gcc architecture=x86 address_model=64 link=static runtime-link=static stage install
• Install eXpress
We are now ready to build and install eXpress!
- Navigate to <YOUR_EXPRESS_DIR>.
$ cd <YOUR_EXPRESS_DIR>
- Make a new directory called build and navigate to it.
$ mkdir build
$ cd build
- Have CMake generate the makefile.
$ cmake ..
- Build the eXpress binary by calling make.
$ make
- Copy the binary to /usr/local/bin (or alternatively to another directory in your PATH).
$ sudo make install
You should now be able to type express from any directory and see a print-out of the usage and options. If you do not see this and there were no errors in the compilation, double check to see that the binary was copied into a directory in your PATH.
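One way to double check where (or whether) the binary resolves on your PATH is the shell's built-in command -v. The sketch below wraps it in a small helper; the name check_on_path is hypothetical and not part of eXpress.

```shell
# Report where a command resolves on the current PATH,
# or return a non-zero status if it cannot be found.
check_on_path() {
    if location=$(command -v "$1"); then
        echo "$1 found at $location"
    else
        echo "$1 not found in PATH" >&2
        return 1
    fi
}

# Usage after installation:
#   check_on_path express
```

If the command is reported as not found, copy the binary into a directory that is listed in PATH, or add its directory to PATH.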
General Use Case: RNA-Seq abundances
Required input
eXpress requires two input files:
- A multi-FASTA file containing the transcript sequences. If the transcriptome of your organism is not annotated, you can generate this file from your sequencing reads using a de novo transcriptome assembler such as Trinity, Oases, or Trans-ABySS. If your organism has a reference genome you can assemble transcripts directly from mapped reads using Cufflinks. If your genome is already annotated (in GTF/GFF), you can generate a multi-FASTA file using the UCSC Genome Browser by uploading your annotation as a track and downloading the sequences under the "Tables" tab.
- Read alignments to the multi-FASTA file in SAM or BAM format. These can either be stored in a file or streamed directly from an aligner. It is important that you allow as many multi-mappings as possible. You can also allow many mismatches during mapping since eXpress builds an error model to probabilistically assign the reads, although this will increase mapping time. If you are combining reads from several library preparations or from sequencing runs using different read lengths, please see the Manual for important details on how the alignments should be input.
An example
In the following two sub-sections, you will run eXpress on a sample RNA-Seq dataset with simulated reads from UGT3A2 and the HOXC cluster using the human genome build hg18. Both the transcript sequences (transcripts.fasta) and raw reads (reads_1.fastq, reads_2.fastq) can be found in the <YOUR_EXPRESS_DIR>/sample_data directory. For this example to work, you will need to have both Bowtie and SAMtools installed, but in general any aligner will work and the conversion to BAM is not necessary unless you have insufficient disk space to store the uncompressed SAM.
Before you begin, you must prepare your Bowtie index. Since you wish to allow many multi-mappings, it is useful to build the index with a small offrate (in this case 1). The smaller the offrate, the larger the index and the faster the mapping. If you have disk space to spare, always use an offrate of 1. Build the index with the following commands.
$ cd <YOUR_EXPRESS_DIR>/sample_data
$ bowtie-build --offrate 1 transcripts.fasta transcripts
This command will populate your directory with several index files that allow Bowtie to more easily align reads to the transcripts.
You can now map the reads to the transcript sequences using the following Bowtie command, which outputs in SAM (-S), allows for unlimited multi-mappings (-a), and sets a maximum insert distance of 800 bp between the paired-ends (-X 800). The first three options (-a, -S, -X) are highly recommended for best results. You should also allow for many mismatches, since eXpress models sequencing errors. Furthermore, you will want to take advantage of multiple processors when mapping large files using the -p option. See the Bowtie Manual for more details on various parameters and options.
The SAM output from Bowtie is piped into SAMtools in order to compress it to BAM format. This conversion is optional, but will greatly reduce the size of the alignment file.
$ bowtie -aS -X 800 --offrate 1 transcripts -1 reads_1.fastq -2 reads_2.fastq | samtools view -Sb - > hits.bam
Input from SAM/BAM file
Once you have aligned your reads to the transcriptome and stored them in a SAM or BAM file, you can run eXpress in default mode with the command:
$ express transcripts.fasta hits.bam
The default settings will be sufficient for most users, but please see the Manual for full descriptions of all available options.
Streaming input from aligner
If you do not wish to store an intermediate SAM/BAM file, you can pipe the Bowtie output directly into eXpress with the command:
$ bowtie -aS -X 800 --offrate 1 transcripts -1 reads_1.fastq -2 reads_2.fastq | express transcripts.fasta
This will give you exactly the same output as the previous command, while avoiding the need for potentially large amounts of disk space to store the mapped reads.
Understanding the output
The output for eXpress is saved in a file called results.xprs in an easy-to-parse tab-delimited format. Since an output directory was not specified, it is simply placed in the working directory (<YOUR_EXPRESS_DIR>/sample_data). You can view the output for the run using the command:
$ less results.xprs
Your results should look like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable
1 NM_014620 2300 2123.247776 958 77 520.554767 563.888952 39.404036 38.861284 24516.910965 21256.860893 27776.961037 T
1 NM_153693 2072 1895.275039 481 5 119.916333 131.097934 19.284056 60.593282 6327.120373 4587.418700 8066.822046 T
1 NR_003084 1640 1463.326696 266 0 17.689103 19.824779 6.175215 86.684623 1208.828007 348.657645 2068.998369 T
1 NM_153633 1666 1489.323587 762 10 416.151511 465.518994 64.183691 54.654069 27942.316563 23818.690311 32065.942816 T
1 NM_018953 1612 1435.330044 228 91 212.893398 239.097731 3.148257 0.390173 14832.365459 11791.549012 17873.181907 T
1 NM_004503 1681 1504.321793 384 37 297.794888 332.770029 16.332212 5.398573 19795.956525 16293.650427 23298.262622 T
2 NM_014212 2037 1860.279224 55 55 55.000000 60.224830 0.000000 0.000000 2956.545409 1737.816128 4175.274690 T
3 NM_173860 849 672.421281 962 962 962.000000 1214.622475 0.000000 0.000000 143065.073604 129374.738511 156755.408697 T
4 NM_022658 2288 2111.249211 4881 4881 4881.000000 5289.630397 0.000000 0.000000 231190.139724 221343.835827 241036.443620 T
5 NM_017410 2396 2219.236296 42 42 42.000000 45.345329 0.000000 0.000000 1892.542947 953.138236 2831.947658 T
6 NM_006897 1541 1364.338534 664 664 664.000000 749.978084 0.000000 0.000000 48668.272828 42912.224535 54424.321120 T
7 NM_017409 1959 1782.288551 47 47 47.000000 51.659985 0.000000 0.000000 2637.058964 1581.064918 3693.053009 T
8 NM_001168316 2283 2106.249809 1552 12 443.212928 480.406032 86.571551 222.603290 21042.752189 18083.116378 24002.388000 T
8 NM_174914 2385 2208.237612 1745 38 1049.949880 1133.995024 74.786096 51.366264 47546.961175 43110.530924 51983.391426 T
8 NR_031764 1853 1676.301226 1243 7 270.837192 299.386118 100.878291 371.706966 16156.833161 13233.910476 19079.755847 T
While it may be difficult to read in your terminal, opening the file with R or Excel should help you to visualize the columns. An important column is fpkm, which reports the estimated abundance in expected Fragments Per Kilobase per Million mapped fragments (FPKM). Other fields include the estimated counts (est_counts) and the parameters of the posterior count distribution (ambig_distr_alpha, ambig_distr_beta), as well as the "effective" estimated counts (eff_counts) after correction for bias. Transcripts are sorted by bundle_id, which denotes the multi-mapping group each transcript belongs to and can help identify isoforms and gene families. More details on the output, including full descriptions of all columns, can be found in the Manual.
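If you prefer to stay in the terminal, a few columns can be pulled out of the tab-delimited file with awk. The sketch below is only an illustration: the function name xprs_by_fpkm is hypothetical, and it assumes the header line uses the column names shown in the sample output (target_id, est_counts, fpkm).

```shell
# Print target_id, est_counts and fpkm from an eXpress results file,
# most abundant transcript first. Columns are located by header name,
# so the function does not depend on their position in the file.
xprs_by_fpkm() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map names to indices
        { printf "%s\t%s\t%s\n", $(col["target_id"]), $(col["est_counts"]), $(col["fpkm"]) }
    ' "$1" | sort -t "$(printf '\t')" -k3,3nr                     # numeric, descending fpkm
}

# Usage in the directory holding the results:
#   xprs_by_fpkm results.xprs | less
```

Looking columns up by name rather than by number keeps the command working even if the column order differs from the sample shown here.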
Calculating differential expression
If you have multiple samples and replicates, you may want to discover if there is a significant change in abundance of any genes or isoforms under different conditions. Unfortunately, a differential expression tool does not yet exist to take advantage of the full distribution on estimated counts that we output, but we are working on one that will be available soon. For now, we recommend inputting the rounded effective counts for your samples into a count-based differential expression tool such as DEGSeq or edgeR.
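Producing the rounded effective counts can be sketched with awk as follows. The function name xprs_rounded_counts and the output file name in the usage comment are hypothetical, and the sketch assumes the header names target_id and eff_counts shown in the sample output. Rounding is done as int(x + 0.5), which rounds halves up and is adequate here because counts are non-negative.

```shell
# Emit "target_id<TAB>rounded eff_counts" for every transcript in an
# eXpress results file, locating both columns by their header names.
xprs_rounded_counts() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map names to indices
        { printf "%s\t%d\n", $(col["target_id"]), int($(col["eff_counts"]) + 0.5) }
    ' "$1"
}

# Usage, once per sample (the output file name is only an example):
#   xprs_rounded_counts results.xprs > sample1.counts
```

The per-sample files can then be combined into whatever count table your chosen differential expression tool expects.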
Additional Resources
Harold Pimentel has made an excellent walkthrough for today's *Seq I Meeting, which is available here. New users who need help getting started with eXpress should have a look!