RNA-Seq-based prediction with Augustus-cgp

This tutorial describes how RNA-Seq data can be incorporated into Augustus-cgp. In general, the same types of extrinsic evidence can be incorporated as in single-species gene finding with Augustus (including RNA-Seq, cDNA, ESTs, protein sequences, etc.). In cgp mode, evidence is species-specific and can be incorporated for all genomes or for a subset of them.

1. Generate hints from RNA-Seq

For most species, raw RNA-Seq data (in fastq format) is readily available and can be downloaded from the NCBI Sequence Read Archive (SRA).

To integrate RNA-Seq into Augustus, the usual procedure involves:

aligning the RNA-Seq reads to the source genome with a spliced aligner (e.g. STAR)
quality and uniqueness filtering of the alignments with filterBam
generating intron hints with bam2hints
generating exonpart hints with bam2wig and wig2hints.pl

We will skip this procedure, as it is explained in detail in other tutorials (see section 'RNA-Seq integration' in the Augustus Wiki), and assume that hints files have already been generated for some of the species in our clade (chicken, human, mouse and rhesus).

2. Load RNA-Seq hints into the SQLite database

Prepare a text file with a list of species names and location of the corresponding hints files.

File format (one genome per line; name and path separated by a tab):

name_of_genome_1  path/to/hintsfile_1
name_of_genome_2  path/to/hintsfile_2
...
name_of_genome_N  path/to/hintsfile_N

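If the hints files are named <genome>.hints.gff and are located in the directory hints/, the table can be generated with

for f in "$PWD"/hints/*.gff; do printf '%s\t%s\n' "$(basename "$f" .hints.gff)" "$f"; done >hints.tbl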

The file hints.tbl will now look like this (except for the parent directory):

galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/hints/galGal4.hints.gff
hg38    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/hints/hg38.hints.gff
mm10    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/hints/mm10.hints.gff
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/hints/rheMac3.hints.gff
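Before loading, it may be worth checking that every line of the table really has two tab-separated columns and that each hints file exists and is not empty. The following is a minimal sketch; check_hints_tbl is a hypothetical helper written for this tutorial, not a tool shipped with Augustus.

```bash
# Sketch: sanity-check a hints table before loading it into the database.
# check_hints_tbl is a hypothetical helper, not part of Augustus.
check_hints_tbl() {
  local tbl=$1 status=0 lineno=0 species hints extra tab
  tab=$(printf '\t')
  # the second test keeps a last line that lacks a trailing newline
  while IFS=$tab read -r species hints extra || [ -n "$species" ]; do
    lineno=$((lineno + 1))
    if [ -z "$species" ] || [ -z "$hints" ] || [ -n "$extra" ]; then
      echo "line $lineno: expected exactly two tab-separated columns" >&2
      status=1
    elif [ ! -s "$hints" ]; then
      echo "line $lineno: hints file missing or empty: $hints" >&2
      status=1
    fi
  done <"$tbl"
  return "$status"
}
```

Calling `check_hints_tbl hints.tbl && echo "hints.tbl looks fine"` prints the confirmation only if no problem was reported.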

If you don't have a database with the genomes yet, follow the instructions in '1. Load genomes ...' to create the database vertebrates.db.

Make a copy of the database

cp vertebrates.db vertebrates_rnaseq.db

and load the hints into the new database

# each line of hints.tbl holds two tab-separated columns: genome name and hints file
while IFS=$'\t' read -r species hints
do
  load2sqlitedb --noIdx --species="$species" --dbaccess=vertebrates_rnaseq.db "$hints"
done <hints.tbl

Finalize the database by creating indices on the tables:

load2sqlitedb --makeIdx --dbaccess=vertebrates_rnaseq.db
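Hints are associated with genomes by name, so a misspelled name in hints.tbl would most likely leave the hints unconnected to the intended genome. The sketch below lists names that occur in hints.tbl but not in the genome table. It assumes that the genome table (genomes.tbl, which is passed to Augustus with --speciesfilenames in step 4) has the genome name in its first column; missing_genomes is a hypothetical helper.

```bash
# Sketch: print genome names that occur in the hints table ($1)
# but not in the genome table ($2).
# missing_genomes is a hypothetical helper; both tables are assumed to
# carry the genome name in their first whitespace-separated column.
missing_genomes() {
  awk '
    FILENAME == ARGV[1] { if (NF) known[$1] = 1; next }  # genome table
    NF && !($1 in known) { print $1 }                    # hints table
  ' "$2" "$1"
}
```

With the tables of this tutorial, `missing_genomes hints.tbl genomes.tbl` should print nothing if all names are consistent.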

You can check whether loading was successful with the following database query:

sqlite3 -header -column vertebrates_rnaseq.db "\
  SELECT count(*) AS '#hints',typename,speciesname\
  FROM (hints as H join featuretypes as F on H.type=F.typeid)\
        natural join speciesnames\
  GROUP BY speciesid,typename;"

The query returns a summary of how many hints of each type are in the database for each species:

#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4
129         intron      galGal4
7905        exonpart    hg38
267         intron      hg38
7930        exonpart    mm10
378         intron      mm10
11050       exonpart    rheMac3
265         intron      rheMac3
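For comparison, the hints can also be counted directly in the input files. The sketch below tallies the feature type in the third column of a hints GFF file; count_hints_by_type is a hypothetical helper. The counts should be close to the database summary, although the file may use an abbreviated type name (for example ep) where the database reports exonpart, and whether the loader treats duplicate lines differently is not verified here.

```bash
# Sketch: count hints per feature type (GFF column 3) in one hints file.
# count_hints_by_type is a hypothetical helper, not part of Augustus.
count_hints_by_type() {
  awk -F '\t' '
    !/^#/ && NF >= 9 { n[$3]++ }                       # skip comments and malformed lines
    END { for (t in n) printf "%d\t%s\n", n[t], t }
  ' "$1" | sort -k2,2
}
```

For example, `count_hints_by_type hints/hg38.hints.gff` prints one line per hint type with the count in the first column.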

3. Prepare an extrinsic config file

Start by copying the following extrinsic configuration file:

cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq.cfg

Open the extrinsic-rnaseq.cfg file with a text editor, go to the first [GROUP] section and replace the line 'none' that follows it

[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none

as instructed, by the names of the genomes with RNA-Seq hints, i.e.

[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
hg38 mm10 rheMac3 galGal4
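The same edit can be scripted, which is convenient when the list of genomes changes between experiments. The sketch below assumes the layout shown above, i.e. that the line directly after the first [GROUP] line reads 'none'; set_group_genomes is a hypothetical helper and writes the edited configuration to standard output.

```bash
# Sketch: replace the 'none' line that directly follows the first [GROUP]
# line by a list of genome names; all other lines are printed unchanged.
# set_group_genomes is a hypothetical helper, not part of Augustus.
set_group_genomes() {
  local cfg=$1 genomes=$2
  awk -v genomes="$genomes" '
    /^\[GROUP\]/ && !seen { seen = 1; expect = 1; print; next }
    expect {
      expect = 0
      if ($0 ~ /^[[:space:]]*none[[:space:]]*$/) { print genomes; next }
    }
    { print }
  ' "$cfg"
}
```

A possible call is `set_group_genomes extrinsic-rnaseq.cfg "hg38 mm10 rheMac3 galGal4" >tmp.cfg && mv tmp.cfg extrinsic-rnaseq.cfg`. If the line after the first [GROUP] is not 'none', the file is passed through unmodified.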

4. Run Augustus-cgp with RNA-Seq hints

Create a new folder for the RNA-Seq experiments and switch to it:

mkdir augCGP_rnaseq
cd augCGP_rnaseq

For convenience, assign each alignment chunk a job ID by creating symbolic links:

num=1
for f in ../mafs/*.maf; do ln -s "$f" "$num.maf"; ((num++)); done

Run Augustus, retrieving the hints from the database (about 3 minutes):

for ali in *.maf
do
  id=${ali%.maf} # remove .maf suffix
  augustus \
    --species=human \
    --softmasking=1 \
    --treefile=../tree.nwk \
    --alnfile="$ali" \
    --dbaccess=../vertebrates_rnaseq.db \
    --speciesfilenames=../genomes.tbl \
    --alternatives-from-evidence=0 \
    --dbhints=1 \
    --UTR=1 \
    --allow_hinted_splicesites=atac \
    --extrinsicCfgFile=../extrinsic-rnaseq.cfg \
    --/CompPred/outdir="pred$id" >"aug$id.out" 2>"err$id.out" &
done
wait # block until all background jobs have finished

The option --UTR=1 enables the UTR model and is recommended whenever 'exonpart' hints are incorporated.

This will generate the folders pred*/ (one for each alignment chunk), each containing GFF files with the gene predictions for each input genome:

bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
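Once all jobs have finished, a quick way to spot failed runs is to look for alignment chunks whose output folder contains no prediction file. The following is a minimal sketch based on the file naming used above (N.maf, predN/, *.cgp.gff); list_incomplete_chunks is a hypothetical helper.

```bash
# Sketch: print the job IDs of alignment chunks without any *.cgp.gff output.
# list_incomplete_chunks is a hypothetical helper, not part of Augustus.
list_incomplete_chunks() {
  local dir=${1:-.} ali id
  for ali in "$dir"/*.maf; do
    [ -e "$ali" ] || continue            # skip if the glob matched nothing
    id=$(basename "$ali" .maf)
    # ls fails when predN/ is missing or holds no prediction file
    if ! ls "$dir/pred$id"/*.cgp.gff >/dev/null 2>&1; then
      echo "$id"
    fi
  done
}
```

Run inside augCGP_rnaseq, `list_incomplete_chunks` prints nothing when every chunk produced output; for any listed ID, the corresponding err file (e.g. err3.out) is the first place to look.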

Note that the parallelization with the bash '&' operator above is very simple and intended for demonstration purposes only.
For real applications with several hundreds or thousands of alignment chunks, we recommend running job arrays on a compute cluster.
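Because each chunk is linked to a numeric job ID, the Augustus call can be wrapped in a function that takes only that ID, which is the shape a job array needs. The sketch below repeats the options from the loop above; run_chunk is a hypothetical wrapper, and the scheduler is left open.

```bash
# Sketch: run Augustus-cgp on a single alignment chunk, selected by job ID.
# run_chunk is a hypothetical wrapper; the options are copied from the
# loop in this step and the function must be called inside augCGP_rnaseq/.
run_chunk() {
  local id=$1
  augustus \
    --species=human \
    --softmasking=1 \
    --treefile=../tree.nwk \
    --alnfile="$id.maf" \
    --dbaccess=../vertebrates_rnaseq.db \
    --speciesfilenames=../genomes.tbl \
    --alternatives-from-evidence=0 \
    --dbhints=1 \
    --UTR=1 \
    --allow_hinted_splicesites=atac \
    --extrinsicCfgFile=../extrinsic-rnaseq.cfg \
    --/CompPred/outdir="pred$id" >"aug$id.out" 2>"err$id.out"
}
```

Assuming a SLURM cluster, for instance, a job script could call `run_chunk "$SLURM_ARRAY_TASK_ID"` and be submitted with `sbatch --array=1-N`, where N is the number of alignment chunks. Other schedulers provide an equivalent task index variable.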

5. Merge gene predictions from parallel runs

6. Upload gene predictions into the assembly hub

Convert the final gene predictions from gff to BED format and place each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.

for f in joined_pred/*.gff
do
  genome=$(basename "$f" .gff) # directory name, must match the genome name in the HAL alignment
  mkdir -p "$genome"
  gtf2bed.pl --itemRgb=34,139,34 <"$f" >"$genome/augCGP_rnaseq.bed"
done

Specify any RGB color you like for the track with option --itemRgb, e.g. 34,139,34.
The name of the current directory (i.e. augCGP_rnaseq) will be used as the track name in the browser.
Switch back to the main working directory data/

cd ..

and rerun the hal2assemblyHub.py script, including the gene tracks with option --bedDirs:

hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_rnaseq \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"

You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g. --bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...

Repeat 4. Load the hub and browse the alignment.