Combining annotation transfer and RNA-Seq-based prediction

In the typical clade annotation scenario, most genomes are supplemented with RNA-Seq data, whereas a few may represent model organisms for which high-quality annotations exist (e.g. human and mouse in the vertebrate clade). This tutorial describes how RNA-Seq evidence and evidence from existing annotations can be combined in Augustus-cgp.

1. Create a database with RNA-Seq and annotation hints

If you don't yet have a database with the genomes and RNA-Seq hints, follow the instructions in 1. Load RNA-Seq hints ...
to create the database vertebrates_rnaseq.db.

Make a copy of the database

cp vertebrates_rnaseq.db vertebrates_rnaseq+anno.db

and load the annotation hints from exercise 4.1 into the new database

load2sqlitedb --species=hg38 --dbaccess=vertebrates_rnaseq+anno.db refseq/hg38.hints.gff

You can check whether loading was successful with the following database query

sqlite3 -header -column vertebrates_rnaseq+anno.db "SELECT count(*) AS '#hints',typename,speciesname FROM 
(hints as H join featuretypes as F on H.type=F.typeid) natural join speciesnames group by speciesid,typename;"

which returns a summary of how many hints of each type are in the database for each species.
Your database should now contain both the RNA-Seq hints from vertebrates_rnaseq.db and the annotation hints from vertebrates_anno.db:

#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4    
129         intron      galGal4    
86          CDS         hg38       
7905        exonpart    hg38       
345         intron      hg38       
7930        exonpart    mm10       
378         intron      mm10       
11050       exonpart    rheMac3    
265         intron      rheMac3
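
As a cross-check of the hg38 rows, the annotation hints file itself can be tallied by feature type. The following is a minimal sketch, assuming the hints file is tab-separated GFF with the feature type in column 3; tally_hints is a hypothetical helper, not part of AUGUSTUS. The hg38 rows in the database also include the RNA-Seq hints, so the counts need not match for types that both sources contribute.

```shell
# tally_hints is a hypothetical helper, not part of AUGUSTUS.
# Assumption: tab-separated GFF with the feature type in column 3.
tally_hints() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print n[t] "\t" t }' "$1" | sort -k2,2
}

# Only run if the hints file from the tutorial is present.
if [ -f refseq/hg38.hints.gff ]; then
    tally_hints refseq/hg38.hints.gff
fi
```

Each output line holds a count and a feature type, which can be compared by eye with the hg38 rows of the summary above.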

3. Prepare an extrinsic config file

Start by copying the following extrinsic configuration file:

cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq+anno.cfg

Open the extrinsic-rnaseq+anno.cfg file with a text editor, go to the first [GROUP] section and replace the following line

[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none

by the names of the genomes with RNA-Seq hints, i.e.

[GROUP]
hg38 mm10 rheMac3 galGal4

Note that in our case nothing further has to be done, since the only genome with annotation hints, hg38, is already covered by the first [GROUP] section. In other applications, you may have genomes with annotations but no RNA-Seq data. In that case, the names of the genomes that have ONLY annotation hints must be listed in the second [GROUP] section.
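
The edit of the first [GROUP] section can also be scripted instead of done in a text editor. The following is a minimal sketch, assuming the placeholder 'none' stands alone on a line inside the first [GROUP] section, as in the excerpt above; set_first_group is a hypothetical helper, not part of AUGUSTUS. It leaves the [GROUP] header line (including its comment) and every later [GROUP] section untouched.

```shell
# set_first_group is a hypothetical helper, not part of AUGUSTUS.
# It replaces the line 'none' inside the FIRST [GROUP] section only.
set_first_group() {
    local cfg="$1" genomes="$2"
    awk -v g="$genomes" '
        /^\[GROUP\]/ { ngroup++ }                 # count [GROUP] headers
        ngroup == 1 && !replaced && /^[[:space:]]*none[[:space:]]*$/ {
            print g; replaced = 1; next           # swap in the genome names
        }
        { print }                                 # copy everything else
    ' "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}

# Only run if the copied configuration file is present.
if [ -f extrinsic-rnaseq+anno.cfg ]; then
    set_first_group extrinsic-rnaseq+anno.cfg "hg38 mm10 rheMac3 galGal4"
fi
```

Because the replacement is restricted to the first section, running the helper a second time does not alter the second [GROUP] section.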


4. Run AUGUSTUS-CGP with RNA-Seq and annotation hints

Create a new folder for the liftover experiments and switch to the new directory

mkdir augCGP_rnaseq+liftover
cd augCGP_rnaseq+liftover

For convenience, assign each alignment chunk to a job ID by creating soft links

num=1
for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done

Run Augustus with retrieval of hints from the database (~3min).

# Start one AUGUSTUS job per alignment chunk in the background
for aln in *.maf
do
    id=${aln%.maf}    # job ID, e.g. 1 for 1.maf
    augustus \
        --species=human \
        --softmasking=1 \
        --treefile=../tree.nwk \
        --alnfile="$aln" \
        --dbaccess=../vertebrates_rnaseq+anno.db \
        --speciesfilenames=../genomes.tbl \
        --alternatives-from-evidence=0 \
        --dbhints=1 \
        --UTR=1 \
        --allow_hinted_splicesites=atac \
        --extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
        --/CompPred/outdir="pred$id" > "aug$id.out" 2> "err$id.out" &
done
wait    # block until all background jobs have finished

This will generate the folders pred*/ (one for each alignment chunk) that contain gff files with gene predictions for each input genome.

bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff

Note that the parallelization with the bash '&' operator above is quite simple and intended for demonstration purposes only.
For real applications with several hundred or thousands of alignment chunks, we recommend running job arrays on a compute cluster.
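
As an illustration of the job-array approach, the following is a minimal sketch that assumes the SLURM scheduler, which is only one of several possible choices and is not prescribed by this tutorial. The array range 1-8 is a placeholder to be replaced by the actual number of alignment chunks, and run_chunk is a hypothetical wrapper around the augustus call shown above. Each array task processes the chunk whose number equals its SLURM_ARRAY_TASK_ID.

```shell
#!/bin/bash
#SBATCH --array=1-8   # placeholder range: set to the number of <id>.maf chunks

# run_chunk is a hypothetical wrapper around the augustus call shown above;
# it processes the single alignment chunk <id>.maf.
run_chunk() {
    local id="$1"
    augustus \
        --species=human \
        --softmasking=1 \
        --treefile=../tree.nwk \
        --alnfile="$id.maf" \
        --dbaccess=../vertebrates_rnaseq+anno.db \
        --speciesfilenames=../genomes.tbl \
        --alternatives-from-evidence=0 \
        --dbhints=1 \
        --UTR=1 \
        --allow_hinted_splicesites=atac \
        --extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
        --/CompPred/outdir="pred$id" > "aug$id.out" 2> "err$id.out"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; outside of SLURM
# nothing is run.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    run_chunk "$SLURM_ARRAY_TASK_ID"
fi
```

Saved under a name of your choice inside the augCGP_rnaseq+liftover directory, the script would be submitted from there with sbatch, so that the relative paths resolve as in the loop above.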

5. Merge gene predictions from parallel runs

6. Upload gene predictions into the assembly hub

Convert the final gene predictions from gff to BED format and place each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.

for f in joined_pred/*.gff
do
    genome="$(basename "$f" .gff)"    # directory name = genome name
    mkdir -p "$genome"
    gtf2bed.pl --itemRgb=255,165,0 < "$f" > "$genome/augCGP_rnaseq+anno.bed"
done

Specify any RGB color you like for the track with option --itemRgb, e.g. 255,165,0.
The name of the current directory (i.e. augCGP_rnaseq+liftover) will be used as the track name in the browser.
Switch back to the main working directory data/

cd ..

and rerun the hal2assemblyHub.py script. Include gene tracks with option --bedDirs

hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_rnaseq+liftover \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"

You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g. --bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...
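
The comma-separated list can be assembled from the track directories that actually exist, which avoids passing a directory that was never created. The following is a minimal sketch; join_by_comma is a hypothetical helper, and the candidate directory names are the ones mentioned in this tutorial.

```shell
# join_by_comma is a hypothetical helper: it joins its arguments with commas.
join_by_comma() {
    local IFS=,
    printf '%s\n' "$*"
}

# Keep only the track directories that exist in the working directory.
dirs=()
for d in refseq augCGP_denovo augCGP_rnaseq augCGP_liftover augCGP_rnaseq+liftover; do
    if [ -d "$d" ]; then
        dirs+=("$d")
    fi
done

# Print the option to pass to hal2assemblyHub.py, if any directory was found.
if [ "${#dirs[@]}" -gt 0 ]; then
    echo "--bedDirs $(join_by_comma "${dirs[@]}")"
fi
```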

Repeat 4. Load the hub and browse the alignment.