Annotation Liftover with Augustus-cgp

An increasingly important strategy in genome annotation is the transfer of annotations from well-annotated genomes to genomes of closely related species.
In this tutorial, we will show how Augustus-cgp can be utilized for this particular application by compiling the RefSeq annotation for human into 'CDS' and 'intron' hints.

1. Generate 'CDS' and 'intron' hints from annotations

Use the hg38 RefSeq annotation (in gtf format) and convert it into gff format by moving stop codons into the coding sequence and including intron lines.

grep -P "\t(CDS|stop_codon|start_codon)\t" refseq/refseq.hg38.gtf | gtf2gff.pl --printIntron --includeStopInCDS --out=refseq/refseq.hg38.gff


Grep all CDS and intron lines and replace the last column of the GFF file with the manual hint source (source=M):

grep -P "\t(CDS|intron)\t" refseq/refseq.hg38.gff | cut -f1-8 | perl -pe 's/$/\tsource=M/' >refseq/hg38.hints.gff
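To make the effect of this filter concrete, here is a minimal sketch on three toy GFF lines. The coordinates and the transcript name NM_toy are invented for illustration, and awk stands in for the grep/cut/perl pipeline only to keep the example portable; on this input it produces the same kind of output.

```shell
# Toy input (hypothetical coordinates): one CDS, one intron, one exon line.
tmpdir=$(mktemp -d)
printf 'chr1\tRefSeq\tCDS\t100\t200\t.\t+\t0\ttranscript_id "NM_toy"\n'     > "$tmpdir/toy.gff"
printf 'chr1\tRefSeq\tintron\t201\t300\t.\t+\t.\ttranscript_id "NM_toy"\n' >> "$tmpdir/toy.gff"
printf 'chr1\tRefSeq\texon\t100\t200\t.\t+\t.\ttranscript_id "NM_toy"\n'   >> "$tmpdir/toy.gff"

# Keep only CDS and intron features, keep columns 1-8, and write the
# hint source into column 9.
toy_hints=$(awk -F'\t' -v OFS='\t' \
  '$3 == "CDS" || $3 == "intron" { print $1, $2, $3, $4, $5, $6, $7, $8, "source=M" }' \
  "$tmpdir/toy.gff")

printf '%s\n' "$toy_hints"
rm -r "$tmpdir"
```

The exon line is dropped, and both remaining lines end in source=M.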

Finally, merge identical hints into a single hint that records their multiplicity:

sort -n -k 4,4 refseq/hg38.hints.gff | sort -s -n -k 5,5 | sort -s -n -k 3,3 | sort -s -k 1,1 | join_mult_hints.pl >temp
mv temp refseq/hg38.hints.gff
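The idea behind this step can be illustrated without join_mult_hints.pl. The sketch below is not that script and does not reproduce its output format; it only counts how often each identical toy hint line occurs, which is the multiplicity that the merged hint carries. All coordinates are hypothetical.

```shell
# Three toy hints (hypothetical coordinates); the first two are identical.
toy_hints=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 RefSeq CDS    100 200 . + 0 source=M \
  chr1 RefSeq CDS    100 200 . + 0 source=M \
  chr1 RefSeq intron 201 300 . + . source=M)

# Count identical lines, then print: multiplicity, feature type, start, end.
collapsed=$(printf '%s\n' "$toy_hints" | sort | uniq -c |
  awk '{ print $1 "\t" $4 "\t" $5 "\t" $6 }')

printf '%s\n' "$collapsed"
```

The duplicated CDS hint is reported once with a count of 2, the intron hint with a count of 1.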

2. Load annotation hints into the database

If you don't yet have a database with the genomes, follow the instructions in 1. Load genomes ... to create the database vertebrates.db.

Make a copy of the database

cp vertebrates.db vertebrates_anno.db

and load the annotation hints into the new database

load2sqlitedb --species=hg38 --dbaccess=vertebrates_anno.db refseq/hg38.hints.gff

You can check whether loading was successful with the following database query

sqlite3 -header -column vertebrates_anno.db "SELECT count(*) AS '#hints',typename,speciesname FROM 
(hints as H join featuretypes as F on H.type=F.typeid) natural join speciesnames group by speciesid,typename;"

that returns a summary of how many hints of each type are in the database for each species.

#hints      typename    speciesname
----------  ----------  -----------
86          CDS         hg38       
78          intron      hg38
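As a cross-check, the per-type counts can also be computed directly from the hints file and compared with the query result. This is a sketch under the assumption that each line of the hints file ends up as one hint in the database; the helper name count_hints_by_type and the toy file are made up for illustration.

```shell
# Hypothetical helper: print "<type><TAB><count>" for a GFF hints file.
count_hints_by_type() {
  # Column 3 of a GFF file holds the feature type (CDS, intron, ...).
  cut -f3 "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Toy hints file (hypothetical coordinates) with two CDS hints and one intron hint.
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 RefSeq CDS    100 200 . + 0 source=M \
  chr1 RefSeq CDS    301 400 . + 2 source=M \
  chr1 RefSeq intron 201 300 . + . source=M > "$tmpdir/toy.hints.gff"

counts=$(count_hints_by_type "$tmpdir/toy.hints.gff")
printf '%s\n' "$counts"
rm -r "$tmpdir"
```

In the tutorial setting the call would be count_hints_by_type refseq/hg38.hints.gff.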

3. Prepare an extrinsic config file

Start by copying the following extrinsic configuration file:

cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-anno.cfg

Open the extrinsic-anno.cfg file with a text editor, go to the second [GROUP] section and replace the following line

[GROUP] # replace 'none' by the names of genomes with src=M hints in the database
none

as instructed, with the names of the genomes that have annotation hints, i.e.

[GROUP] # replace 'none' by the names of genomes with src=M hints in the database
hg38
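If you prefer not to edit the file by hand, the same replacement can be scripted. The following is a minimal sketch on a toy stand-in file: its contents are hypothetical and much shorter than the real extrinsic-cgp.cfg, and the awk rule assumes that the line to replace consists of the single word none.

```shell
tmpdir=$(mktemp -d)

# Hypothetical stand-in for extrinsic-anno.cfg with two [GROUP] sections.
cat > "$tmpdir/toy.cfg" <<'EOF'
[GROUP] # first group (toy content)
all
[GROUP] # replace 'none' by the names of genomes with src=M hints in the database
none
EOF

# Count [GROUP] headers; inside the second section, replace a line that is
# exactly 'none' with 'hg38'. All other lines are printed unchanged.
awk '/^\[GROUP\]/ { n++ } n == 2 && $0 == "none" { $0 = "hg38" } { print }' \
  "$tmpdir/toy.cfg" > "$tmpdir/toy-anno.cfg"

edited=$(cat "$tmpdir/toy-anno.cfg")
printf '%s\n' "$edited"
rm -r "$tmpdir"
```

Writing to a new file rather than editing in place keeps the original configuration available for comparison.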


4. Run AUGUSTUS-CGP with annotation hints

Create a new folder for the liftover experiments and switch to the new directory

mkdir augCGP_liftover
cd augCGP_liftover

For convenience, assign each alignment chunk to a job ID by creating symbolic links

num=1
for f in ../mafs/*.maf; do ln -s "$f" "$num.maf"; ((num++)); done

Run Augustus with retrieval of hints from the database (~3min).

for id in *.maf
do
augustus \
--species=human \
--softmasking=1 \
--treefile=../tree.nwk \
--alnfile=$id \
--dbaccess=../vertebrates_anno.db \
--speciesfilenames=../genomes.tbl \
--alternatives-from-evidence=0 \
--dbhints=1 \
--allow_hinted_splicesites=atac \
--extrinsicCfgFile=../extrinsic-anno.cfg \
--/CompPred/outdir=pred${id%.maf} > aug${id%.maf}.out 2> err${id%.maf}.out &
done

This will generate the folders pred*/ (one for each alignment chunk) that contain gff files with gene predictions for each input genome.

bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff

Note that the parallelization with the bash '&' operator above is deliberately simple and intended for demonstration purposes only: it starts all jobs at once.
For real applications with several hundred or thousands of alignment chunks, we recommend running job arrays on a compute cluster.
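On a single machine without a scheduler, a middle ground is to cap the number of jobs that run at the same time. The sketch below does this with xargs -P (available in GNU and BSD xargs). The script run_chunk.sh is a hypothetical stand-in that only writes a marker file; in practice it would wrap the augustus call shown above for one chunk ID.

```shell
tmpdir=$(mktemp -d)

# Hypothetical stand-in for one Augustus run on chunk $1; it only writes
# a marker file into directory $2 so that the sketch runs anywhere.
cat > "$tmpdir/run_chunk.sh" <<'EOF'
#!/bin/sh
echo "processed chunk $1" > "$2/aug$1.out"
EOF
chmod +x "$tmpdir/run_chunk.sh"

# Feed the job IDs to xargs and run at most 2 jobs concurrently.
# xargs returns only after all jobs have finished.
printf '%s\n' 1 2 3 4 | xargs -P 2 -I{} "$tmpdir/run_chunk.sh" {} "$tmpdir"

# Count the marker files to confirm that every chunk was processed.
finished=$(ls "$tmpdir" | grep -c '^aug[0-9]*\.out$')
echo "finished jobs: $finished"
rm -r "$tmpdir"
```

Unlike the '&' loop, this blocks until all chunks are done, so later steps can follow directly in the same script.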

5. Merge gene predictions from parallel runs

6. Upload gene predictions into the assembly hub

Convert the final gene predictions from gff to BED format and place each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.

for f in joined_pred/*.gff
do
name=$(basename "$f" .gff)
mkdir -p "$name"
gtf2bed.pl --itemRgb=0,0,225 < "$f" > "$name/augCGP_liftover.bed"
done

Specify any RGB color you like for the track with option --itemRgb, e.g. 0,0,225.
The name of the current directory (i.e. augCGP_liftover) will be used as track name on the browser.
Switch back to the main working directory data/

cd ..

and rerun the hal2assemblyHub.py script. Include gene tracks with option --bedDirs

hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_liftover \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"

You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g. --bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...

Repeat 4. Load the hub and browse the alignment.