De novo comparative gene prediction with Augustus-cgp

In this tutorial, we run Augustus-cgp in its most basic form: de novo. That means the only inputs are the raw genomes and the alignment of the genomes; no extrinsic evidence, e.g. from transcriptome data, is used. In most applications, however, RNA-Seq data will be available, and incorporating them is recommended to improve accuracy.

1. Load genomes into an SQLite database

Prepare a tab-separated text file that lists each species name and the location of the corresponding genome FASTA file.

for f in "$PWD"/genomes/*.fa; do printf '%s\t%s\n' "$(basename "$f" .fa)" "$f"; done >genomes.tbl
The file genomes.tbl will now look like this (except for the parent directory):
bosTau8 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/bosTau8.fa
canFam3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/canFam3.fa
galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/galGal4.fa
hg38    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/hg38.fa
mm10    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/mm10.fa
monDom5 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/monDom5.fa
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rheMac3.fa
rn6     /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rn6.fa
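Since every later step reads this table, a malformed line or a wrong path is easiest to catch now. The following is a minimal sketch of such a check. The function name check_genomes_tbl is hypothetical and not part of Augustus; it only assumes the two-column, tab-separated layout shown above.

```shell
# Hypothetical helper (not part of Augustus): verify that every line of the
# table has exactly two tab-separated fields and that the FASTA file is readable.
check_genomes_tbl() {
  local tbl=$1 status=0 lineno=0 species genome extra
  # The "|| [ -n ... ]" part also handles a last line without trailing newline.
  while IFS=$'\t' read -r species genome extra || [ -n "$species" ]; do
    lineno=$((lineno + 1))
    if [ -z "$species" ] || [ -z "$genome" ] || [ -n "$extra" ]; then
      echo "line $lineno: expected <species><TAB><path>" >&2
      status=1
    elif [ ! -r "$genome" ]; then
      echo "line $lineno: cannot read $genome" >&2
      status=1
    fi
  done < "$tbl"
  return $status
}

# Usage: check_genomes_tbl genomes.tbl && echo "genomes.tbl looks fine"
```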
Load the genomes into the SQLite database vertebrates.db:
while read line
do
  species=$(echo "$line" | cut -f 1)
  genome=$(echo "$line" | cut -f 2)
  load2sqlitedb --noIdx --species="$species" --dbaccess=vertebrates.db "$genome"
done <genomes.tbl
Finalize the database by creating indices on the tables:
load2sqlitedb --makeIdx --dbaccess=vertebrates.db
You can check whether loading was successful with the following database query:
sqlite3 -header -column vertebrates.db "\
 SELECT speciesname, \
  sum(end-start+1) AS 'genome length',\
  count(*) AS '# chunks',\
  count(distinct seqnr) AS '# seqs'\
 FROM genomes natural join speciesnames\
 GROUP BY speciesname;"
It returns a summary of the genomes in the database, e.g. their sizes (total number of bases per genome) and the number of sequences per genome:
speciesname  genome length  # chunks    # seqs    
-----------  -------------  ----------  ----------
bosTau8      156091         4           1         
canFam3      184728         4           1         
galGal4      149999         3           1         
hg38         210155         5           1         
mm10         178393         4           1         
monDom5      540519         11          1         
rheMac3      220640         5           1         
rn6          99944          2           1         
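To cross-check the 'genome length' column, the total number of bases can also be counted directly in each FASTA file. The sketch below assumes that the loaded chunks cover each sequence completely and without overlap, so that the two numbers should agree. fasta_length is a hypothetical helper, not an Augustus tool.

```shell
# Hypothetical helper: count the bases in a FASTA file by summing the length
# of every non-header line (header lines start with '>').
fasta_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
       END   { printf "%.0f\n", total }' "$1"
}

# Usage sketch: print species name and FASTA length for each genome in genomes.tbl.
# while IFS=$'\t' read -r species genome; do
#   printf '%s\t%s\n' "$species" "$(fasta_length "$genome")"
# done < genomes.tbl
```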

2. Run AUGUSTUS in CGP mode

Create a new folder for the de novo experiments and switch to the new directory:
mkdir augCGP_denovo
cd augCGP_denovo
For convenience, assign each alignment chunk a job ID by creating softlinks:
num=1
for f in ../mafs/*.maf; do ln -s "$f" "$num.maf"; ((num++)); done
Run Augustus in cgp mode on all alignment chunks in parallel (~3 min).
Use the option --softmasking=1 if the genomes are soft-masked.
The parameter --species is the same as in the standard version of Augustus. Choose the species identifier that best represents your clade
(e.g. --species=human for mammalian clades, --species=fly for fly clades, --species=chicken for bird clades, ...).
A complete list of species identifiers can be found in the directory augustus/config/species.
for ali in *.maf; do
  id=${ali%.maf} # remove .maf suffix
  augustus \
    --species=human \
    --softmasking=1 \
    --treefile=../tree.nwk \
    --alnfile="$ali" \
    --dbaccess=../vertebrates.db \
    --speciesfilenames=../genomes.tbl \
    --alternatives-from-evidence=0 \
    --/CompPred/outdir="pred$id" > "aug$id.out" 2> "err$id.out" &
done
wait # block until all background jobs have finished
This generates the folders pred*/ (one per alignment chunk), each containing GFF files with gene predictions for each input genome:
bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
Note that the parallelization with the bash '&' operator above is deliberately simple and intended for demonstration purposes: it starts all jobs at once.
For real applications with several hundred or several thousand alignment chunks, we recommend running job arrays on a compute cluster.
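As a sketch of the job-array approach, the loop body can be wrapped in a function that processes a single chunk, selected by its job ID. The scheduler directives below assume SLURM, which this tutorial does not prescribe; run_chunk and the job name are hypothetical, and the array range 1-8 is a placeholder that has to be set to the number of .maf files.

```shell
#!/bin/bash
# Assumption: SLURM is the scheduler. The range is a placeholder.
#SBATCH --array=1-8
#SBATCH --job-name=augCGP

# Hypothetical wrapper: run Augustus in cgp mode on the alignment chunk <id>.maf
# with the same options as in the loop above.
run_chunk() {
  local id=$1
  augustus \
    --species=human \
    --softmasking=1 \
    --treefile=../tree.nwk \
    --alnfile="$id.maf" \
    --dbaccess=../vertebrates.db \
    --speciesfilenames=../genomes.tbl \
    --alternatives-from-evidence=0 \
    --/CompPred/outdir="pred$id" > "aug$id.out" 2> "err$id.out"
}

# Under SLURM, each array task handles the chunk that matches its task ID.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  run_chunk "$SLURM_ARRAY_TASK_ID"
fi
```

Because the softlinks are named 1.maf, 2.maf, ..., the task ID maps directly to one alignment chunk, and the script would be submitted from within augCGP_denovo/.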

3. Merge gene predictions from parallel runs

mkdir joined_pred
while read line
do
  species=$(echo "$line" | cut -f 1)
  find pred* -name "${species}.cgp.gff" >"${species}_gtfs.lst"
  joingenes -f "${species}_gtfs.lst" -o "joined_pred/$species.gff"
done < ../genomes.tbl
This creates the folder joined_pred/ with the final gene predictions for each input genome:
bosTau8.gff
canFam3.gff
galGal4.gff
hg38.gff
mm10.gff
monDom5.gff
rheMac3.gff
rn6.gff
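A quick plausibility check on the merged files is to count the distinct transcripts per species. The sketch below assumes that the ninth column carries GTF-style attributes of the form transcript_id "..."; count_transcripts is a hypothetical helper, not an Augustus tool.

```shell
# Hypothetical helper: count distinct transcript_id values in a GTF/GFF file.
# Assumes tab-separated columns with the attributes in column 9.
count_transcripts() {
  awk -F'\t' '
    !/^#/ && match($9, /transcript_id "[^"]*"/) {
      seen[substr($9, RSTART, RLENGTH)] = 1
    }
    END { n = 0; for (k in seen) n++; print n }
  ' "$1"
}

# Usage sketch: one count per merged gene set.
# for f in joined_pred/*.gff; do
#   printf '%s\t%s\n' "$(basename "$f" .gff)" "$(count_transcripts "$f")"
# done
```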

4. Upload gene predictions into the assembly hub

Convert the final gene predictions from GFF to BED format and place each BED file in a separate folder named after the corresponding genome. The directory names must be consistent with the genome names in the HAL alignment.
for f in joined_pred/*.gff
do
  name=$(basename "$f" .gff)
  mkdir -p "$name"
  gtf2bed.pl --itemRgb=225,0,0 <"$f" >"$name/augCGP_denovo.bed"
done
You can specify any RGB color for the track with the option --itemRgb, e.g. 225,0,0.
The name of the current directory (i.e. augCGP_denovo) will be used as the track name in the browser.
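Because the hub is built from one directory per genome, it can help to confirm that every species has its BED file before rebuilding the hub. The sketch below checks the directories against the species names in genomes.tbl, assuming these are the same names as the genomes in the HAL alignment. check_bed_dirs is a hypothetical helper.

```shell
# Hypothetical helper: confirm that each species in the table has a directory
# containing a non-empty BED file with the given name.
check_bed_dirs() {
  local tbl=$1 bed=$2 status=0 species rest
  while IFS=$'\t' read -r species rest; do
    [ -n "$species" ] || continue   # skip blank lines
    if [ ! -s "$species/$bed" ]; then
      echo "missing or empty: $species/$bed" >&2
      status=1
    fi
  done < "$tbl"
  return $status
}

# Usage, from within augCGP_denovo/:
# check_bed_dirs ../genomes.tbl augCGP_denovo.bed
```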
Switch back to the main working directory data/
cd ..
and rerun the hal2assemblyHub.py script, including the gene tracks with the option --bedDirs:
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_denovo \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g. --bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...

Repeat 4. Load the hub and browse the alignment.