Combining Annotation transfer and RNA-Seq-based prediction
In the typical clade annotation scenario, most genomes are supplemented with RNA-Seq data, whereas a few may represent model organisms for which high-quality annotations exist (e.g. human and mouse in the vertebrate clade). This tutorial describes how RNA-Seq evidence and evidence from existing annotations can be combined in Augustus-cgp.
1. Create a database with RNA-Seq and annotation hints
If you don't have a database with the genomes and RNA-Seq hints yet, follow the instructions in
1. Load RNA-Seq hints ...
to create the database vertebrates_rnaseq.db.
Make a copy of the database
cp vertebrates_rnaseq.db vertebrates_rnaseq+anno.db
and load the annotation hints from
exercise 4.1 into the new database
load2sqlitedb --species=hg38 --dbaccess=vertebrates_rnaseq+anno.db refseq/hg38.hints.gff
You can check whether loading was successful with the following
database query
sqlite3 -header -column vertebrates_rnaseq+anno.db "SELECT count(*) AS '#hints',typename,speciesname FROM
(hints as H join featuretypes as F on H.type=F.typeid) natural join speciesnames group by speciesid,typename;"
that returns a summary of how many hints of each type are in the database for each species.
Your database should now contain both RNA-Seq hints from vertebrates_rnaseq.db and annotation hints from vertebrates_anno.db:
#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4
129         intron      galGal4
86          CDS         hg38
7905        exonpart    hg38
345         intron      hg38
7930        exonpart    mm10
378         intron      mm10
11050       exonpart    rheMac3
265         intron      rheMac3
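As a quick sanity check, the summary can be condensed further. The following is a minimal sketch, assuming the three-column layout printed by the query above (two header lines, then one row per hint type and species). The function names hints_per_species and species_with_type are hypothetical helpers, not part of AUGUSTUS.

```shell
# Sum the '#hints' column per species from the summary table on stdin.
# Assumes two header lines followed by rows of: count, type name, species name.
hints_per_species() {
    awk 'NR > 2 && NF == 3 { total[$3] += $1 }
         END { for (s in total) print s, total[s] }' | sort
}

# List the species that have at least one row of the given hint type.
species_with_type() {
    awk -v type="$1" 'NR > 2 && NF == 3 && $2 == type { print $3 }' | sort -u
}
```

Piping the output of the sqlite3 query into species_with_type CDS should list only the genomes for which annotation hints were loaded, which is hg38 in this tutorial.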
3. Prepare an extrinsic config file
Start by copying the following extrinsic configuration file:
cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq+anno.cfg
Open the extrinsic-rnaseq+anno.cfg file with a text editor,
go to the first [GROUP] section and replace the following line
[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none
by the names of genomes with annotation or RNA-Seq hints, i.e.
[GROUP]
hg38 mm10 rheMac3 galGal4
Note that in our case nothing further has to be done, since the only genome with annotation hints, hg38,
is already covered by the first table. In other applications, you may have genomes with annotations but
no RNA-Seq data. In that case, the names of the genomes that have ONLY annotation hints must be listed in the
second [GROUP] section.
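If you prefer to script the edit instead of using a text editor, it could be sketched as follows. This assumes that the placeholder 'none' follows the first [GROUP] header, as in the excerpt above. The function name set_first_group is a hypothetical helper, and the trailing comment on the header line is dropped.

```shell
# Hypothetical helper: put a list of genome names on the line that follows the
# first [GROUP] header, replacing the placeholder 'none'.
# Later [GROUP] sections are left untouched.
set_first_group() {
    cfg=$1
    genomes=$2
    awk -v genomes="$genomes" '
        /^\[GROUP\]/ && !seen { seen = 1; pending = 1; print "[GROUP]"; next }
        pending && $1 == "none" { print genomes; pending = 0; next }
        { print }
    ' "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}
```

For this tutorial the call would be set_first_group extrinsic-rnaseq+anno.cfg "hg38 mm10 rheMac3 galGal4". Check the result in an editor afterwards, since the layout of your config file may differ from the one assumed here.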
Format of the extrinsic.cfg file in cgp mode
In cgp mode, hints can be integrated for multiple species.
To allow different extrinsic config settings for different species,
multiple [GENERAL] tables are specified. Each table is followed by a [GROUP] section,
a single line listing the subset of species for which the table is valid.
Use the same species identifiers as in the genome alignment and in the phylogenetic tree.
If a species is not assigned to any of the tables, all hints for that species are
ignored. To assign all species to a single table, the key 'all' can be used instead of listing
every single species identifier. Use the key 'other' to specify a table for all species not
listed in any previous table.
Note that the source RM must be specified if the softmasking option is turned on.
Also note that all tables have the same dimension, i.e. each table must contain all sources
listed in the section [SOURCES], even sources for which no hints exist for any of the species
in the group.
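Because hints for unassigned species are silently ignored, it can be worth checking the config file before a run. The following is a minimal sketch, assuming each [GROUP] header is followed by a single line of species names as described above. The function names group_species and unassigned_species are hypothetical helpers.

```shell
# Print every name listed on the line after a [GROUP] header, one per line.
group_species() {
    awk '
        /^\[GROUP\]/ { grab = 1; next }
        grab {
            sub(/#.*/, "")          # drop trailing comments
            if (NF == 0) next       # skip blank or comment-only lines
            for (i = 1; i <= NF; i++) print $i
            grab = 0
        }' "$1" | sort -u
}

# Print those species from the argument list that no [GROUP] line mentions.
unassigned_species() {
    cfg=$1
    shift
    listed=$(group_species "$cfg")
    # 'all' and 'other' are catch-all keys, so no species is left unassigned.
    if printf '%s\n' "$listed" | grep -qxF -e all -e other; then
        return 0
    fi
    for sp in "$@"; do
        printf '%s\n' "$listed" | grep -qxF -- "$sp" || echo "$sp"
    done
}
```

Calling unassigned_species extrinsic-rnaseq+anno.cfg hg38 mm10 rheMac3 galGal4 should print nothing if every genome with hints is covered by a table.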
4. Run AUGUSTUS-CGP with RNA-Seq and annotation hints
Create a new folder for the liftover experiments and
switch to the new directory
mkdir augCGP_rnaseq+liftover
cd augCGP_rnaseq+liftover
For convenience assign each alignment chunk to a job ID by
creating softlinks
num=1
for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done
Run Augustus with retrieval of hints from the
database (~3min).
for id in *.maf
do
augustus \
--species=human \
--softmasking=1 \
--treefile=../tree.nwk \
--alnfile=$id \
--dbaccess=../vertebrates_rnaseq+anno.db \
--speciesfilenames=../genomes.tbl \
--alternatives-from-evidence=0 \
--dbhints=1 \
--UTR=1 \
--allow_hinted_splicesites=atac \
--extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
--/CompPred/outdir=pred${id%.maf} > aug${id%.maf}.out 2> err${id%.maf}.out &
done
This will generate the folders pred*/ (one for each alignment chunk)
that contain gff files with gene predictions for each input genome.
bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
Note that the parallelization with the bash '&' operator above is quite simple and mainly for demonstration purposes: it starts all jobs at once.
For real applications with several hundred or several thousand alignment chunks, we recommend
running job arrays on a compute cluster.
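On a single machine without a job scheduler, a middle ground is to cap the number of simultaneous jobs. The following is a minimal sketch of that idea and not a replacement for a job array. The functions run_chunk and run_in_batches are hypothetical helpers, and run_chunk is only a stand-in whose body you would replace with the augustus call from the loop above.

```shell
# Stand-in for the per-chunk augustus call (hypothetical helper).
# Replace the body with the real command; here it only records that it ran.
run_chunk() {
    id=$1
    echo "chunk $id finished" > "chunk$id.log"
}

# Start at most max_jobs chunks at a time, then wait for the whole batch.
run_in_batches() {
    max_jobs=$1
    shift
    running=0
    for maf in "$@"; do
        run_chunk "${maf%.maf}" &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait            # block until every job of this batch has exited
            running=0
        fi
    done
    wait                    # collect the last, possibly partial, batch
}
```

A call such as run_in_batches 4 *.maf would process the chunks four at a time. A batch only ends when its slowest job finishes, so a scheduler will usually make better use of the available cores.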
6. Upload gene predictions into the assembly hub
Convert the final gene predictions from gff to BED format and place
each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.
for f in joined_pred/*.gff
do
mkdir "$(basename "$f" .gff)"
gtf2bed.pl <"$f" >"$(basename "$f" .gff)/augCGP_rnaseq+anno.bed" --itemRgb=255,165,0
done
Specify any RGB color you like for the track with option --itemRgb, e.g. 255,165,0.
The name of the current directory (i.e. augCGP_rnaseq+liftover) will be used as track name on the browser.
Switch back to the main working directory data/
cd ..
and rerun the hal2assemblyHub.py script. Include gene tracks with option --bedDirs
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_rnaseq+liftover \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g.
--bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...
Repeat 4. Load the hub and browse the alignment.