
Cross-species comparison of gene sets

Consistent gene sets within a clade are important, in particular when studying genomic differences between species.
We developed the tool homGeneMapping to evaluate the consistency of gene sets across species. The tool uses a whole-genome alignment of the species to map coordinates between genomes. In addition, accuracy can be assessed by incorporating RNA-Seq evidence and evaluating how well each gene structure agrees with it.

1. Run homGeneMapping to map coordinates between genomes

Prepare a text file with a list of genome names and locations of the corresponding input gene files in GTF format.


for f in "$PWD"/augCGP_rnaseq+liftover/joined_pred/*.gff; do printf '%s\t%s\n' "$(basename "$f" .gff)" "$f"; done > gtfs.tbl
The file gtfs.tbl will now look like this (apart from the parent directory):
bosTau8 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/bosTau8.gff
canFam3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/canFam3.gff
galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/galGal4.gff
hg38    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/hg38.gff
mm10    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/mm10.gff
monDom5 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/monDom5.gff
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/rheMac3.gff
rn6     /var/www/augustus/htdocs/binaries/tutorial-cgp/data/augCGP_rnaseq+liftover/joined_pred/rn6.gff
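Before running homGeneMapping, it can help to confirm that the table is well-formed. The following is a minimal sketch, assuming one tab-separated "genome name, path" pair per line as written by the loop above. The function name check_gtfs_tbl is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper, not part of AUGUSTUS: sanity-check a genome/GTF table.
# Assumes one "<genome name><TAB><path>" entry per line.
# Reports each problem on stderr and fails if any line is malformed or unreadable.
check_gtfs_tbl() (
    tab=$(printf '\t')
    status=0
    lineno=0
    # "|| [ -n "$name" ]" also handles a last line without a trailing newline
    while IFS=$tab read -r name path extra || [ -n "$name" ]; do
        lineno=$((lineno + 1))
        if [ -z "$name" ] || [ -z "$path" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected exactly two tab-separated fields" >&2
            status=1
        elif [ ! -r "$path" ]; then
            echo "line $lineno: cannot read $path" >&2
            status=1
        fi
    done < "$1"
    [ "$status" -eq 0 ]
)
```

It could be called as check_gtfs_tbl gtfs.tbl; a non-zero exit status indicates that at least one line needs attention.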
Run homGeneMapping as follows:
homGeneMapping --halfile=vertebrates.hal --gtfs=gtfs.tbl --cpus=8 --outdir=hgm --dbaccess=vertebrates_rnaseq+anno.db --details
When a database is passed with the parameter --dbaccess, 'CDS' and 'intron' hints are imported from the database, and the output reports which and how many of a transcript's CDS exons and introns are supported by hints.

2. homGeneMapping output explained

The folder hgm/ contains one GTF file for each of the input GTF files:
-rw-r--r--  1 stefanie bioinf  55K Sep 21 12:27 bosTau8.gtf
-rw-r--r--  1 stefanie bioinf  55K Sep 21 12:27 canFam3.gtf
-rw-r--r--  1 stefanie bioinf  42K Sep 21 12:27 galGal4.gtf
-rw-r--r--  1 stefanie bioinf  63K Sep 21 12:27 hg38.gtf
-rw-r--r--  1 stefanie bioinf  51K Sep 21 12:27 mm10.gtf
-rw-r--r--  1 stefanie bioinf  54K Sep 21 12:27 monDom5.gtf
-rw-r--r--  1 stefanie bioinf  54K Sep 21 12:27 rheMac3.gtf
-rw-r--r--  1 stefanie bioinf  22K Sep 21 12:27 rn6.gtf
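To confirm that every genome in the input table received an output file, a small check along the following lines may be useful. This is a sketch that assumes the outputs are named after the genome with a .gtf suffix, as in the listing above. The function name missing_hgm_outputs is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper, not part of AUGUSTUS: list genomes without an output file.
# $1: genome/GTF table (tab-separated), $2: homGeneMapping output directory.
# Assumes outputs are named <outdir>/<genome name>.gtf.
missing_hgm_outputs() (
    tab=$(printf '\t')
    while IFS=$tab read -r name _path || [ -n "$name" ]; do
        [ -n "$name" ] || continue              # skip blank lines
        # -s: the file exists and is not empty
        [ -s "$2/$name.gtf" ] || echo "$name"
    done < "$1"
)
```

For example, missing_hgm_outputs gtfs.tbl hgm prints nothing when all expected files are present and non-empty.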
Each of the files starts with a header
# 0     bosTau8
# 1     canFam3
# 2     galGal4
# 3     hg38
# 4     mm10
# 5     monDom5
# 6     rheMac3
# 7     rn6
###
that assigns each genome name a unique ID number, which is used as a shorthand. The header is followed by a list of transcripts. The output for each transcript T can be split into three sections:

[1] extended gtf format...
[2] info on gene feature level...
[3] info on transcript level...

Sections 2 and 3 are printed only if the --details option is set.
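The genome IDs assigned in the header can be looked up with a short awk call. The sketch below assumes the header layout shown above, that is, lines of the form "# ID name" terminated by a "###" line. The function name hgm_header is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper, not part of AUGUSTUS: print "<ID><TAB><genome name>" pairs
# from the header of a homGeneMapping output file.
# Assumes header lines of the form "# <ID> <name>", closed by a "###" line.
hgm_header() {
    awk '
        /^###/ { exit }                          # end of header reached
        /^#/ && NF == 3 { print $2 "\t" $3 }     # fields: "#", ID, genome name
    ' "$1"
}
```

For example, hgm_header hgm/hg38.gtf would list one ID and genome name per line, which can help when reading the per-transcript sections.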