Predicting Genes with AUGUSTUS

chr16   AUGUSTUS        transcript      93718   96048   .       +       .       jg6.t1
chr16   AUGUSTUS        CDS     93718   93790   1.0     +       0       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,5"; 
chr16   AUGUSTUS        exon    93718   93790   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3,5"; 
chr16   AUGUSTUS        start_codon     93718   93720   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; 
chr16   AUGUSTUS        intron  93791   94782   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3E:M-6,5"; 
chr16   AUGUSTUS        CDS     94783   94962   1.0     +       2       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,0,1,2,5,6"; 
chr16   AUGUSTUS        exon    94783   94962   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3,0,1,2,5,6"; 
chr16   AUGUSTUS        intron  94963   95061   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3E:M-6,0,1,2E-2335,5,6E-53"; 
chr16   AUGUSTUS        CDS     95062   95145   1.0     +       2       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,0,2,5,6"; 
chr16   AUGUSTUS        exon    95062   95145   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3,0,2,5,6"; 
chr16   AUGUSTUS        intron  95146   95333   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3E:M-29,0,2E-1908,5,6E-65"; 
chr16   AUGUSTUS        CDS     95334   95410   1.0     +       2       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,0,1,2,5,6"; 
chr16   AUGUSTUS        exon    95334   95410   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3,0,1,2,5,6"; 
chr16   AUGUSTUS        intron  95411   95503   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3E:M-28,0,1,2E-1877,5,6E*-108"; 
chr16   AUGUSTUS        CDS     95504   95567   1.0     +       0       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,0,1,2,5"; 
chr16   AUGUSTUS        exon    95504   95567   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3,0,1,2,5"; 
chr16   AUGUSTUS        intron  95568   95651   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3E:M-18,0,1,2E-1823,5"; 
chr16   AUGUSTUS        exon    95652   96048   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3"; 
chr16   AUGUSTUS        CDS     95652   95851   1.0     +       2       transcript_id "jg6.t1"; gene_id "jg6"; hgm_info "3M-1,0,1,2,5"; 
chr16   AUGUSTUS        stop_codon      95849   95851   0.0     +       .       transcript_id "jg6.t1"; gene_id "jg6";

In the extended GTF format, the CDS, exon and intron features of the transcript T have an additional attribute 'hgm_info' in the last column. It encodes a comma-separated list of tuples, each consisting of a genome name, the sources of evidence and a multiplicity, e.g. the tuple

2E-1908

encodes genome_name=galGal4 (see header), source=E and mult=1908. The 'hgm_info' string of the second intron

hgm_info "3E:M-6,0,1,2E-2335,5,6E-53";

states that the intron is consistent with an intron in the gene sets of species 3 (=hg38), 0 (=bosTau8), 1 (=canFam3), 2 (=galGal4), 5 (=monDom5) and 6 (=rheMac3). An intron/exon of species A is considered consistent with an intron/exon of species B if both boundaries are aligned. CDS exons must additionally be in the same reading frame to be consistent with one another. In species 3, 2 and 6 the intron is supported by RNA-Seq splice junctions (source E) with multiplicities 6, 2335 and 53, respectively. In species 3 there is additional evidence for the intron from an existing annotation (source M). If a source is followed by '*', the gene feature is not present in that particular species, although there is evidence for it, e.g.

hgm_info "...,6E*-108";

means that the gene feature is not present in the gene set of species 6, but has RNA-Seq support. This is a sign of a possible false negative in species 6, in particular if the gene feature is present and has strong evidence in many of the other species.
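The tuple notation described above can be split into its parts with a few lines of Python. This is a minimal sketch, not part of AUGUSTUS: the grammar is inferred from the examples shown here, and the names `HgmEntry` and `parse_hgm_info` are hypothetical. It also assumes that one multiplicity is given per tuple and that a feature counts as absent as soon as any of its sources carries a '*'.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Grammar inferred from the examples above (an assumption, not a formal spec):
#   <genome index>[<source>[*][:<source>[*]...]-<multiplicity>]
_TUPLE = re.compile(r"^(\d+)(?:([A-Za-z*:]+)-(\d+))?$")


@dataclass
class HgmEntry:
    genome: int            # numeric genome index, e.g. 2 for galGal4
    sources: List[str]     # evidence source letters, e.g. ["E", "M"]
    mult: Optional[int]    # multiplicity, None if the tuple lists no evidence
    present: bool          # False if a source is marked with '*'


def parse_hgm_info(value: str) -> List[HgmEntry]:
    """Parse the value of an hgm_info attribute into one entry per genome."""
    entries = []
    for item in value.strip().strip('"').split(","):
        m = _TUPLE.match(item.strip())
        if m is None:
            raise ValueError(f"unrecognised hgm_info tuple: {item!r}")
        genome, srcs, mult = m.groups()
        tokens = srcs.split(":") if srcs else []
        entries.append(HgmEntry(
            genome=int(genome),
            sources=[t.rstrip("*") for t in tokens],
            mult=int(mult) if mult else None,
            present=not any(t.endswith("*") for t in tokens),
        ))
    return entries


# hgm_info of the second intron of jg6.t1
for entry in parse_hgm_info("3E:M-6,0,1,2E-2335,5,6E-53"):
    print(entry)
```

Applied to "6E*-108", the parser returns an entry for genome 6 with source E, multiplicity 108 and `present` set to False, which is the false-negative candidate discussed above.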

# gene feature level:
#
#
# number/% of features with exact homologs in at least k other genomes:
#
#---------------------------------------------------------------------------------------------------
#   k           CDS             Intr             Exon              Intr+Exon
#---------------------------------------------------------------------------------------------------
#
#   0         6 100.0%         5 100.0%         6 100.0%         11 100.0% *************************
#   1         6 100.0%         5 100.0%         5  83.3%         10  90.9% **********************
#   2         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   3         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   4         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   5         2  33.3%         1  20.0%         2  33.3%          3  27.3% ******
#   6         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   7         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#
# number/% of features supported by extrinsic evidence in at least k genomes :
#
#---------------------------------------------------------------------------------------------------
#   k           CDS             Intr             Exon              Intr+Exon
#---------------------------------------------------------------------------------------------------
#
#   0         6 100.0%         5 100.0%         6 100.0%         11 100.0% *************************
#   1         6 100.0%         5 100.0%         0   0.0%          5  45.5% ***********
#   2         0   0.0%         4  80.0%         0   0.0%          4  36.4% *********
#   3         0   0.0%         3  60.0%         0   0.0%          3  27.3% ******
#   4         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   5         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   6         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   7         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   8         0   0.0%         0   0.0%         0   0.0%          0   0.0%

The gene feature level reports the cumulative number and percentage of CDS/intron/exon features of the transcript T that are

consistent with CDS/intron/exon features in at least k other genomes (first table)
supported by hints in at least k genomes (second table)
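The intron column of the first table can be recomputed from the 'hgm_info' strings in the GTF listing above. The following is a rough sketch under stated assumptions: the helper names `homolog_genomes` and `at_least_k` are hypothetical, the genome of T is taken to be index 3 (hg38), and a tuple containing '*' is treated as "feature not present" in that genome.

```python
import re

TARGET_GENOME = 3  # hg38, the genome of transcript T in this example

# hgm_info strings of the five introns of jg6.t1, copied from the GTF above
INTRONS = [
    "3E:M-6,5",
    "3E:M-6,0,1,2E-2335,5,6E-53",
    "3E:M-29,0,2E-1908,5,6E-65",
    "3E:M-28,0,1,2E-1877,5,6E*-108",
    "3E:M-18,0,1,2E-1823,5",
]


def homolog_genomes(hgm_info, target=TARGET_GENOME):
    """Return the other genomes in which a consistent feature is present."""
    genomes = set()
    for item in hgm_info.split(","):
        index = int(re.match(r"\d+", item).group())
        # skip T's own genome and tuples flagged with '*' (feature absent)
        if index != target and "*" not in item:
            genomes.add(index)
    return genomes


def at_least_k(counts, max_k):
    """For each k in 0..max_k, count the features whose value is >= k."""
    return [sum(1 for c in counts if c >= k) for k in range(max_k + 1)]


counts = [len(homolog_genomes(s)) for s in INTRONS]
table = at_least_k(counts, max_k=7)
for k, n in enumerate(table):
    print(f"{k:>3} {n:>3} {100.0 * n / len(counts):5.1f}%")
```

The per-intron counts are 1, 5, 4, 4 and 4, which gives 5, 5, 4, 4, 4, 1, 0, 0 for k = 0..7 and agrees with the 'Intr' column of the first table.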

Let's have a closer look at the first table. For k=3

#---------------------------------------------------------------------------------------------------
#   k           CDS             Intr             Exon              Intr+Exon
#---------------------------------------------------------------------------------------------------
#
#   0         6 100.0%         5 100.0%         6 100.0%         11 100.0% *************************
#   1         6 100.0%         5 100.0%         5  83.3%         10  90.9% **********************
#   2         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   3         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   4         5  83.3%         4  80.0%         4  66.7%          8  72.7% ******************
#   5         2  33.3%         1  20.0%         2  33.3%          3  27.3% ******
#   6         0   0.0%         0   0.0%         0   0.0%          0   0.0% 
#   7         0   0.0%         0   0.0%         0   0.0%          0   0.0%

we can see that

5 out of the 6 CDS exons (83.3%) are consistent with a CDS exon
4 out of the 5 introns (80.0%) are consistent with an intron
4 out of the 6 exons (66.7%) are consistent with an exon

in at least k=3 of the other gene sets. For k=1, the second table shows that

#---------------------------------------------------------------------------------------------------
#   k           CDS             Intr             Exon              Intr+Exon
#---------------------------------------------------------------------------------------------------
#
#   0         6 100.0%         5 100.0%         6 100.0%         11 100.0% *************************
#   1         6 100.0%         5 100.0%         0   0.0%          5  45.5% ***********
#   2         0   0.0%         4  80.0%         0   0.0%          4  36.4% *********
#   3         0   0.0%         3  60.0%         0   0.0%          3  27.3% ******
#   4         0   0.0%         0   0.0%         0   0.0%          0   0.0%
#   5         0   0.0%         0   0.0%         0   0.0%          0   0.0%
#   6         0   0.0%         0   0.0%         0   0.0%          0   0.0%
#   7         0   0.0%         0   0.0%         0   0.0%          0   0.0%
#   8         0   0.0%         0   0.0%         0   0.0%          0   0.0%

6 out of the 6 CDS exons (100.0%)
5 out of the 5 introns (100.0%)

are supported by evidence (any source) in at least k=1 of the genomes. Note that the database contains no 'exon' hints, so none of the exon features have support.
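The second table can be checked in the same way by counting, per feature, the genomes whose tuple names at least one evidence source. This is an illustrative sketch only: `evidence_genomes` and `support_column` are hypothetical helpers, and it assumes that the genome of T itself is counted and that a starred source such as "6E*-108" still counts as evidence, as the numbers in the table suggest.

```python
import re

# hgm_info strings of the introns and exons of jg6.t1, copied from the GTF above
INTRONS = [
    "3E:M-6,5",
    "3E:M-6,0,1,2E-2335,5,6E-53",
    "3E:M-29,0,2E-1908,5,6E-65",
    "3E:M-28,0,1,2E-1877,5,6E*-108",
    "3E:M-18,0,1,2E-1823,5",
]
EXONS = [
    "3,5",
    "3,0,1,2,5,6",
    "3,0,2,5,6",
    "3,0,1,2,5,6",
    "3,0,1,2,5",
    "3",
]


def evidence_genomes(hgm_info):
    """Return the genomes whose tuple names at least one evidence source."""
    genomes = set()
    for item in hgm_info.split(","):
        m = re.match(r"(\d+)([A-Za-z])?", item)
        if m.group(2):  # a source letter follows the genome index
            genomes.add(int(m.group(1)))
    return genomes


def support_column(features, max_k=8):
    """Number of features with evidence in at least k genomes, k = 0..max_k."""
    counts = [len(evidence_genomes(s)) for s in features]
    return [sum(c >= k for c in counts) for k in range(max_k + 1)]


print("Intr:", support_column(INTRONS))
print("Exon:", support_column(EXONS))
```

For the introns this yields 5, 5, 4, 3 and then zeros, and for the exons 6 followed by zeros, matching the 'Intr' and 'Exon' columns of the second table.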

# transcript level:
#
#   0 gene_id=jg4                           tx_id=jg4.t1                          5/6 CDS  4/5 Intr  4/6 exon
#   1 gene_id=jg7                           tx_id=jg7.t1                          4/6 CDS  3/5 Intr  3/6 exon
#   2 gene_id=jg5                           tx_id=jg5.t1                          5/6 CDS  4/5 Intr  4/6 exon
#   5 gene_id=jg4                           tx_id=jg4.t1                          6/6 CDS  5/5 Intr  5/6 exon
#   6 gene_id=jg5                           tx_id=jg5.t1                          3/5 CDS  2/4 Intr  3/5 exon
#
# transcript has an exact homolog in 0 other genomes.

The transcript level itemizes all transcripts in any of the other genomes that are partly consistent with transcript T. For example, the transcript jg7.t1 in species 1 (=canFam3) has

4 out of 6 CDS exons
3 out of 5 introns
3 out of 6 exons

that are consistent with T. A transcript is only considered an 'exact homolog' of T if all of its exons are consistent with T and it has the same number of exons as T.
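The 'exact homolog' criterion can be applied directly to the transcript-level lines. The sketch below is an assumption-laden illustration rather than the tool's own code: it reads each "a/b exon" entry as "a of the b exons of the other transcript are consistent with T", takes the exon count of T (6) from the GTF listing, and uses the hypothetical helper `is_exact_homolog`.

```python
import re

T_EXONS = 6  # number of exon features of T (jg6.t1) in the GTF listing

# Transcript-level lines from the output above, whitespace condensed.
TRANSCRIPT_LEVEL = """\
# 0 gene_id=jg4 tx_id=jg4.t1 5/6 CDS 4/5 Intr 4/6 exon
# 1 gene_id=jg7 tx_id=jg7.t1 4/6 CDS 3/5 Intr 3/6 exon
# 2 gene_id=jg5 tx_id=jg5.t1 5/6 CDS 4/5 Intr 4/6 exon
# 5 gene_id=jg4 tx_id=jg4.t1 6/6 CDS 5/5 Intr 5/6 exon
# 6 gene_id=jg5 tx_id=jg5.t1 3/5 CDS 2/4 Intr 3/5 exon
"""

LINE = re.compile(
    r"#\s+(\d+)\s+gene_id=\S+\s+tx_id=(\S+)\s+"
    r"(\d+)/(\d+)\s+CDS\s+(\d+)/(\d+)\s+Intr\s+(\d+)/(\d+)\s+exon"
)


def is_exact_homolog(consistent_exons, total_exons, t_exons=T_EXONS):
    """All exons consistent with T and the same number of exons as T."""
    return consistent_exons == total_exons == t_exons


exact = []
for line in TRANSCRIPT_LEVEL.splitlines():
    m = LINE.match(line)
    genome, tx_id = int(m.group(1)), m.group(2)
    consistent, total = int(m.group(7)), int(m.group(8))
    if is_exact_homolog(consistent, total):
        exact.append((genome, tx_id))

n_genomes = len({genome for genome, _ in exact})
print(f"transcript has an exact homolog in {n_genomes} other genomes.")
```

Even jg4.t1 in species 5, whose CDS exons and introns all match, fails the test because only 5 of its 6 exons are consistent, so the count is 0, as reported in the output above.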
