Navigate to Lab Session on AUGUSTUS. Using Scipio Training AUGUSTUS. Predicting Genes. AUGUSTUS-PPX.
Show all / no details.

Using Scipio to create a (training) set of gene structures

In the case where a genome and protein sequences from the same or very closely related species are given, Scipio can be used to construct the gene structure from the sequences. Scipio uses BLAT to align the protein sequences against the genome and then determines the exact boundaries of the exons, and adds small exons that were not found by BLAT.

[+] Installing Scipio...
[+] Some comments about when to use Scipio...

1. Run Scipio

As example input we use data from 5 mega bases of the Drosophila genome. We first create symbolic links to have aliases to the file names with intuitive names
ln -s chr2R.2M-7M.fa genome.fa
ln -s chr2R.2M-7M.aa proteins.aa
(The first two megabases at the chromosome tip were not taken because they are highly repetitive.)

Run Scipio:
scipio.1.4.pl --blat_output=prot.vs.genome.psl genome.fa proteins.aa > scipio.yaml   # takes ~7m
The option --blat_output=prot.vs.genome.psl means that the time-consuming protein-to-dna alignments are stored, in case one wants to rerun Scipio with other parameters. The file scipio.yaml now contains for each protein alignment details, in particular also the exact sequence coordinates we require. (Protein BLAT resolves exon boundaries at best up the accuracy of complete amino acids, which is not enough for creating a training set of gene structures.)

Extract a GFF file from the yaml output and convert the GFF output to one suitable for further processing
cat scipio.yaml | yaml2gff.1.4.pl > scipio.scipiogff
scipiogff2gff.pl --in=scipio.scipiogff --out=scipio.gff
The command
cat scipio.yaml | yaml2log.1.4.pl > scipio.log
creates a log file that can be examined to see for each protein whether a perfect alignment was found (status: complete).

The file scipio.gff should now look like this
chr2R   Scipio  CDS     900562  900621  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904518  904880  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904940  905131  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     905195  905263  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     3595076 3596041 1.000   +       0       transcript_id "2517"
...

1.1 Optional: visualize gene structures

We will check what the gene structures constructed above "look like" using the UCSC Genome Browser. This is here possible as the UCSC Genome Browser holds Drosophila assembly dm3. In general, you will have to set up your own browser (e.g. GBrowse).

Prepare a custom track file
echo -e "browser position chr2R:3000000-3200000\n\
browser hide multiz15way bdtnpChipper\n\
track name=scipio description=\"training and test genes\"\
 db=dm3 offset=2000000 visibility=3" > scipio.browser
cat scipio.gff >> scipio.browser
gzip -c scipio.browser > scipio.browser.gz
The file scipio.browser and its packed version scipio.browser.gz will now look like this
browser position chr2R:3000000-3200000
browser hide multiz15way bdtnpChipper
track name=scipio description="training and test genes" db=dm3 offset=2000000 visibility=2
chr2R   Scipio  CDS     900562  900621  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904518  904880  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904940  905131  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     905195  905263  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     3595076 3596041 1.000   +       0       transcript_id "2517"
...
This is a format that the UCSC Genome Browser understands. The first three lines tell it which region to display first (chr2R:3000000-3200000), to hide the tracks multiz15way and bdtnpChipper (we don't need to look at them), the track name (scipio) a header for the track, the assembly it refers to (dm3), that the Browser needs to add 2,000,000 to the coordinates because our example genome starts after 2 mega bases only (offset=2000000) and that the display mode should be "full" (visibility=2).
  1. open the browser for Drosophila melanogaster,
  2. click on "add custom tracks" or "manage custom tracks",
  3. upload the file scipio.browser.gz and
  4. click on the link in the column "Pos" in the table of custom tracks.
The lazy ones can look at the results by clicking this link to a previously prepared custom track.

The UCSC Genome Browser group has an excellent help page for setting up UCSC custom tracks.

2. Convert GFF file to Genbank format file

The training program etraining of the AUGUSTUS package requires input in the form of Genbank format, that includes both the genomic sequences and the sequence coordinates of the coding regions CDS and optionally of the UTR. In many cases the coordinates of the gene structures are available in GFF format or GTF format. The tool gff2gbSmallDNA.pl that comes with AUGUSTUS can be used to convert files in some GFF formats or GTF to Genbank format:
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb
This command takes each gene in scipio.gff together with 1000 bp flanking intergenic region upstream and downstream of the gene and creates a LOCUS in Genbank format.

[+] Input format of gff2gbSmallDNA.pl...

[+] Size of flanking regions...

The file genes.raw.gb now contains training genes in the right format. However, some fraction of them may be problematic with respect to AUGUSTUS training. They may contain genes with nonconsensus splice sites, missing start codons, missing stop codons, in-frame stop codons. To avoid warning messages later and because genes with these features are partially ignored anyways, we remove these problematic locis from genes.raw.gb:

etraining --species=generic --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err
etraining complains about and reports problematic genes. The option --stopCodonExcludedFromCDS determines whether in the input Genbank file the stop codon is considered to be part of the CDS or not. This is sometimes the case and sometimes not. When the GFF stems from other sources than Scipio this may have to be set to false rather than to true. We write the error messages into a file train.err which looks like this:
gene 186 transcr. 1 in sequence chr2R_549753-555284: Initial exon has length < 3!
gene 461 transcr. 1 in sequence chr2R_1034318-1036751: Initial Exon doesn't begin with start codon.
gene 567 transcr. 1 in sequence chr2R_1198437-1201521: Initial Exon doesn't begin with start codon.
gene 4537 transcr. 1 in sequence chr2R_1354359-1361857: Initial Exon doesn't begin with start codon.
gene 4783 transcr. 1 in sequence chr2R_1669860-1673241: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
gene 5161 transcr. 1 in sequence chr2R_2043765-2046183: Single exon gene doesn't begin with atg codon but with ccc
gene 6319 transcr. 1 in sequence chr2R_3734070-3735386: Initial Exon doesn't begin with start codon.
gene 3577 transcr. 1 in sequence chr2R_4767472-4770826: Initial Exon doesn't begin with start codon.

We can now filter out the problematic genes, e.g. by issuing
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
# genes.raw.gb:594
# genes.gb:586
Here, badgenes.lst ist a file with gene names that were reportet in train.err.

genes.gb now contains 586 loci with one "unproblematic" gene each.