In the case where a genome and protein sequences from the same or very closely related species are given,
Scipio can be used to construct the gene structure from the sequences.
Scipio uses BLAT to align the protein sequences against the genome
and then determines the exact boundaries of the exons, and adds small exons that were not found by BLAT.
ln -s chr2R.2M-7M.fa genome.fa ln -s chr2R.2M-7M.aa proteins.aa(The first two megabases at the chromosome tip were not taken because they are highly repetitive.)
scipio.1.4.pl --blat_output=prot.vs.genome.psl genome.fa proteins.aa > scipio.yaml # takes ~7mThe option --blat_output=prot.vs.genome.psl means that the time-consuming protein-to-dna alignments are stored, in case one wants to rerun Scipio with other parameters. The file scipio.yaml now contains for each protein alignment details, in particular also the exact sequence coordinates we require. (Protein BLAT resolves exon boundaries at best up the accuracy of complete amino acids, which is not enough for creating a training set of gene structures.)
cat scipio.yaml | yaml2gff.1.4.pl > scipio.scipiogff scipiogff2gff.pl --in=scipio.scipiogff --out=scipio.gffThe command
cat scipio.yaml | yaml2log.1.4.pl > scipio.logcreates a log file that can be examined to see for each protein whether a perfect alignment was found (status: complete).
chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392" chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392" chr2R Scipio CDS 904940 905131 1.000 + 0 transcript_id "392" chr2R Scipio CDS 905195 905263 1.000 + 0 transcript_id "392" chr2R Scipio CDS 3595076 3596041 1.000 + 0 transcript_id "2517" ...
We will check what the gene structures constructed above "look like" using the UCSC Genome Browser. This is here possible as the UCSC Genome Browser holds Drosophila assembly dm3. In general, you will have to set up your own browser (e.g. GBrowse).
Prepare a custom track fileecho -e "browser position chr2R:3000000-3200000\n\ browser hide multiz15way bdtnpChipper\n\ track name=scipio description=\"training and test genes\"\ db=dm3 offset=2000000 visibility=3" > scipio.browser cat scipio.gff >> scipio.browser gzip -c scipio.browser > scipio.browser.gzThe file scipio.browser and its packed version scipio.browser.gz will now look like this
browser position chr2R:3000000-3200000 browser hide multiz15way bdtnpChipper track name=scipio description="training and test genes" db=dm3 offset=2000000 visibility=2 chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392" chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392" chr2R Scipio CDS 904940 905131 1.000 + 0 transcript_id "392" chr2R Scipio CDS 905195 905263 1.000 + 0 transcript_id "392" chr2R Scipio CDS 3595076 3596041 1.000 + 0 transcript_id "2517" ...This is a format that the UCSC Genome Browser understands. The first three lines tell it which region to display first (chr2R:3000000-3200000), to hide the tracks multiz15way and bdtnpChipper (we don't need to look at them), the track name (scipio) a header for the track, the assembly it refers to (dm3), that the Browser needs to add 2,000,000 to the coordinates because our example genome starts after 2 mega bases only (offset=2000000) and that the display mode should be "full" (visibility=2).
The UCSC Genome Browser group has an excellent help page for setting up UCSC custom tracks.
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gbThis command takes each gene in scipio.gff together with 1000 bp flanking intergenic region upstream and downstream of the gene and creates a LOCUS in Genbank format.
The file genes.raw.gb now contains training genes in the right format. However, some fraction of them may be problematic with respect to AUGUSTUS training. They may contain genes with nonconsensus splice sites, missing start codons, missing stop codons, in-frame stop codons. To avoid warning messages later and because genes with these features are partially ignored anyways, we remove these problematic locis from genes.raw.gb:
etraining --species=generic --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.erretraining complains about and reports problematic genes. The option --stopCodonExcludedFromCDS determines whether in the input Genbank file the stop codon is considered to be part of the CDS or not. This is sometimes the case and sometimes not. When the GFF stems from other sources than Scipio this may have to be set to false rather than to true. We write the error messages into a file train.err which looks like this:
gene 186 transcr. 1 in sequence chr2R_549753-555284: Initial exon has length < 3! gene 461 transcr. 1 in sequence chr2R_1034318-1036751: Initial Exon doesn't begin with start codon. gene 567 transcr. 1 in sequence chr2R_1198437-1201521: Initial Exon doesn't begin with start codon. gene 4537 transcr. 1 in sequence chr2R_1354359-1361857: Initial Exon doesn't begin with start codon. gene 4783 transcr. 1 in sequence chr2R_1669860-1673241: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right? gene 5161 transcr. 1 in sequence chr2R_2043765-2046183: Single exon gene doesn't begin with atg codon but with ccc gene 6319 transcr. 1 in sequence chr2R_3734070-3735386: Initial Exon doesn't begin with start codon. gene 3577 transcr. 1 in sequence chr2R_4767472-4770826: Initial Exon doesn't begin with start codon.We can now filter out the problematic genes, e.g. by issuing
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst filterGenes.pl badgenes.lst genes.raw.gb > genes.gb grep -c "LOCUS" genes.raw.gb genes.gb # genes.raw.gb:594 # genes.gb:586Here, badgenes.lst ist a file with gene names that were reportet in train.err.