Scipio

Using Scipio to create a (training) set of gene structures

When a genome and protein sequences from the same or a very closely related species are available, Scipio can be used to reconstruct the gene structures from the sequences. Scipio uses BLAT to align the protein sequences against the genome, then determines the exact exon boundaries and adds small exons that BLAT did not find.

Installing Scipio

Download Scipio from http://www.webscipio.org/webscipio/download_scipio.
Scipio requires BLAT and the Perl YAML module (YAML.pm) to be installed.
Download and install BLAT.
On Ubuntu Linux, install libyaml-perl with:

sudo apt-get install libyaml-perl

Some comments about when to use Scipio

Scipio post-processes spliced alignments from BLAT. It is therefore suitable only when the input protein sequences are expected to be fairly similar to the proteins encoded by the genome (e.g. >90% similarity). One use case is the migration of gene structures from one assembly to another. Another is the identification of exon coordinates for proteins of the same species as the genome when the gene structures (GFF) are no longer available. A third is the mapping of protein sequences from a related, known species to a new species, e.g. from human to orangutan. Scipio is well suited to draft assemblies with short contigs because it can assemble alignments in which different fragments of a protein match different contigs.

1. Run Scipio

chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 904940 905131 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 905195 905263 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 3595076 3596041 1.000 + 0 transcript_id "2517"
...

1.1 Optional: visualize gene structures

We will check what the gene structures constructed above "look like" using the UCSC Genome Browser. This is possible here because the UCSC Genome Browser hosts the Drosophila assembly dm3. In general, you will have to set up your own browser (e.g. GBrowse).

echo -e "browser position chr2R:3000000-3200000\n\
browser hide multiz15way bdtnpChipper\n\
track name=scipio description=\"training and test genes\"\
 db=dm3 offset=2000000 visibility=3" > scipio.browser
cat scipio.gff >> scipio.browser
gzip -c scipio.browser > scipio.browser.gz

browser position chr2R:3000000-3200000
browser hide multiz15way bdtnpChipper
track name=scipio description="training and test genes" db=dm3 offset=2000000 visibility=3
chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 904940 905131 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 905195 905263 1.000 + 0 transcript_id "392"
chr2R Scipio CDS 3595076 3596041 1.000 + 0 transcript_id "2517"
...
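The layout of the custom track file (browser lines, one track line, then the GFF records) can also be produced programmatically. The following is a minimal Python sketch of that layout; the function names build_custom_track and write_gzipped are hypothetical helpers introduced only for this illustration, and the header values are the ones used in the shell commands above.

```python
import gzip


def build_custom_track(gff_text, position, hide, name, description,
                       db, offset, visibility):
    """Return the text of a custom track file: header lines followed by GFF.

    Hypothetical helper for illustration; it only mirrors the header
    layout shown in this section.
    """
    header = [f"browser position {position}"]
    if hide:
        header.append("browser hide " + " ".join(hide))
    header.append(
        f'track name={name} description="{description}" '
        f"db={db} offset={offset} visibility={visibility}"
    )
    return "\n".join(header) + "\n" + gff_text


def write_gzipped(text, path):
    """Write text to a gzip-compressed file (like gzip -c in > out.gz)."""
    with gzip.open(path, "wt") as handle:
        handle.write(text)


example_gff = (
    'chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392"\n'
    'chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392"\n'
)
track_text = build_custom_track(
    example_gff,
    position="chr2R:3000000-3200000",
    hide=["multiz15way", "bdtnpChipper"],
    name="scipio",
    description="training and test genes",
    db="dm3",
    offset=2000000,
    visibility=3,
)
```

Calling write_gzipped(track_text, "scipio.browser.gz") would then write the compressed file.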

2. Convert GFF file to Genbank format file

The last column (column 9) of the GFF input file is used to group exons of the same transcript; see scipio.gff above for an example. In GTF format, features are grouped by their transcript_id field. If the last column contains the string Parent= (as in GFF3), that value is used for grouping. If neither standard format is detected, the whole ninth column is used to group exons into genes. Genes can be converted with coding regions only (CDS in the third column) or together with UTR annotation (keyword UTR or exon in the third column). Output can also be restricted with lists of gene names to include or exclude:

Usage: gff2gbSmallDNA.pl gff-file seq-file max-size-of-gene-flanking-DNA output-file [options]
options:
--bad=badfile    Specify a file with gene names. All except these are included in the output.
--good=goodfile  Specify a file with gene names. Only these genes are included in the output.
--overlap        Overlap filtering turned off.
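
The grouping rules described above can be illustrated with a short Python sketch. This is not the code of gff2gbSmallDNA.pl; the function names, the order in which the two standard formats are tested, and the tolerance for space-separated columns are assumptions made for the example.

```python
import re
from collections import defaultdict


def group_key(attributes):
    """Pick the grouping key from column 9 (illustrative rules only)."""
    # GTF style: transcript_id "392"
    match = re.search(r'transcript_id "([^"]+)"', attributes)
    if match:
        return match.group(1)
    # GFF3 style: Parent=mRNA7
    match = re.search(r"Parent=([^;\s]+)", attributes)
    if match:
        return match.group(1)
    # No standard format detected: use the whole ninth column.
    return attributes.strip()


def group_features(gff_lines):
    """Group features by transcript; returns {key: [(seq, type, start, end)]}."""
    groups = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments and UCSC browser/track header lines.
        if not line or line.startswith(("#", "browser", "track")):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            # Assumption: also accept whitespace-separated columns.
            fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        seqname, _source, feature, start, end = fields[:5]
        groups[group_key(fields[8])].append(
            (seqname, feature, int(start), int(end))
        )
    return dict(groups)


example = [
    'chr2R Scipio CDS 900562 900621 1.000 + 0 transcript_id "392"',
    'chr2R Scipio CDS 904518 904880 1.000 + 0 transcript_id "392"',
    'chr2R Scipio CDS 3595076 3596041 1.000 + 0 transcript_id "2517"',
]
grouped = group_features(example)
```

Here grouped maps each transcript identifier to the list of its CDS features, which is the unit that is later written out as one locus.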

The flanking regions upstream and downstream of each gene are used to train the non-coding DNA content model, i.e. to estimate k-mer frequencies in these regions. When two neighboring genes in the GFF file are closer to each other than 2*max-size-of-gene-flanking-DNA, the flanking region is shortened so that half of the intergenic region between the two genes is included in each locus. To some extent, this avoids overlapping non-coding training regions and genes inside the "intergenic" region. The choice of the parameter max-size-of-gene-flanking-DNA is important for several reasons.

The flanking region should be large enough to allow a representative estimate of non-coding regions. When the GFF file contains only CDS, part of the flanking regions is UTR and may not be representative of intergenic regions (e.g. CpG islands in vertebrates).
Usually the GFF file is not complete and genes are missing from it. In that case the flanking regions may contain genic regions of genes that are missing from the GFF file. Apart from biasing the k-mer frequencies, this will yield poor specificity when these loci are used for assessing accuracy.
A small flanking region may lead to an overestimation of gene-level sensitivity if it is used for assessing accuracy. Generally, gene finders may be more accurate on sequences that contain exactly one locus and only short flanking regions.
A large flanking region may increase the running time of optimize_augustus.pl substantially.

A compromise between the above conflicting aims must be found.
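
The rule for shortening flanking regions between close neighbors can be made concrete with a small Python sketch. It is an illustration of the rule as described above, not the actual implementation in gff2gbSmallDNA.pl; the function name locus_bounds, the integer rounding, and the clipping at the sequence ends are assumptions.

```python
def locus_bounds(genes, max_flank, seq_len):
    """Compute locus boundaries for genes on one sequence.

    genes: list of (start, end) tuples, 1-based inclusive, sorted by
    start and assumed not to overlap. Returns one (locus_start,
    locus_end) tuple per gene.
    """
    loci = []
    for i, (start, end) in enumerate(genes):
        left = right = max_flank
        if i > 0:
            # Intergenic bases between the previous gene and this one.
            gap = max(0, start - genes[i - 1][1] - 1)
            if gap < 2 * max_flank:
                left = gap // 2  # each neighbor gets half the gap
        if i + 1 < len(genes):
            gap = max(0, genes[i + 1][0] - end - 1)
            if gap < 2 * max_flank:
                right = gap // 2
        # Clip the locus to the ends of the sequence.
        loci.append((max(1, start - left), min(seq_len, end + right)))
    return loci


example_genes = [(1000, 2000), (2501, 3000), (10000, 11000)]
example_loci = locus_bounds(example_genes, max_flank=1000, seq_len=20000)
```

In this example the first two genes are only 500 bases apart, so each receives 250 bases of the shared intergenic region, while the third gene keeps the full 1000-base flanks on both sides.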

The file genes.raw.gb now contains training genes in the right format. However, some fraction of them may be problematic with respect to AUGUSTUS training. They may include genes with non-consensus splice sites, missing start codons, missing stop codons or in-frame stop codons. To avoid warning messages later, and because genes with these features are partially ignored anyway, we remove these problematic loci from genes.raw.gb:

gene 186 transcr. 1 in sequence chr2R_549753-555284: Initial exon has length < 3!
gene 461 transcr. 1 in sequence chr2R_1034318-1036751: Initial Exon doesn't begin with start codon.
gene 567 transcr. 1 in sequence chr2R_1198437-1201521: Initial Exon doesn't begin with start codon.
gene 4537 transcr. 1 in sequence chr2R_1354359-1361857: Initial Exon doesn't begin with start codon.
gene 4783 transcr. 1 in sequence chr2R_1669860-1673241: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
gene 5161 transcr. 1 in sequence chr2R_2043765-2046183: Single exon gene doesn't begin with atg codon but with ccc
gene 6319 transcr. 1 in sequence chr2R_3734070-3735386: Initial Exon doesn't begin with start codon.
gene 3577 transcr. 1 in sequence chr2R_4767472-4770826: Initial Exon doesn't begin with start codon.
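
To build a list of the problematic loci from messages like these, the gene identifier and the sequence (locus) name can be extracted with a regular expression. The Python sketch below assumes the messages have exactly the form shown above; the function names are hypothetical, and how the resulting list is fed into a filtering step is left open.

```python
import re

# Matches e.g. "gene 186 transcr. 1 in sequence chr2R_549753-555284: <message>"
MESSAGE_RE = re.compile(
    r"gene (\S+) transcr\. (\S+) in sequence (\S+): (.*)"
)


def parse_messages(lines):
    """Return (gene_id, locus_name, message) for each recognised line."""
    records = []
    for line in lines:
        match = MESSAGE_RE.search(line.rstrip("\n"))
        if match:
            gene_id, _transcript, locus, message = match.groups()
            records.append((gene_id, locus, message))
    return records


def unique_loci(records):
    """Locus names in order of first appearance, without duplicates."""
    return list(dict.fromkeys(locus for _gene, locus, _msg in records))


example_messages = [
    "gene 186 transcr. 1 in sequence chr2R_549753-555284: "
    "Initial exon has length < 3!",
    "gene 5161 transcr. 1 in sequence chr2R_2043765-2046183: "
    "Single exon gene doesn't begin with atg codon but with ccc",
    "a line that is not an error message",
]
example_records = parse_messages(example_messages)
example_bad = unique_loci(example_records)
```

Writing one locus name per line, for example with "\n".join(example_bad), gives a plain list that can be reviewed before any loci are removed.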