Navigate to Lab Session on AUGUSTUS.
Using Scipio.
Training AUGUSTUS.
AUGUSTUS-PPX.
AUGUSTUS-PPX: Using protein profiles
This tutorial describes how to prepare and use protein profiles that
can be provided as an additional input to AUGUSTUS. AUGUSTUS will then
use its Protein Profile eXtension (PPX) to
identify and better predict genes matching the profile.
This way, collections of related protein sequences, represented by
multiple sequence alignments, can be extended by adding
sequences predicted on newly sequenced genomes.
The profiles AUGUSTUS uses are block profiles; a block
is a conserved region in a multiple sequence alignment of related
protein sequences that has no insertions or deletions. The block
profile is a matrix containing the position-specific frequencies of
block columns.
1. COLLECT AND PREPARE MULTIPLE SEQUENCE ALIGNMENTS
The first step in using the protein extension is to prepare a Multiple
Sequence Alignment. If you are working in a group specialized on
particular protein families, you may already have Multiple Sequence
Alignments of the protein sequences of interest at hand. As examples,
we will use Multiple Sequence Alignments downloaded from
PFAM. It is important that the
MSAs contain enough blocks in order to be convertible into useful
profiles.
Click on "Keyword Search" and enter "Kinesin". The first item "Kinesin
motor domain" will lead to the
Kinesin family. By
clicking on "Alignments", the Mulitple Sequence Alignments can be
downloaded directly.
Choose "FASTA" in the Formatting options and clicking on "Generate"
will produce the following example files:
Similarly, you can access the
Aldedh (Aldehyde
dehydrogenase) family as another example:
Now, convert the MSA into a profile by issuing the command:
msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 \
--blockscorefile=PF00225_seed.blocks.txt PF00225_seed.txt > PF00225_seed.prfl
This will give the message
Determining block columns...
done
Profile has 13 blocks with 138 columns.
Minimum admissible sequence length: 238
and create a file PF00225_seed.prfl that looks like this:
[name]
unknown
[dist]
# distance from previous block
# <min> <max>
8 1272
[block]
# block no. 0 follows, 87 sequences, length 9
# corresponding to MSA columns:
# 0-8
name=unknown_A
#
# <colnr> <probs for GDERKNQSTAVLIFYWHMCP>
# G D E R K ...
0 0.00627 0.00591 0.00999 0.82584 0.07328 ...
1 0.00559 0.00377 0.00473 0.00489 0.04273 ...
...
[dist]
# distance from previous block
# <min> <max>
17 84
[block]
...
[dist]
# distance from previous block
# <min> <max>
0 *
# created by:
# msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_seed.blocks PF00225_seed.fa
This file is in the block profile format that AUGUSTUS-PPX can process.
The file starts with a [name] section, stating the name of
the profile, which is followed by alternating [dist]
and [block] sections. The [dist] sections contain
the maximum and minimum distance of the blocks in the sequences of the
alignment, while the [block] section contains the position
specific frequencies, one line containing 20 values, one for each amino acid
describing one column of the alignment. Each line starting with #
is considered a comment and ignored.
[+]
Parameters for msa2prfl.pl ...
Three parameters were specified in the example; all of them are optional:
- prefix_from_seqnames
useful for PFAM alignments which are usually partial alignments
that do not cover the full member sequences, with sequence names
followed by the coordinates in the form /NNN-NNN. If the parameter
is specified, these coordinates are added to the profile coordinates.
- max_entropy=0.75
ensures that the columns of the
blocks are indeed conserved. High entropy means a highly random
distribution of amino acids. This way, block parts with a entropy of
more than 75% of the random entropy are discarded.
- blockscorefile=PF00225_seed.blocks.txt
creates a file
PF00225_seed.blocks.txt that looks as follows:
KIF7_DICDI:
33 unknown_A 4.63754 6.41905 33 RVRPLTELE 42
25 unknown_B 7.20572 6.42577 67 FTFDRIF 74
16 unknown_C 2.64746 3.25498 90 IVNDFL 96
...
8 unknown_K 8.90132 10.07313 300 YRDSKLTRVLQDSLG 315
1 unknown_L 4.88459 6.74977 316 NSKTSLIINCSP 328
8 unknown_M 3.49252 6.13358 336 ITTLQFGTRAKTI 349
--
KINH_HUMAN:
13 unknown_A 4.82384 6.52148 13 RFRPLNESE 22
23 unknown_B 6.53722 6.22901 45 YAFDRVF 52
...
This is useful to reconstruct the block motif in the original sequences
of the alignment, in the format
seqname:
(dist) (block) (score) (spec) (start) (motif) (end)
- dist: distance from previous block
- block: name of block
- score: mean odds-ratio score of block motif
- spec: specificity (measured in standard deviations above the mean score)
- start: start of the motif in the sequence
- motif: the block motif
- end: end of the motif in the sequence
A short usage explanation of all parameters is
displayed when you type:
msa2prfl.pl --help
For example, you can specify a name and accession number to be used
in the profile. If you don't, it will simply be called "unknown".
In the PFAM website, there is a choice between the seed and the full
alignment. An alignment with fewer sequences is more likely to
contain a good number of blocks: since no gaps are allowed in blocks,
already a single sequence can make a block impossible at a particular
position if it has a gap there. Trying to run msa2prfl.pl on
the full alignment results in the message
No blocks found in MSA. Use "prepareAlign" to eliminate sequences.
This alignment contains too many sequences: in each column, at least
one of the sequences has a gap. This can be resolved by deleting
gap-rich sequences from the alignment by using the
program prepareAlign, as follows:
prepareAlign < PF00225_full.txt > PF00225_filtered.txt
...
msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_full.blocks.txt PF00225_filtered.txt > PF00225.prfl
or, in one step:
prepareAlign < PF00225_full.txt | msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_full.blocks.txt - > PF00225.prfl
prepareAlign has some logging output, showing the columns that will go into blocks, ending with
...
Deleted 2154 sequences, 1632 sequences remaining.
This means that from the original alignment, 1632 sequences are used to create
the profile.
In the same way, prepare a file PF00171.prfl from PF00171_seed.txt or
PF00171_full.txt.
2. USE A FINGERPRINT SIGNATURE
As an alternative to Multiple Sequence Alignments,
fingerprints can be used as resource for protein signatures.
These are collected by the
PRINTS database. They offer the full database for download
via ftp.
An example file containing a data sets is provided here:
Converting the PRINTS database to block profiles is straightforward: Running
prints2prfl.pl PRINTS-example.txt
converts it into a block profile PR00249.prfl.
The full database contains about 2000 datasets (packed size: 11 MB).
The prints2prfl.pl script can be used to convert them
simultaneously into block profiles.
3. RUN A PRELIMINARY PROFILE SEARCH
Now that the profiles are ready, they need to be located in the genome.
AUGUSTUS-PPX could already be run with the profiles, but it will take too
much time on a whole chromosome when equipped with a protein profile. Therefore,
we will perform a fast block search to determine regions containing
profile hits.
Run a fast block search with the three profiles (PF00225, PF00171 and PR00249) on
the following example genome sequences:
The command to run the fast block search is
fastBlockSearch --cutoff=1.1 chr4.103M.fa PF00225_seed.prfl
(replace the genome and profile names as needed). The cutoff can be used to
adjust the sensitivity of the block search. The standard cutoff is 0.7, which
is very sensitive but can give many false positive hits for smaller profiles.
In any case, hits are sorted by score, so the last displayed hit is the best
one found in the region. Play with the cutoff in order to have more or less
hits displayed. (The cutoff is the average log score of the motifs found.)
The output should look like this:
Hits found in chr4 103000000 105000000
Score:207.987
Mult. score:4.83391
1081586 unknown_M[5,13] - 2.32574 5.04633 .....YATRLKNI
1103952 unknown_L - 4.85363 6.75245 NAKTRIICTITP
1103991 unknown_K - 8.38065 9.92928 YRDSKLTRILQNSLG
1104375 unknown_J - 3.96065 6.79408 RSLFILGQVIKKL
1106992 unknown_I - 9.22487 7.64306 LVDLAGSE
1115567 unknown_H[5,16] - 2.31869 5.58986 .....ESRHYGETKMN
1116319 unknown_G - 7.34282 8.29425 EIYNETITDLL
1117092 unknown_F - 5.10694 6.10274 VIPRAIHDIF
1117146 unknown_E - 9.43596 9.18891 QTASGKTYTM
1117176 unknown_D[1,8] - 5.73796 6.31532 .GTIFAYG
1117399 unknown_B[1,7] - 3.59083 5.03059 .CLDRVF
1119420 unknown_A[0,8] - 4.64107 6.44285 RVRPLNSR.
--
(You can suppress any logging output by putting "2> /dev/null"
after the command). The format is similar to that of the blockscore file
(which is optionally generated by msa2prfl.pl): It shows coordinate, strand, mean
odds-ratio score, and specificity of score, and the motif.
With this, you can choose a region around the hit for the final AUGUSTUS run.
4. RUN AUGUSTUS-PPX
Call
augustus --optCfgFile=ppx.cfg --predictionStart=1070000 --predictionEnd=1130000 \
--proteinprofile=PF00225_seed.prfl chr4.103M.fa > augustus-ppx.gff
(Again, replace the profile and region coordinates as needed). In this example, we added approximately
10 Kpbs around the profile hit, since the actual gene may be larger. The
config file ppx.cfg contains additional parameters for
AUGUSTUS:
sample 0
species human
progress true
gff3 true
/ProteinModel/block_threshold_spec 4.0
/ProteinModel/block_threshold_sens 0.4
/ProteinModel/blockpart_threshold_spec 4.5
print_blocks true
/IntronModel/allow_dss_consensus_gc true
/IntronModel/non_gt_dss_prob 0.01
Here, we do not use sampling (sampling is not supported in the protein extension),
we use the human training parameter set, and we want the output in gff3 format.
The second part contains the PPX-specific parameters. These are the default values
(given here since defaults are subject to change in future versions of AUGUSTUS),
adjusting the search for block hits. Lowering the spec parameter may result in
more false positive block hits, but is mostly safe (will only make runs slower).
Since insignificant blocks are discarded in the AUGUSTUS and fastBlockSearch runs before the prediction,
reducing the specificity can make more blocks be used in the prediction. Try this
by adding the parameter
--/ProteinModel/block_threshold_spec=2.5
to AUGUSTUS or the fast block search.
The last two parameters allow AUGUSTUS to predict introns that have a gc-ag splice
site (with a probability of 1%). This is not specific for the protein extension.
In order to compare the result with the ab-initio variant, run AUGUSTUS with the
same command, but without the proteinprofile parameter, to create a file
output-abinitio.gff:
augustus --optCfgFile=ppx.cfg --predictionStart=1070000 --predictionEnd=1130000 chr4.103M.fa > output-abinitio.gff
Repeat this with the other profiles at the genomic regions for which the fast block search has found
a hit.
5. VIEW THE RESULTS IN THE UCSC BROWSER
In order to format the results for
the UCSC browser, run the gff2browser.pl
script:
./gff2browser.pl 102999999 augustus-ppx.gff > augustus-ppx.browser
This adds an offset of 103 Mbps to the genome coordinates (since the provided example genome starts at
42Mbps of chromosome 3), and reformats the output so the browser can read it.
You can also call AUGUSTUS and gff2browser.pl in one step:
augustus [...] chr4.103M.fa | ./gff2browser.pl 102999999 > browser-output.gff
Add the files created from AUGUSTUS-PPX and AUGUSTUS abinitio as
custom tracks,
in order to compare the predictions. The difference is not always obvious, and
they may generate the same output. Predictions are likely to differ around the
blocks, so you may have to zoom to where block hits are displayed.