Navigate to Lab Session on AUGUSTUS. Using Scipio. Training AUGUSTUS. AUGUSTUS-PPX.

Show all / no details.

AUGUSTUS-PPX: Using protein profiles

This tutorial describes how to prepare and use protein profiles that can be provided as an additional input to AUGUSTUS. AUGUSTUS will then use its Protein Profile eXtension (PPX) to identify and better predict genes matching the profile.

This way, collections of related protein sequences, represented by multiple sequence alignments, can be extended by adding sequences predicted on newly sequenced genomes.

The profiles AUGUSTUS uses are block profiles; a block is a conserved region in a multiple sequence alignment of related protein sequences that has no insertions or deletions. The block profile is a matrix containing the position-specific frequencies of block columns.

1. COLLECT AND PREPARE MULTIPLE SEQUENCE ALIGNMENTS

The first step in using the protein extension is to prepare a Multiple Sequence Alignment. If you are working in a group specialized on particular protein families, you may already have Multiple Sequence Alignments of the protein sequences of interest at hand. As examples, we will use Multiple Sequence Alignments downloaded from PFAM. It is important that the MSAs contain enough blocks in order to be convertible into useful profiles.

Click on "Keyword Search" and enter "Kinesin". The first item "Kinesin motor domain" will lead to the Kinesin family. By clicking on "Alignments", the Mulitple Sequence Alignments can be downloaded directly. Choose "FASTA" in the Formatting options and clicking on "Generate" will produce the following example files:

Similarly, you can access the Aldedh (Aldehyde dehydrogenase) family as another example:

Now, convert the MSA into a profile by issuing the command:

msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 \
 --blockscorefile=PF00225_seed.blocks.txt PF00225_seed.txt > PF00225_seed.prfl

This will give the message

Determining block columns...
done
Profile has 13 blocks with 138 columns.
Minimum admissible sequence length: 238

and create a file PF00225_seed.prfl that looks like this:

[name]
unknown

[dist]
# distance from previous block
# <min> <max>
8	1272

[block]
# block no. 0 follows, 87 sequences, length 9
# corresponding to MSA columns:
# 0-8
name=unknown_A
#
# <colnr> <probs for GDERKNQSTAVLIFYWHMCP>
#	G	D	E	R	K	...
0	0.00627	0.00591	0.00999	0.82584	0.07328	...
1	0.00559	0.00377	0.00473	0.00489	0.04273	...
...

[dist]
# distance from previous block
# <min> <max>
17	84

[block]
...

[dist]
# distance from previous block
# <min> <max>
0	*

# created by:
# msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_seed.blocks PF00225_seed.fa

This file is in the block profile format that AUGUSTUS-PPX can process.

The file starts with a [name] section, stating the name of the profile, which is followed by alternating [dist] and [block] sections. The [dist] sections contain the maximum and minimum distance of the blocks in the sequences of the alignment, while the [block] section contains the position specific frequencies, one line containing 20 values, one for each amino acid describing one column of the alignment. Each line starting with # is considered a comment and ignored.

[+] Parameters for msa2prfl.pl ...

Three parameters were specified in the example; all of them are optional:

prefix_from_seqnames

partial

/NNN-NNN

max_entropy=0.75
blockscorefile=PF00225_seed.blocks.txt

PF00225_seed.blocks.txt

KIF7_DICDI:
33	unknown_A	 4.63754	 6.41905	   33 RVRPLTELE 42
25	unknown_B	 7.20572	 6.42577	   67 FTFDRIF 74
16	unknown_C	 2.64746	 3.25498	   90 IVNDFL 96
...
8	unknown_K	 8.90132	10.07313	  300 YRDSKLTRVLQDSLG 315
1	unknown_L	 4.88459	 6.74977	  316 NSKTSLIINCSP 328
8	unknown_M	 3.49252	 6.13358	  336 ITTLQFGTRAKTI 349
--
KINH_HUMAN:
13	unknown_A	 4.82384	 6.52148	   13 RFRPLNESE 22
23	unknown_B	 6.53722	 6.22901	   45 YAFDRVF 52
...

seqname:
(dist) (block) (score) (spec) (start) (motif) (end)

dist: distance from previous block
block: name of block
score: mean odds-ratio score of block motif
spec: specificity (measured in standard deviations above the mean score)
start: start of the motif in the sequence
motif: the block motif
end: end of the motif in the sequence

A short usage explanation of all parameters is displayed when you type:

msa2prfl.pl --help

For example, you can specify a name and accession number to be used in the profile. If you don't, it will simply be called "unknown".

In the PFAM website, there is a choice between the seed and the full alignment. An alignment with fewer sequences is more likely to contain a good number of blocks: since no gaps are allowed in blocks, already a single sequence can make a block impossible at a particular position if it has a gap there. Trying to run msa2prfl.pl on the full alignment results in the message

No blocks found in MSA. Use "prepareAlign" to eliminate sequences.

This alignment contains too many sequences: in each column, at least one of the sequences has a gap. This can be resolved by deleting gap-rich sequences from the alignment by using the program prepareAlign, as follows:

prepareAlign < PF00225_full.txt > PF00225_filtered.txt
...
msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_full.blocks.txt PF00225_filtered.txt > PF00225.prfl

or, in one step:

prepareAlign < PF00225_full.txt | msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75 --blockscorefile=PF00225_full.blocks.txt - > PF00225.prfl

prepareAlign has some logging output, showing the columns that will go into blocks, ending with

...
Deleted 2154 sequences, 1632 sequences remaining.

This means that from the original alignment, 1632 sequences are used to create the profile.

In the same way, prepare a file PF00171.prfl from PF00171_seed.txt or PF00171_full.txt.

2. USE A FINGERPRINT SIGNATURE

As an alternative to Multiple Sequence Alignments, fingerprints can be used as resource for protein signatures. These are collected by the PRINTS database. They offer the full database for download via ftp.

An example file containing a data sets is provided here:

PRINTS-example.txt

Converting the PRINTS database to block profiles is straightforward: Running

prints2prfl.pl PRINTS-example.txt

converts it into a block profile PR00249.prfl.

The full database contains about 2000 datasets (packed size: 11 MB). The prints2prfl.pl script can be used to convert them simultaneously into block profiles.

3. RUN A PRELIMINARY PROFILE SEARCH

Now that the profiles are ready, they need to be located in the genome. AUGUSTUS-PPX could already be run with the profiles, but it will take too much time on a whole chromosome when equipped with a protein profile. Therefore, we will perform a fast block search to determine regions containing profile hits.

Run a fast block search with the three profiles (PF00225, PF00171 and PR00249) on the following example genome sequences:

The command to run the fast block search is

fastBlockSearch --cutoff=1.1 chr4.103M.fa PF00225_seed.prfl

(replace the genome and profile names as needed). The cutoff can be used to adjust the sensitivity of the block search. The standard cutoff is 0.7, which is very sensitive but can give many false positive hits for smaller profiles. In any case, hits are sorted by score, so the last displayed hit is the best one found in the region. Play with the cutoff in order to have more or less hits displayed. (The cutoff is the average log score of the motifs found.)

The output should look like this:

Hits found in chr4 103000000 105000000
Score:207.987
Mult. score:4.83391
1081586 unknown_M[5,13] -       2.32574 5.04633 .....YATRLKNI
1103952 unknown_L       -       4.85363 6.75245 NAKTRIICTITP
1103991 unknown_K       -       8.38065 9.92928 YRDSKLTRILQNSLG
1104375 unknown_J       -       3.96065 6.79408 RSLFILGQVIKKL
1106992 unknown_I       -       9.22487 7.64306 LVDLAGSE
1115567 unknown_H[5,16] -       2.31869 5.58986 .....ESRHYGETKMN
1116319 unknown_G       -       7.34282 8.29425 EIYNETITDLL
1117092 unknown_F       -       5.10694 6.10274 VIPRAIHDIF
1117146 unknown_E       -       9.43596 9.18891 QTASGKTYTM
1117176 unknown_D[1,8]  -       5.73796 6.31532 .GTIFAYG
1117399 unknown_B[1,7]  -       3.59083 5.03059 .CLDRVF
1119420 unknown_A[0,8]  -       4.64107 6.44285 RVRPLNSR.
--

(You can suppress any logging output by putting "2> /dev/null" after the command). The format is similar to that of the blockscore file (which is optionally generated by msa2prfl.pl): It shows coordinate, strand, mean odds-ratio score, and specificity of score, and the motif.

With this, you can choose a region around the hit for the final AUGUSTUS run.

4. RUN AUGUSTUS-PPX

Call

augustus --optCfgFile=ppx.cfg --predictionStart=1070000 --predictionEnd=1130000 \
 --proteinprofile=PF00225_seed.prfl chr4.103M.fa > augustus-ppx.gff

(Again, replace the profile and region coordinates as needed). In this example, we added approximately 10 Kpbs around the profile hit, since the actual gene may be larger. The config file ppx.cfg contains additional parameters for AUGUSTUS:

sample          0
species         human
progress        true
gff3            true

/ProteinModel/block_threshold_spec      4.0
/ProteinModel/block_threshold_sens      0.4
/ProteinModel/blockpart_threshold_spec  4.5
print_blocks    true

/IntronModel/allow_dss_consensus_gc     true
/IntronModel/non_gt_dss_prob    0.01

Here, we do not use sampling (sampling is not supported in the protein extension), we use the human training parameter set, and we want the output in gff3 format.

The second part contains the PPX-specific parameters. These are the default values (given here since defaults are subject to change in future versions of AUGUSTUS), adjusting the search for block hits. Lowering the spec parameter may result in more false positive block hits, but is mostly safe (will only make runs slower). Since insignificant blocks are discarded in the AUGUSTUS and fastBlockSearch runs before the prediction, reducing the specificity can make more blocks be used in the prediction. Try this by adding the parameter

--/ProteinModel/block_threshold_spec=2.5

to AUGUSTUS or the fast block search.

The last two parameters allow AUGUSTUS to predict introns that have a gc-ag splice site (with a probability of 1%). This is not specific for the protein extension.

In order to compare the result with the ab-initio variant, run AUGUSTUS with the same command, but without the proteinprofile parameter, to create a file output-abinitio.gff:

augustus --optCfgFile=ppx.cfg --predictionStart=1070000 --predictionEnd=1130000 chr4.103M.fa > output-abinitio.gff

Repeat this with the other profiles at the genomic regions for which the fast block search has found a hit.

5. VIEW THE RESULTS IN THE UCSC BROWSER

In order to format the results for the UCSC browser, run the gff2browser.pl script:

./gff2browser.pl 102999999 augustus-ppx.gff > augustus-ppx.browser

This adds an offset of 103 Mbps to the genome coordinates (since the provided example genome starts at 42Mbps of chromosome 3), and reformats the output so the browser can read it.

You can also call AUGUSTUS and gff2browser.pl in one step:

augustus [...] chr4.103M.fa | ./gff2browser.pl 102999999 > browser-output.gff

Add the files created from AUGUSTUS-PPX and AUGUSTUS abinitio as custom tracks, in order to compare the predictions. The difference is not always obvious, and they may generate the same output. Predictions are likely to differ around the blocks, so you may have to zoom to where block hits are displayed.