Using codon models

Section author: Gavin Huttley

The basic paradigm for evolutionary modelling is:

  1. construct the codon substitution model

  2. construct the likelihood function

  3. modify the likelihood function (setting rules)

  4. provide the alignment(s)

  5. optimise

  6. get the results out

Note

In the following, a result followed by ‘…’ just means the output has been truncated for the sake of a succinct presentation.

Constructing the codon substitution model

For the time-reversible category, Cogent3 implements 4 basic rate matrices: i) NF models, which are nucleotide frequency weighted rate matrices initially described by Muse and Gaut (1994); ii) a variant of (i) that uses position-specific nucleotide frequencies; iii) TF models, which are tuple (codon in this case) frequency weighted rate matrices initially described by Goldman and Yang (1994); iv) CNF models, which use the conditional nucleotide frequency and were developed by Yap, Lindsay, Easteal and Huttley. These models can be created using the provided convenience functions, as is done here, or by directly calling the TimeReversibleCodon substitution model class and setting the argument mprob_model equal to:

  • NF, mprob_model='monomer'

  • NF with position specific nucleotide frequencies, mprob_model='monomers'

  • TF, mprob_model=None

  • CNF, mprob_model='conditional'
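The first three frequency types in the list above can be illustrated with plain Python, independently of Cogent3. This is only a sketch of what "nucleotide", "position-specific nucleotide" and "tuple" frequencies mean when counted from data; the toy sequences and helper names are hypothetical, it is not Cogent3's implementation, and the conditional nucleotide frequency used by CNF is not shown.

```python
from collections import Counter

# Hypothetical in-frame toy sequences, for illustration only.
seqs = ["ATGGCCAAA", "ATGGCTAAG", "ATGGCCAAG"]


def codons(seq):
    """Split an in-frame sequence into consecutive codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def normalise(counts):
    """Convert a Counter of counts into a dict of relative frequencies."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# (i) a single set of nucleotide frequencies pooled over all codon positions
monomer = normalise(Counter("".join(seqs)))

# (ii) a separate set of nucleotide frequencies for each codon position
monomers = [
    normalise(Counter(c[pos] for s in seqs for c in codons(s)))
    for pos in range(3)
]

# (iii) frequencies of whole codons (tuples)
tuple_freqs = normalise(Counter(c for s in seqs for c in codons(s)))

print(monomer)
print(monomers)
print(tuple_freqs)
```

The three results differ in how much structure they retain: (i) has at most 4 values, (ii) has up to 4 values per codon position, and (iii) has one value per observed codon.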

In the following I will construct GTR variants of i and iv and a HKY variant of iii.

We import these explicitly from the cogent3.evolve.models module.

These are functions, and calling them returns the indicated substitution model with the default behaviour of recoding gap characters into N’s.

In the following demonstration I will use only the CNF form (cnf).

For our example we load a sample alignment and tree as per usual. To reduce the computational overhead for this example we will limit the number of sampled taxa.

Standard test of neutrality

We construct a likelihood function and constrain the omega parameter (the ratio of nonsynonymous to synonymous substitutions) to equal 1. We also set some display formatting parameters.

We then provide an alignment and optimise the model. In the current case we just use the local optimiser (hiding progress to keep this document succinct). We then print the results.

Note

I’m going to specify a set of conditions that will be used for all optimiser steps. For those new to python, one can construct a dictionary with the following form {'argument_name': argument_value}, or alternatively dict(argument_name=argument_value). I’m doing the latter. This dictionary is then passed to functions/methods by prefacing it with **.
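The two dictionary forms and the ** unpacking described in the note can be sketched as follows. The optimise function here is a hypothetical stand-in that only echoes the settings it receives; its argument names and defaults are assumptions made for illustration, not the signature of the Cogent3 method.

```python
def optimise(local=False, max_restarts=0, tolerance=1e-6, show_progress=True):
    """Hypothetical stand-in for an optimiser call; returns the settings it was given."""
    return dict(
        local=local,
        max_restarts=max_restarts,
        tolerance=tolerance,
        show_progress=show_progress,
    )


# The two ways of building the same dictionary
as_literal = {"local": True, "show_progress": False}
as_call = dict(local=True, show_progress=False)
assert as_literal == as_call

# Prefacing the dictionary with ** passes each key as a keyword argument,
# so this is equivalent to optimise(local=True, show_progress=False)
settings = optimise(**as_call)
print(settings)
```

Arguments not present in the dictionary keep their defaults, which is why one dictionary can be reused for every optimisation step.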

In the above output, the first table shows the maximum likelihood estimates (MLEs) for the substitution model parameters that are ‘global’ in scope. For instance, the C/T=4.58 MLE indicates that the relative rate of substitutions between C and T is nearly 5 times the background substitution rate.

The above function has been fit using the default counting procedure for estimating the motif frequencies, i.e. codon frequencies are estimated as the average of the observed codon frequencies. If you wanted to numerically optimise the motif probabilities, then modify the likelihood function creation line to

lf = cnf.make_likelihood_function(tree, optimise_motif_probs=True)

We can then free up the omega parameter, but before we do that we’ll store the log-likelihood and number of free parameters for the current model form for reuse later.

We then conduct a likelihood ratio test of whether the MLE of omega significantly improves the fit over the constraint that it equals 1. We import the convenience function from the cogent3 stats module.
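The arithmetic of the likelihood ratio test can be sketched with the standard library alone. The statistic is twice the difference in log-likelihoods, and the degrees of freedom are the difference in the number of free parameters. The chi2_sf helper below is an assumption standing in for the Cogent3 convenience function (it handles integer degrees of freedom only), and the log-likelihoods and parameter counts are hypothetical, not results from this analysis.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:
        # df = 1: Q(1/2, z) = erfc(sqrt(z))
        prob, a = math.erfc(math.sqrt(half)), 0.5
    else:
        # df = 2: Q(1, z) = exp(-z)
        prob, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    while a < df / 2.0:
        prob += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return prob


def likelihood_ratio_test(lnL_null, nfp_null, lnL_alt, nfp_alt):
    """Return (LR statistic, degrees of freedom, p-value) for nested models."""
    LR = 2.0 * (lnL_alt - lnL_null)
    df = nfp_alt - nfp_null
    return LR, df, chi2_sf(LR, df)


# Hypothetical values for a null model (omega fixed at 1) and an
# alternate model with one extra free parameter (omega estimated).
LR, df, p = likelihood_ratio_test(-1000.0, 10, -995.0, 11)
print(f"LR={LR:.2f} df={df} p={p:.4g}")
```

This is why the log-likelihood and number of free parameters of the constrained model were stored before omega was freed.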

Not surprisingly, this is significant. We then ask whether the Human and Chimpanzee edges have a value of omega that is significantly different from the rest of the tree.

This is basically a replication of the original Huttley et al (2000) result for BRCA1.

Rate-heterogeneity model variants

It is also possible to specify rate-heterogeneity variants of these models. In the first instance we’ll create a likelihood function where these rate-classes are global across the entire tree. Because fitting these models can be time consuming I’m going to recreate the non-neutral likelihood function from above first, fit it, and then construct the rate-heterogeneity likelihood function. By doing this I can ensure that the richer model starts with parameter values that produce a log-likelihood the same as the null model, ensuring the subsequent optimisation step improves the likelihood over the null.

Now, we have a null model which we know (from having fit it above) has an omega MLE < 1. We will construct a rate-heterogeneity model with just 2 rate-classes (neutral and adaptive) that are separated by the boundary of omega=1. These rate-classes are specified as discrete bins in Cogent3 and the model configuration steps for a bin or bins are done using the set_param_rule method. To ensure the alternate model starts with a likelihood at least as good as that of the null, we need to make the probability of the neutral site-class bin ~= 1 (these are referenced by the bprobs parameter type) and assign the null model omega MLE to this class.

To get all the parameter MLEs (branch lengths, GTR terms, etc.) into the alternate model we get an annotated tree from the null model which will have these values associated with it.

We can then construct a new likelihood function, specifying the rate-class properties.

We define a very small value (epsilon) that is used to specify the starting values.

We now provide starting parameter values for omega for the two bins, setting the boundary

and provide the starting values for the bin probabilities (bprobs).

The above statement essentially assigns a probability of nearly 1 to the ‘neutral’ bin. We now set the alignment and fit the model.
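Why this choice of starting values reproduces the null likelihood can be seen from a small numerical sketch. Under a two-bin model the likelihood of each site is a mixture of its likelihood under each bin, weighted by the bin probabilities, so giving the 'neutral' bin a probability of 1 - epsilon makes the mixture almost identical to the null. The per-site likelihoods and the mixture_lnL helper are hypothetical and are not produced by Cogent3.

```python
import math


def mixture_lnL(site_liks_by_bin, bprobs):
    """Log-likelihood when every site is a mixture over rate-class bins."""
    n_sites = len(site_liks_by_bin[0])
    total = 0.0
    for site in range(n_sites):
        lik = sum(p * liks[site] for p, liks in zip(bprobs, site_liks_by_bin))
        total += math.log(lik)
    return total


# Hypothetical per-site likelihoods. 'neutral' plays the role of the bin
# that was assigned the null model omega MLE.
neutral = [0.020, 0.015, 0.030, 0.010]
adaptive = [0.010, 0.025, 0.005, 0.012]

epsilon = 1e-6
null_lnL = sum(math.log(lik) for lik in neutral)
start_lnL = mixture_lnL([neutral, adaptive], [1 - epsilon, epsilon])

print(null_lnL, start_lnL)
```

The two values agree to within roughly epsilon, so any improvement found by the optimiser is an improvement over the null model.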

We can get the posterior probabilities of site-classifications out of this model as

This is a DictArray class which stores the probabilities as a numpy.array.
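The quantity being reported can be sketched with Bayes' rule: the posterior probability that a site belongs to a bin is the bin probability times the site's likelihood under that bin, divided by the sum of that product over all bins. The following is a minimal illustration with hypothetical per-site likelihoods and a hypothetical helper name; it is not the Cogent3 method and returns plain lists rather than a DictArray.

```python
def bin_posteriors(site_liks_by_bin, bprobs):
    """Posterior probability of each bin at each site (one row per site)."""
    n_sites = len(site_liks_by_bin[0])
    posteriors = []
    for site in range(n_sites):
        joint = [p * liks[site] for p, liks in zip(bprobs, site_liks_by_bin)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors


# Hypothetical per-site likelihoods under each bin and bin probabilities
neutral = [0.020, 0.015, 0.030, 0.010]
adaptive = [0.010, 0.025, 0.005, 0.012]
pp = bin_posteriors([neutral, adaptive], [0.9, 0.1])

for site, row in enumerate(pp):
    print(site, [round(value, 3) for value in row])
```

Each row sums to 1, and a site is usually assigned to the bin with the largest posterior probability.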

Mixing branch and site-heterogeneity

The following implements a modification of the approach of Zhang, Nielsen and Yang (Mol Biol Evol, 22:2472–9, 2005). For this model class, there are groups of branches for which all positions are evolving neutrally but some proportion of those neutrally evolving sites change to adaptively evolving on so-called foreground edges. For the current example, we’ll define the Chimpanzee and Human branches as foreground and everything else as background. The following table defines the parameter scopes.

Note

Our implementation is not as parametrically succinct as that of Zhang et al.; we have 1 additional bin probability.

After Zhang et al, we first define a null model that has 2 rate classes ‘0’ and ‘1’. We also get all the MLEs out using get_statistics, just printing out the bin parameters table in the current case.

We’re also going to use the MLEs from the rate_lf model, since that nests within the more complex branch by rate-class model. This is unfortunately quite ugly compared with just using the annotated tree approach described above. It is currently necessary, however, due to a bug in constructing annotated trees for models with binned parameters.

We now create the more complex model,

and set from the nested null model the branch lengths,

GTR term MLEs,

binned parameter values,

and the bin probabilities.

The result of these steps is a rate/branch model whose initial parameter values give the same likelihood as the null.

rate_branch_lf.optimise(**optimiser_args)
print(rate_branch_lf)
Likelihood function statistics
log-likelihood = -6753.4561
number of free parameters = 21
=========================
      edge   bin    omega
-------------------------
    Galago     0     0.00
    Galago     1     1.00
    Galago    2a     0.00
    Galago    2b     1.00
 HowlerMon     0     0.00
 HowlerMon     1     1.00
 HowlerMon    2a     0.00
 HowlerMon    2b     1.00
    Rhesus     0     0.00
    Rhesus     1     1.00
    ...