Class DistanceMatrixCalculator

java.lang.Object
org.biojava.nbio.phylo.DistanceMatrixCalculator

public class DistanceMatrixCalculator extends Object
The DistanceMatrixCalculator methods generate a DistanceMatrix from a MultipleSequenceAlignment or from other indirect distance information (RMSD).
Since:
4.1.1
Author:
Aleix Lafita
  • Method Details

    • fractionalDissimilarity

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix fractionalDissimilarity(MultipleSequenceAlignment<C,D> msa) throws IOException
      The fractional dissimilarity (D) is defined as the fraction of sites that differ between two aligned sequences. The percentage of identity (PID) is the fraction of identical sites between two aligned sequences, so that:
       D = 1 - PID
       
      The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcFractionalDissimilarities(Msa)
      Parameters:
      msa - MultipleSequenceAlignment
      Returns:
      DistanceMatrix
      Throws:
      Exception
      IOException
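      The relation D = 1 - PID can be illustrated with a minimal, BioJava-independent sketch for a single pair of aligned sequences. The class name, the use of '-' as the gap character, and the reading of "ignored" as "skip every column where either sequence has a gap" are assumptions made for illustration; the forester implementation may handle details differently.

```java
final class FractionalDissimilaritySketch {

    // Returns D = 1 - PID for two aligned sequences of equal length.
    // Assumption: '-' marks a gap, and columns with a gap in either
    // sequence are skipped entirely.
    static double fractionalDissimilarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' || y == '-') {
                continue; // gapped position: ignored
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("No ungapped columns to compare");
        }
        double pid = (double) identical / compared;
        return 1.0 - pid;
    }

    public static void main(String[] args) {
        // 7 ungapped columns, 6 of them identical.
        System.out.println(fractionalDissimilarity("ACD-EFGH", "ACDKEYGH"));
    }
}
```

      In this example one column is gapped, so D is computed over the remaining seven columns, of which one differs.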
    • poissonDistance

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix poissonDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
      The Poisson (correction) evolutionary distance (d) is a function of the fractional dissimilarity (D), given by:
       d = -log(1 - D)
       
      The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcPoissonDistances(Msa)
      Parameters:
      msa - MultipleSequenceAlignment
      Returns:
      DistanceMatrix
      Throws:
      IOException
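      As a small worked illustration of the Poisson correction, the sketch below applies d = -log(1 - D) to a single value of D. It assumes the natural logarithm and uses a hypothetical class name; it is not the forester code.

```java
final class PoissonDistanceSketch {

    // d = -ln(1 - D); only defined for 0 <= D < 1.
    // Assumption: "log" in the formula is the natural logarithm.
    static double poissonDistance(double dissim) {
        if (dissim < 0.0 || dissim >= 1.0) {
            throw new IllegalArgumentException("D must be in [0, 1), got " + dissim);
        }
        return -Math.log(1.0 - dissim);
    }

    public static void main(String[] args) {
        for (double dissim : new double[] {0.1, 0.3, 0.5, 0.7}) {
            System.out.printf("D=%.1f  d=%.4f%n", dissim, poissonDistance(dissim));
        }
    }
}
```

      The corrected distance is always at least as large as D and grows without bound as D approaches 1.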
    • kimuraDistance

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix kimuraDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
      The Kimura evolutionary distance (d) is a correction of the fractional dissimilarity (D) specially needed for large evolutionary distances. It is given by:
       d = -log(1 - D - 0.2 * D^2)
       
      The equation is derived by fitting the relationship between the evolutionary distance (d) and the fractional dissimilarity (D) according to the PAM model of evolution (it is an empirical approximation for the method pamMLdistance(MultipleSequenceAlignment)). The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcKimuraDistances(Msa).
      Parameters:
      msa - MultipleSequenceAlignment
      Returns:
      DistanceMatrix
      Throws:
      IOException
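      The Kimura correction can be sketched for a single value of D as follows. The class name is hypothetical and the natural logarithm is assumed. Note that the logarithm's argument, 1 - D minus 0.2 times D squared, becomes non-positive for D above roughly 0.854, so the sketch rejects such inputs; how the forester implementation treats them is not stated here.

```java
final class KimuraDistanceSketch {

    // d = -ln(1 - D - 0.2 * D^2)
    // Assumption: natural logarithm; inputs that make the argument
    // non-positive are rejected rather than mapped to a special value.
    static double kimuraDistance(double dissim) {
        double arg = 1.0 - dissim - 0.2 * dissim * dissim;
        if (dissim < 0.0 || arg <= 0.0) {
            throw new IllegalArgumentException("Kimura distance undefined for D = " + dissim);
        }
        return -Math.log(arg);
    }

    public static void main(String[] args) {
        for (double dissim : new double[] {0.1, 0.3, 0.5, 0.7}) {
            System.out.printf("D=%.1f  d=%.4f%n", dissim, kimuraDistance(dissim));
        }
    }
}
```

      For any D greater than 0 the extra quadratic term makes this distance larger than the Poisson distance -log(1 - D), which is the intended stronger correction at large divergence.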
    • percentageIdentity

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix percentageIdentity(MultipleSequenceAlignment<C,D> msa)
      BioJava implementation for percentage of identity (PID). Although the name of the method is percentage of identity, the DistanceMatrix contains the fractional dissimilarity (D), computed as D = 1 - PID.

      It is recommended to use the method fractionalDissimilarity(MultipleSequenceAlignment) instead of this one.

      Parameters:
      msa - MultipleSequenceAlignment
      Returns:
      DistanceMatrix
    • fractionalDissimilarityScore

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix fractionalDissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
      The fractional dissimilarity score (Ds) is a relative measure of the dissimilarity between two aligned sequences. It is calculated as:
       Ds = sum_i( (max(M) - M_{a_i,b_i}) / (max(M) - min(M)) ) / L
       
      Where the sum over i runs through all the alignment positions, a_i and b_i are the AA at position i in the first and second aligned sequences, respectively, and L is the total length of the alignment (normalization).

      The fractional dissimilarity score (Ds) is in the interval [0, 1], where 0 means that the sequences are identical and 1 that the sequences are completely different.

      Gaps do not contribute to the similarity score calculation (gap penalty = 0).

      Parameters:
      msa - MultipleSequenceAlignment
      M - SubstitutionMatrix for similarity scoring
      Returns:
      DistanceMatrix
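      The normalization in the Ds formula can be illustrated with a toy example. The three-letter alphabet and the substitution matrix below are invented purely for illustration and do not correspond to any real matrix; the class name is hypothetical. Gap handling is deliberately left out, so the sketch accepts ungapped sequences only.

```java
final class FractionalDissimilarityScoreSketch {

    // Toy alphabet and toy substitution matrix (illustrative values only).
    // All diagonal entries equal max(M), so identical sequences score 0.
    static final String ALPHABET = "ABC";
    static final int[][] M = {
        { 2, -1,  0},
        {-1,  2, -2},
        { 0, -2,  2}
    };

    // Ds = sum_i( (max(M) - M[a_i][b_i]) / (max(M) - min(M)) ) / L
    static double fractionalDissimilarityScore(String a, String b) {
        if (a.isEmpty() || a.length() != b.length()) {
            throw new IllegalArgumentException("Need two non-empty sequences of equal length");
        }
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (int[] row : M) {
            for (int v : row) {
                max = Math.max(max, v);
                min = Math.min(min, v);
            }
        }
        double sum = 0.0;
        for (int i = 0; i < a.length(); i++) {
            int ai = ALPHABET.indexOf(a.charAt(i));
            int bi = ALPHABET.indexOf(b.charAt(i));
            if (ai < 0 || bi < 0) {
                throw new IllegalArgumentException("Unsupported symbol at column " + i);
            }
            sum += (double) (max - M[ai][bi]) / (max - min);
        }
        return sum / a.length(); // normalize by alignment length L
    }

    public static void main(String[] args) {
        System.out.println(fractionalDissimilarityScore("ABC", "ABB"));
    }
}
```

      With this toy matrix, a column scores 0 when the pair attains max(M) and 1 when it attains min(M), which is what keeps Ds inside [0, 1].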
    • dissimilarityScore

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix dissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
      The dissimilarity score is the additive inverse of the similarity score (sum of scores) between two aligned sequences using a substitution model (Substitution Matrix). The maximum dissimilarity score is taken to be the maximum similarity score between self-alignments (each sequence against itself). Calculation of the score is as follows:
       Ds = maxScore - sum_i(M_{a_i,b_i})
       
      It is recommended to use the method fractionalDissimilarityScore(MultipleSequenceAlignment, SubstitutionMatrix) instead, since there the maximum similarity score is relative to the Substitution Matrix rather than to the data set, and the score is normalized by the alignment length (fractional).

      Gaps do not have a contribution to the similarity score calculation (gap penalty = 0).

      Parameters:
      msa - MultipleSequenceAlignment
      M - SubstitutionMatrix for similarity scoring
      Returns:
      DistanceMatrix
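      A minimal sketch of this score, again with an invented three-letter alphabet and toy matrix, is given below. The class and method names are hypothetical, and '-' is assumed to be the gap character; gapped columns add nothing to the similarity sum, following the gap penalty of 0 stated above.

```java
final class DissimilarityScoreSketch {

    // Toy alphabet and toy substitution matrix (illustrative values only).
    static final String ALPHABET = "ABC";
    static final int[][] M = {
        { 2, -1,  0},
        {-1,  3, -2},
        { 0, -2,  4}
    };

    // Sum of substitution scores over all columns; gaps contribute 0.
    static int similarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        int sum = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' || y == '-') {
                continue; // gap penalty = 0
            }
            int ai = ALPHABET.indexOf(x);
            int bi = ALPHABET.indexOf(y);
            if (ai < 0 || bi < 0) {
                throw new IllegalArgumentException("Unsupported symbol at column " + i);
            }
            sum += M[ai][bi];
        }
        return sum;
    }

    // maxScore: the largest self-alignment score in the data set.
    static int maxSelfScore(String[] seqs) {
        int max = Integer.MIN_VALUE;
        for (String s : seqs) {
            max = Math.max(max, similarity(s, s));
        }
        return max;
    }

    // Ds = maxScore - sum_i(M[a_i][b_i])
    static int dissimilarity(String a, String b, int maxScore) {
        return maxScore - similarity(a, b);
    }

    public static void main(String[] args) {
        String[] seqs = {"ABC", "ABB", "AAA"};
        int maxScore = maxSelfScore(seqs);
        System.out.println(dissimilarity(seqs[0], seqs[1], maxScore));
    }
}
```

      Because maxScore is taken from the sequences at hand, adding or removing a sequence can change every pairwise value, which is the data-set dependence the recommendation above refers to.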
    • pamMLdistance

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix pamMLdistance(MultipleSequenceAlignment<C,D> msa)
      The PAM (Point Accepted Mutations) distance is a measure of evolutionary distance in protein sequences. The PAM unit represents an average substitution rate of 1% per site. The fractional dissimilarity (D) of two aligned sequences is related to the PAM distance (d) by the equation:
       D = sum_i( f_i * (1 - (M^d)_{i,i}) )
       
      Where the sum is over all 20 AA, f_i denotes the natural fraction of the given AA and M is the substitution matrix (in this case the PAM1 matrix), raised to the power d.

      To calculate the PAM distance between two aligned sequences the maximum likelihood (ML) approach is used, which consists of finding the d that maximizes the function:

       L(d) = product_i( f_{a_i} * (1 - (M^d)_{a_i,b_i}) )
       
      Where the product is over every position i in the alignment, and a_i and b_i are the AA at position i in the first and second aligned sequences, respectively.
      Parameters:
      msa - MultipleSequenceAlignment
      Returns:
      DistanceMatrix
    • structuralDistance

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0)
      The structural distance (dS) uses the structural similarity (or dissimilarity) from the structural alignment of two protein structures. It is based on the diffusive model for protein fold evolution (Grishin 1995). The structural deviations are captured as RMS deviations.
       dS_ij = (rmsd_max^2 / alpha^2) *
               ln( (rmsd_max^2 - rmsd_0^2) /
                   (rmsd_max^2 - rmsd_ij^2) )
       
      Parameters:
      rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
      alpha - change in CA positions introduced by a single AA substitution (Grishin 1995: 1 A)
      rmsdMax - estimated RMSD between proteins of the same fold when the percentage of identity is infinitely low (the maximum allowed RMSD of proteins with the same fold). (Grishin 1995: 5 A)
      rmsd0 - arithmetical mean of squares of the RMSD for identical proteins (Grishin 1995: 0.4 A)
      Returns:
      DistanceMatrix
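      The structural distance formula can be evaluated for a single RMSD value with the sketch below. The class name is hypothetical, and the choice to reject RMSD values at or above rmsdMax (where the logarithm is undefined) is an assumption of this sketch, not a statement about the library's behaviour. The example values are the Grishin 1995 parameters quoted above.

```java
final class StructuralDistanceSketch {

    // dS = (rmsdMax^2 / alpha^2) * ln( (rmsdMax^2 - rmsd0^2) / (rmsdMax^2 - rmsd^2) )
    // Assumption: RMSD values >= rmsdMax are rejected by this sketch.
    static double structuralDistance(double rmsd, double alpha, double rmsdMax, double rmsd0) {
        double max2 = rmsdMax * rmsdMax;
        double rmsd2 = rmsd * rmsd;
        if (rmsd < 0.0 || rmsd2 >= max2) {
            throw new IllegalArgumentException("RMSD must be in [0, rmsdMax), got " + rmsd);
        }
        double numerator = max2 - rmsd0 * rmsd0;
        double denominator = max2 - rmsd2;
        return (max2 / (alpha * alpha)) * Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Parameter values quoted from Grishin 1995: alpha = 1, rmsdMax = 5, rmsd0 = 0.4
        for (double rmsd : new double[] {0.4, 1.0, 2.0, 3.0, 4.0}) {
            System.out.printf("rmsd=%.1f  dS=%.4f%n", rmsd, structuralDistance(rmsd, 1.0, 5.0, 0.4));
        }
    }
}
```

      By construction the distance is 0 when the RMSD equals rmsd0, becomes negative for RMSD values below rmsd0, and diverges as the RMSD approaches rmsdMax.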
    • jointSeqStrucDistance

      public static <C extends Sequence<D>, D extends Compound> DistanceMatrix jointSeqStrucDistance(double[][] rmsdMat)
      The joint sequence-structure distance (dSS) is a combination of the sequence-based and the structure-based distances.
      Parameters:
      rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
      Returns:
      DistanceMatrix