org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

public class KolmogorovSmirnovTest extends Object

Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.

The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
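
The one-sample statistic defined above can be illustrated with a minimal sketch in plain Java. It does not use this class; the class name `OneSampleKsSketch` and the use of a `DoubleUnaryOperator` to stand in for the reference cdf \(F\) are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper, not part of Commons Math.
final class OneSampleKsSketch {

    /** Returns sup_x |F_n(x) - F(x)| for a continuous reference cdf F. */
    static double statistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // F_n jumps from i/n to (i+1)/n at sorted[i]; the supremum is
            // attained on one side of a jump, so both sides are checked.
            double below = f - (double) i / n;
            double above = (double) (i + 1) / n - f;
            d = Math.max(d, Math.max(below, above));
        }
        return d;
    }

    public static void main(String[] args) {
        // Reference distribution: uniform on [0, 1].
        DoubleUnaryOperator uniformCdf = x -> Math.min(1.0, Math.max(0.0, x));
        System.out.println(statistic(uniformCdf, new double[] {0.1, 0.4, 0.7}));
    }
}
```

For the three points in `main`, the largest gap occurs just after the last jump, where \(F_n = 1\) and \(F = 0.7\).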

Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:

For small samples (where the product of the sample sizes is less than 10000), the method presented in [4] is used to compute the exact p-value for the 2-sample test.
When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.

If the product of the sample sizes is less than 10000 and the sample data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[], double[], int, boolean) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.

In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
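
The effect of the strict parameter can be seen on a tiny case. The sketch below is plain Java and is not the algorithm used by exactP: it counts lattice paths from (0, 0) to (n, m), one path per way of splitting n + m distinct values into groups of sizes n and m, and is only practical for small samples. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Commons Math.
final class StrictVersusNonStrictSketch {

    /**
     * P(D_{n,m} > d) if strict, else P(D_{n,m} >= d), over all equally likely
     * splits of n + m distinct values into groups of sizes n and m.
     */
    static double exactP(double d, int n, int m, boolean strict) {
        // |i/n - j/m| is compared with d through |i*m - j*n| versus d*n*m.
        double bound = d * n * m;
        double eps = 1e-9 * Math.max(1.0, bound); // guards against rounding in d*n*m
        // paths[j] holds the number of paths reaching (i, j) without the
        // deviation ever exceeding d in the chosen (strict or non-strict) sense.
        double[] paths = new double[m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                double dev = Math.abs((double) i * m - (double) j * n);
                boolean exceeds = strict ? dev > bound + eps : dev >= bound - eps;
                if (exceeds) {
                    paths[j] = 0.0;
                } else if (i == 0 && j == 0) {
                    paths[j] = 1.0;
                } else {
                    double fromLeft = j > 0 ? paths[j - 1] : 0.0; // step from (i, j-1)
                    double fromBelow = i > 0 ? paths[j] : 0.0;    // step from (i-1, j)
                    paths[j] = fromLeft + fromBelow;
                }
            }
        }
        return 1.0 - paths[m] / binomial(n + m, n);
    }

    /** Binomial coefficient as a double; adequate for small arguments only. */
    static double binomial(int total, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) {
            c = c * (total - k + i) / i;
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(exactP(0.5, 2, 2, true));
        System.out.println(exactP(0.5, 2, 2, false));
    }
}
```

With n = m = 2 there are six splits: two give \(D_{2,2} = 1\) and four give \(D_{2,2} = 0.5\). For d = 0.5 the strict probability is therefore 1/3 and the non-strict probability is 1; the difference, 2/3, is the mass of the observed value.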

The methods used by the 2-sample default implementation are also exposed directly:

exactP(double, int, int, boolean) computes exact 2-sample p-values
approximateP(double, int, int) uses the asymptotic distribution

The boolean strict argument of exactP (and of the deprecated monteCarloP) allows the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).

References:

[1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
[2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
[3] Jasjeet S. Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R Journal of Statistical Software, 42(7): 1-52.
[4] Wilcox, Rand. 2012. Introduction to Robust Estimation and Hypothesis Testing, Chapter 5, 3rd Ed. Academic Press.

Note that [1] contains an error in computing h; refer to MATH-437 for details.

Since: 3.3

Field Summary

Fields

Modifier and Type

Field

Description

protected static final double

KS_SUM_CAUCHY_CRITERION

Convergence criterion for ksSum(double, double, int)

protected static final int

LARGE_SAMPLE_PRODUCT

When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.

protected static final int

MAXIMUM_PARTIAL_SUM_COUNT

Bound on the number of partial sums in ksSum(double, double, int)

protected static final int

MONTE_CARLO_ITERATIONS

Deprecated.

protected static final double

PG_SUM_RELATIVE_ERROR

Convergence criterion for the sums in pelzGood(double, int)

protected static final int

SMALL_SAMPLE_PRODUCT

Deprecated.

Constructor Summary

Constructors

Constructor

Description

KolmogorovSmirnovTest()

Construct a KolmogorovSmirnovTest instance with a default random data generator.

KolmogorovSmirnovTest(RandomGenerator rng)

Deprecated.

Method Summary

Modifier and Type

Method

Description

double

approximateP(double d, int n, int m)

Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

double

bootstrap(double[] x, double[] y, int iterations)

Computes bootstrap(x, y, iterations, true).

double

bootstrap(double[] x, double[] y, int iterations, boolean strict)

Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.

double

cdf(double d, int n)

Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).

double

cdf(double d, int n, boolean exact)

Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).

double

cdfExact(double d, int n)

Calculates P(D_n < d).

double

exactP(double d, int n, int m, boolean strict)

Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

double

kolmogorovSmirnovStatistic(double[] x, double[] y)

Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.

double

kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)

Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.

double

kolmogorovSmirnovTest(double[] x, double[] y)

Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.

double

kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)

Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.

double

kolmogorovSmirnovTest(RealDistribution distribution, double[] data)

Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.

double

kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)

Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.

boolean

kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)

Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.

double

ksSum(double t, double tolerance, int maxIterations)

Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.

double

monteCarloP(double d, int n, int m, boolean strict, int iterations)

Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

double

pelzGood(double d, int n)

Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MAXIMUM_PARTIAL_SUM_COUNT
  
  protected static final int MAXIMUM_PARTIAL_SUM_COUNT
  
  Bound on the number of partial sums in ksSum(double, double, int)
  See Also:
  
  Constant Field Values
- KS_SUM_CAUCHY_CRITERION
  
  protected static final double KS_SUM_CAUCHY_CRITERION
  
  Convergence criterion for ksSum(double, double, int)
  See Also:
  
  Constant Field Values
- PG_SUM_RELATIVE_ERROR
  
  protected static final double PG_SUM_RELATIVE_ERROR
  
  Convergence criterion for the sums in pelzGood(double, int)
  See Also:
  
  Constant Field Values
- SMALL_SAMPLE_PRODUCT
  
  @Deprecated protected static final int SMALL_SAMPLE_PRODUCT
  
  Deprecated.
  
  No longer used.
  See Also:
  
  Constant Field Values
- LARGE_SAMPLE_PRODUCT
  
  protected static final int LARGE_SAMPLE_PRODUCT
  
  When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.
  See Also:
  
  Constant Field Values
- MONTE_CARLO_ITERATIONS
  
  @Deprecated protected static final int MONTE_CARLO_ITERATIONS
  
  Deprecated.
  
  Default number of iterations used by monteCarloP(double, int, int, boolean, int). Deprecated as of version 3.6, as this method is no longer needed.
  See Also:
  
  Constant Field Values
Constructor Details
- KolmogorovSmirnovTest
  
  public KolmogorovSmirnovTest()
  
  Construct a KolmogorovSmirnovTest instance with a default random data generator.
- KolmogorovSmirnovTest
  
  @Deprecated public KolmogorovSmirnovTest(RandomGenerator rng)
  
  Deprecated.
  
  Construct a KolmogorovSmirnovTest with the provided random data generator. The monteCarloP(double, int, int, boolean, int) method, which uses the generator supplied to this constructor, is deprecated as of version 3.6.
  
  Parameters:
  
  rng - random data generator used by monteCarloP(double, int, int, boolean, int)
Method Details
- kolmogorovSmirnovTest
  
  public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
  
  Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
  
  Parameters:
  
  distribution - reference distribution
  
  data - sample being evaluated
  
  exact - whether or not to force exact computation of the p-value
  
  Returns:
  
  the p-value associated with the null hypothesis that data is a sample from distribution
  
  Throws:
  
  InsufficientDataException - if data does not have length at least 2
  
  NullArgumentException - if data is null
- kolmogorovSmirnovStatistic
  
  public double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
  
  Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
  
  Parameters:
  
  distribution - reference distribution
  
  data - sample being evaluated
  
  Returns:
  
  Kolmogorov-Smirnov statistic \(D_n\)
  
  Throws:
  
  InsufficientDataException - if data does not have length at least 2
  
  NullArgumentException - if data is null
- kolmogorovSmirnovTest
  
  public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
  
  Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
  
  For small samples (where the product of the sample sizes is less than 10000), the exact p-value is computed using the method presented in [4], implemented in exactP(double, int, int, boolean).
  
  When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
  
  If x.length * y.length < 10000 and the combined set of values in x and y contains ties, random jitter is added to x and y to break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between values in the combined sample.
  
  If ties are known to be present in the data, bootstrap(double[], double[], int, boolean) may be used as an alternative method for estimating the p-value.
  Parameters:
  
  x - first sample dataset
  
  y - second sample dataset
  
  strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)
  
  Returns:
  
  p-value associated with the null hypothesis that x and y represent samples from the same distribution
  
  Throws:
  
  InsufficientDataException - if either x or y does not have length at least 2
  
  NullArgumentException - if either x or y is null
  
  See Also:
  
  bootstrap(double[], double[], int, boolean)
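
The tie-breaking jitter described above can be sketched in plain Java as follows. This is an illustration rather than the code used by this class: it assumes that minDelta means the smallest strictly positive difference between values in the combined sample, and the class name, method names and the explicit `java.util.Random` argument are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Commons Math.
final class JitterSketch {

    /** Smallest strictly positive gap between values of the combined sample. */
    static double minPositiveDelta(double[] x, double[] y) {
        double[] all = new double[x.length + y.length];
        System.arraycopy(x, 0, all, 0, x.length);
        System.arraycopy(y, 0, all, x.length, y.length);
        Arrays.sort(all);
        double minDelta = Double.POSITIVE_INFINITY;
        for (int i = 1; i < all.length; i++) {
            double delta = all[i] - all[i - 1];
            if (delta > 0 && delta < minDelta) {
                minDelta = delta;
            }
        }
        if (Double.isInfinite(minDelta)) {
            // Assumption of this sketch: a fully constant sample is rejected.
            throw new IllegalArgumentException("all values are equal");
        }
        return minDelta;
    }

    /** Returns a copy of values with jitter drawn from [-minDelta/2, minDelta/2). */
    static double[] jitter(double[] values, double minDelta, Random rng) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] + (rng.nextDouble() - 0.5) * minDelta;
        }
        return out;
    }
}
```

Because every shift is smaller than minDelta / 2 in magnitude, values that were distinct before jittering keep their relative order; only tied values are separated.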
- kolmogorovSmirnovTest
  
  public double kolmogorovSmirnovTest(double[] x, double[] y)
  
  Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
  
  Parameters:
  
  x - first sample dataset
  
  y - second sample dataset
  
  Returns:
  
  p-value associated with the null hypothesis that x and y represent samples from the same distribution
  
  Throws:
  
  InsufficientDataException - if either x or y does not have length at least 2
  
  NullArgumentException - if either x or y is null
- kolmogorovSmirnovStatistic
  
  public double kolmogorovSmirnovStatistic(double[] x, double[] y)
  
  Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
  
  Parameters:
  
  x - first sample
  
  y - second sample
  
  Returns:
  
  test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
  
  Throws:
  
  InsufficientDataException - if either x or y does not have length at least 2
  
  NullArgumentException - if either x or y is null
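
The definition of \(D_{n,m}\) above can be sketched in plain Java by walking the two sorted samples in parallel. This is an illustration only, not the code of this class, and the class name `TwoSampleKsSketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Commons Math.
final class TwoSampleKsSketch {

    /** Returns sup_t |F_n(t) - F_m(t)| for the empirical cdfs of x and y. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0; // number of x values <= current point t
        int j = 0; // number of y values <= current point t
        double d = 0.0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            // Consume every value equal to t in both samples before comparing,
            // so that ties are treated as a single jump point.
            while (i < n && sx[i] == t) {
                i++;
            }
            while (j < m && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        // Once one sample is exhausted its cdf is 1 and the gap can only shrink.
        return d;
    }

    public static void main(String[] args) {
        System.out.println(statistic(new double[] {1, 3}, new double[] {2, 4}));
    }
}
```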
- kolmogorovSmirnovTest
  
  public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
  
  Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
  
  Parameters:
  
  distribution - reference distribution
  
  data - sample being evaluated
  
  Returns:
  
  the p-value associated with the null hypothesis that data is a sample from distribution
  
  Throws:
  
  InsufficientDataException - if data does not have length at least 2
  
  NullArgumentException - if data is null
- kolmogorovSmirnovTest
  
  public boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
  
  Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
  
  Parameters:
  
  distribution - reference distribution
  
  data - sample being evaluated
  
  alpha - significance level of the test
  
  Returns:
  
  true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
  
  Throws:
  
  InsufficientDataException - if data does not have length at least 2
  
  NullArgumentException - if data is null
- bootstrap
  
  public double bootstrap(double[] x, double[] y, int iterations, boolean strict)
  
  Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in [3]:
  
  Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
  
  Parameters:
  
  x - first sample
  
  y - second sample
  
  iterations - number of bootstrap resampling iterations
  
  strict - whether or not the null hypothesis is expressed as a strict inequality
  
  Returns:
  
  estimated p-value
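
The resampling scheme described above can be sketched in plain Java as follows. The sketch follows only the description given here (draw with replacement from the combined sample, recompute the statistic, report the proportion exceeding the observed value); it is not the code of this class, and the class name and the explicit `java.util.Random` argument are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Commons Math.
final class BootstrapKsSketch {

    /** Tie-aware two-sample statistic sup_t |F_n(t) - F_m(t)|. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == t) {
                i++;
            }
            while (j < sy.length && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /** Proportion of resampled statistics that exceed the observed one. */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        double observed = statistic(x, y);
        double[] combined = new double[x.length + y.length];
        System.arraycopy(x, 0, combined, 0, x.length);
        System.arraycopy(y, 0, combined, x.length, y.length);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int exceed = 0;
        for (int it = 0; it < iterations; it++) {
            // Draw both resamples with replacement from the combined sample.
            for (int k = 0; k < bx.length; k++) {
                bx[k] = combined[rng.nextInt(combined.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = combined[rng.nextInt(combined.length)];
            }
            double d = statistic(bx, by);
            if (strict ? d > observed : d >= observed) {
                exceed++;
            }
        }
        return (double) exceed / iterations;
    }
}
```

Resampling with replacement produces ties even when the original data has none, which is why the statistic in this sketch consumes all equal values before comparing the two empirical distributions.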
- bootstrap
  
  public double bootstrap(double[] x, double[] y, int iterations)
  
  Computes bootstrap(x, y, iterations, true). This is equivalent to ks.boot(x,y, nboots=iterations) using the R Matching package function. See bootstrap(double[], double[], int, boolean).
  
  Parameters:
  
  x - first sample
  
  y - second sample
  
  iterations - number of bootstrap resampling iterations
  
  Returns:
  
  estimated p-value
- cdf
  
  public double cdf(double d, int n) throws MathArithmeticException
  
  Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact as with cdfExact(double, int) because calculations are based on double rather than BigFraction.
  
  Parameters:
  
  d - statistic
  
  n - sample size
  
  Returns:
  
  \(P(D_n < d)\)
  
  Throws:
  
  MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
- cdfExact
  
  public double cdfExact(double d, int n) throws MathArithmeticException
  
  Calculates P(D_n < d). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
  
  Parameters:
  
  d - statistic
  
  n - sample size
  
  Returns:
  
  \(P(D_n < d)\)
  
  Throws:
  
  MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
- cdf
  
  public double cdf(double d, int n, boolean exact) throws MathArithmeticException
  
  Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
  
  Parameters:
  
  d - statistic
  
  n - sample size
  
  exact - whether the probability should be calculated exactly, using BigFraction everywhere at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
  
  Returns:
  
  \(P(D_n < d)\)
  
  Throws:
  
  MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
- pelzGood
  
  public double pelzGood(double d, int n)
  
  Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
  
  Parameters:
  
  d - value of d-statistic (x in [2])
  
  n - sample size
  
  Returns:
  
  \(P(D_n < d)\)
  
  Since:
  
  3.4
- ksSum
  
  public double ksSum(double t, double tolerance, int maxIterations)
  
  Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations a TooManyIterationsException is thrown.
  
  Parameters:
  
  t - argument
  
  tolerance - Cauchy criterion for partial sums
  
  maxIterations - maximum number of partial sums to compute
  
  Returns:
  
  Kolmogorov sum evaluated at t
  
  Throws:
  
  TooManyIterationsException - if the series does not converge
- exactP
  
  public double exactP(double d, int n, int m, boolean strict)
  
  Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
  The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
  
  Parameters:
  
  d - D-statistic value
  
  n - first sample size
  
  m - second sample size
  
  strict - whether or not the probability to compute is expressed as a strict inequality
  
  Returns:
  
  probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than (resp. greater than or equal to) d
- approximateP
  
  public double approximateP(double d, int n, int m)
  
  Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
  Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined. This implementation passes ksSum 1.0E-20 as tolerance and 100000 as maxIterations.
  
  Parameters:
  
  d - D-statistic value
  
  n - first sample size
  
  m - second sample size
  
  Returns:
  
  approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
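
The asymptotic formula above, together with the stopping rule described for ksSum, can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code of this class: `IllegalStateException` stands in for TooManyIterationsException, \(k(0)\) is taken to be 0 (the value of the limiting cdf at 0, where the series itself does not converge), and the class name is hypothetical.

```java
// Hypothetical helper, not part of Commons Math.
final class AsymptoticKsSketch {

    /** Evaluates k(t) = 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2). */
    static double ksSum(double t, double tolerance, int maxIterations) {
        if (t == 0.0) {
            return 0.0; // assumption of this sketch: k(0) = 0
        }
        double partialSum = 0.0;
        int sign = -1;
        for (int i = 1; i <= maxIterations; i++) {
            // -2.0 * i is evaluated in double, so i * i cannot overflow.
            double term = Math.exp(-2.0 * i * i * t * t);
            partialSum += sign * term;
            sign = -sign;
            // Successive partial sums differ by exactly this term.
            if (term <= tolerance) {
                return 1.0 + 2.0 * partialSum;
            }
        }
        throw new IllegalStateException("series did not converge in " + maxIterations + " terms");
    }

    /** Approximates P(D_{n,m} > d) by 1 - k(d * sqrt(mn / (m + n))). */
    static double approximateP(double d, int n, int m) {
        double t = d * Math.sqrt(((double) m * n) / ((double) m + n));
        // Tolerance and iteration cap as quoted in the description above.
        return 1.0 - ksSum(t, 1.0e-20, 100000);
    }

    public static void main(String[] args) {
        System.out.println(ksSum(1.0, 1.0e-20, 100000));
        System.out.println(approximateP(0.1, 200, 300));
    }
}
```

The terms decay like \(e^{-2 i^2 t^2}\), so for moderate t only a handful are needed; for t close to 0 the terms stay near 1 for many iterations and the iteration cap becomes relevant.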
- monteCarloP
  
  public double monteCarloP(double d, int n, int m, boolean strict, int iterations)
  
  Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
  The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.
  
  Parameters:
  
  d - D-statistic value
  
  n - first sample size
  
  m - second sample size
  
  strict - whether or not the probability to compute is expressed as a strict inequality
  
  iterations - number of random partitions to generate
  
  Returns:
  
  proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than (resp. greater than or equal to) d
