GNU Scientific Library – Reference Manual: Robust linear regression

38.5 Robust linear regression

Ordinary least squares (OLS) models are often heavily influenced by the presence of outliers. Outliers are data points which do not follow the general trend of the other observations, although there is strictly no precise definition of an outlier. Robust linear regression refers to regression algorithms which are robust to outliers. The most common type of robust regression is M-estimation. The general M-estimator minimizes the objective function

\sum_i \rho(e_i) = \sum_i \rho (y_i - Y(c, x_i))

where e_i = y_i - Y(c, x_i) is the residual of the ith data point, and \rho(e_i) is a function which should have the following properties:

\rho(e) \ge 0
\rho(0) = 0
\rho(-e) = \rho(e)
\rho(e_1) > \rho(e_2) for |e_1| > |e_2|

The special case of ordinary least squares is given by \rho(e_i) = e_i^2. Letting \psi = \rho' be the derivative of \rho, differentiating the objective function with respect to the coefficients c and setting the partial derivatives to zero produces the system of equations

\sum_i \psi(e_i) X_i = 0

where X_i is a vector containing row i of the design matrix X. Next, we define a weight function w(e) = \psi(e)/e, and let w_i = w(e_i):

\sum_i w_i e_i X_i = 0

This system of equations is equivalent to solving a weighted ordinary least squares problem, minimizing \chi^2 = \sum_i w_i e_i^2. The weights however, depend on the residuals e_i, which depend on the coefficients c, which depend on the weights. Therefore, an iterative solution is used, called Iteratively Reweighted Least Squares (IRLS).

Compute initial estimates of the coefficients c^{(0)} using ordinary least squares
For iteration k, form the residuals e_i^{(k)} = (y_i - X_i c^{(k-1)})/(t \sigma^{(k)} \sqrt{1 - h_i}), where t is a tuning constant depending on the choice of \psi, and h_i are the statistical leverages (diagonal elements of the matrix X (X^T X)^{-1} X^T). Including t and h_i in the residual calculation has been shown to improve the convergence of the method. The residual standard deviation is approximated as \sigma^{(k)} = MAD / 0.6745, where MAD is the Median-Absolute-Deviation of the n-p largest residuals from the previous iteration.
Compute new weights w_i^{(k)} = \psi(e_i^{(k)})/e_i^{(k)}.
Compute new coefficients c^{(k)} by solving the weighted least squares problem with weights w_i^{(k)}.
Steps 2 through 4 are iterated until the coefficients converge or until some maximum iteration limit is reached. Coefficients are tested for convergence using the critera:
```
|c_i^(k) - c_i^(k-1)| \le \epsilon \times max(|c_i^(k)|, |c_i^(k-1)|)
```
for all 0 \le i < p where \epsilon is a small tolerance factor.

The key to this method lies in selecting the function \psi(e_i) to assign smaller weights to large residuals, and larger weights to smaller residuals. As the iteration proceeds, outliers are assigned smaller and smaller weights, eventually having very little or no effect on the fitted model.

Function: gsl_multifit_robust_workspace * gsl_multifit_robust_alloc (const gsl_multifit_robust_type * T, const size_t n, const size_t p)

This function allocates a workspace for fitting a model to n observations using p parameters. The size of the workspace is O(np + p^2). The type T specifies the function \psi and can be selected from the following choices.

Robust type: gsl_multifit_robust_default: This specifies the gsl_multifit_robust_bisquare type (see below) and is a good general purpose choice for robust regression.

Robust type: gsl_multifit_robust_bisquare

This is Tukey’s biweight (bisquare) function and is a good general purpose choice for robust regression. The weight function is given by

w(e) = (1 - e^2)^2

and the default tuning constant is t = 4.685.

Robust type: gsl_multifit_robust_cauchy

This is Cauchy’s function, also known as the Lorentzian function. This function does not guarantee a unique solution, meaning different choices of the coefficient vector c could minimize the objective function. Therefore this option should be used with care. The weight function is given by

w(e) = 1 / (1 + e^2)

and the default tuning constant is t = 2.385.

Robust type: gsl_multifit_robust_fair

This is the fair \rho function, which guarantees a unique solution and has continuous derivatives to three orders. The weight function is given by

w(e) = 1 / (1 + |e|)

and the default tuning constant is t = 1.400.

Robust type: gsl_multifit_robust_huber

This specifies Huber’s \rho function, which is a parabola in the vicinity of zero and increases linearly for a given threshold |e| > t. This function is also considered an excellent general purpose robust estimator, however, occasional difficulties can be encountered due to the discontinuous first derivative of the \psi function. The weight function is given by

w(e) = 1/max(1,|e|)

and the default tuning constant is t = 1.345.

Robust type: gsl_multifit_robust_ols

This specifies the ordinary least squares solution, which can be useful for quickly checking the difference between the various robust and OLS solutions. The weight function is given by

w(e) = 1

and the default tuning constant is t = 1.

Robust type: gsl_multifit_robust_welsch

This specifies the Welsch function which can perform well in cases where the residuals have an exponential distribution. The weight function is given by

w(e) = \exp(-e^2)

and the default tuning constant is t = 2.985.

Function: void gsl_multifit_robust_free (gsl_multifit_robust_workspace * w): This function frees the memory associated with the workspace w.

Function: const char * gsl_multifit_robust_name (const gsl_multifit_robust_workspace * w): This function returns the name of the robust type T specified to gsl_multifit_robust_alloc.

Function: int gsl_multifit_robust_tune (const double tune, gsl_multifit_robust_workspace * w): This function sets the tuning constant t used to adjust the residuals at each iteration to tune. Decreasing the tuning constant increases the downweight assigned to large residuals, while increasing the tuning constant decreases the downweight assigned to large residuals.

Function: int gsl_multifit_robust_maxiter (const size_t maxiter, gsl_multifit_robust_workspace * w): This function sets the maximum number of iterations in the iteratively reweighted least squares algorithm to maxiter. By default, this value is set to 100 by gsl_multifit_robust_alloc.

Function: int gsl_multifit_robust_weights (const gsl_vector * r, gsl_vector * wts, gsl_multifit_robust_workspace * w): This function assigns weights to the vector wts using the residual vector r and previously specified weighting function. The output weights are given by wts_i = w(r_i / (t \sigma)), where the weighting functions w are detailed in gsl_multifit_robust_alloc. \sigma is an estimate of the residual standard deviation based on the Median-Absolute-Deviation and t is the tuning constant. This function is useful if the user wishes to implement their own robust regression rather than using the supplied gsl_multifit_robust routine below.

Function: int gsl_multifit_robust (const gsl_matrix * X, const gsl_vector * y, gsl_vector * c, gsl_matrix * cov, gsl_multifit_robust_workspace * w)

This function computes the best-fit parameters c of the model y = X c for the observations y and the matrix of predictor variables X, attemping to reduce the influence of outliers using the algorithm outlined above. The p-by-p variance-covariance matrix of the model parameters cov is estimated as \sigma^2 (X^T X)^{-1}, where \sigma is an approximation of the residual standard deviation using the theory of robust regression. Special care must be taken when estimating \sigma and other statistics such as R^2, and so these are computed internally and are available by calling the function gsl_multifit_robust_statistics.

If the coefficients do not converge within the maximum iteration limit, the function returns GSL_EMAXITER. In this case, the current estimates of the coefficients and covariance matrix are returned in c and cov and the internal fit statistics are computed with these estimates.

Function: int gsl_multifit_robust_est (const gsl_vector * x, const gsl_vector * c, const gsl_matrix * cov, double * y, double * y_err): This function uses the best-fit robust regression coefficients c and their covariance matrix cov to compute the fitted function value y and its standard deviation y_err for the model y = x.c at the point x.

Function: int gsl_multifit_robust_residuals (const gsl_matrix * X, const gsl_vector * y, const gsl_vector * c, gsl_vector * r, gsl_multifit_robust_workspace * w): This function computes the vector of studentized residuals r_i = {y_i - (X c)_i \over \sigma \sqrt{1 - h_i}} for the observations y, coefficients c and matrix of predictor variables X. The routine gsl_multifit_robust must first be called to compute the statisical leverages h_i of the matrix X and residual standard deviation estimate \sigma.

Function: gsl_multifit_robust_stats gsl_multifit_robust_statistics (const gsl_multifit_robust_workspace * w)

This function returns a structure containing relevant statistics from a robust regression. The function gsl_multifit_robust must be called first to perform the regression and calculate these statistics. The returned gsl_multifit_robust_stats structure contains the following fields.

double sigma_ols This contains the standard deviation of the residuals as computed from ordinary least squares (OLS).
double sigma_mad This contains an estimate of the standard deviation of the final residuals using the Median-Absolute-Deviation statistic
double sigma_rob This contains an estimate of the standard deviation of the final residuals from the theory of robust regression (see Street et al, 1988).
double sigma This contains an estimate of the standard deviation of the final residuals by attemping to reconcile sigma_rob and sigma_ols in a reasonable way.
double Rsq This contains the R^2 coefficient of determination statistic using the estimate sigma.
double adj_Rsq This contains the adjusted R^2 coefficient of determination statistic using the estimate sigma.
double rmse This contains the root mean squared error of the final residuals
double sse This contains the residual sum of squares taking into account the robust covariance matrix.
size_t dof This contains the number of degrees of freedom n - p
size_t numit Upon successful convergence, this contains the number of iterations performed
gsl_vector * weights This contains the final weight vector of length n
gsl_vector * r This contains the final residual vector of length n, r = y - X c