Next: Large Dense Linear Systems, Previous: Regularized regression, Up: Least-Squares Fitting [Index]
Ordinary least squares (OLS) models are often heavily influenced by the presence of outliers. Outliers are data points which do not follow the general trend of the other observations, although there is strictly no precise definition of an outlier. Robust linear regression refers to regression algorithms which are robust to outliers. The most common type of robust regression is M-estimation. The general M-estimator minimizes the objective function
\sum_i \rho(e_i) = \sum_i \rho (y_i - Y(c, x_i))
where e_i = y_i - Y(c, x_i) is the residual of the ith data point, and \rho(e_i) is a function which should have the following properties:
The special case of ordinary least squares is given by \rho(e_i) = e_i^2. Letting \psi = \rho' be the derivative of \rho, differentiating the objective function with respect to the coefficients c and setting the partial derivatives to zero produces the system of equations
\sum_i \psi(e_i) X_i = 0
where X_i is a vector containing row i of the design matrix X. Next, we define a weight function w(e) = \psi(e)/e, and let w_i = w(e_i):
\sum_i w_i e_i X_i = 0
This system of equations is equivalent to solving a weighted ordinary least squares problem, minimizing \chi^2 = \sum_i w_i e_i^2. The weights however, depend on the residuals e_i, which depend on the coefficients c, which depend on the weights. Therefore, an iterative solution is used, called Iteratively Reweighted Least Squares (IRLS).
|c_i^(k) - c_i^(k-1)| \le \epsilon \times max(|c_i^(k)|, |c_i^(k-1)|)
for all 0 \le i < p where \epsilon is a small tolerance factor.
The key to this method lies in selecting the function \psi(e_i) to assign smaller weights to large residuals, and larger weights to smaller residuals. As the iteration proceeds, outliers are assigned smaller and smaller weights, eventually having very little or no effect on the fitted model.
This function allocates a workspace for fitting a model to n observations using p parameters. The size of the workspace is O(np + p^2). The type T specifies the function \psi and can be selected from the following choices.
This specifies the gsl_multifit_robust_bisquare
type (see below) and is a good
general purpose choice for robust regression.
This is Tukey’s biweight (bisquare) function and is a good general purpose choice for robust regression. The weight function is given by
w(e) = (1 - e^2)^2
and the default tuning constant is t = 4.685.
This is Cauchy’s function, also known as the Lorentzian function. This function does not guarantee a unique solution, meaning different choices of the coefficient vector c could minimize the objective function. Therefore this option should be used with care. The weight function is given by
w(e) = 1 / (1 + e^2)
and the default tuning constant is t = 2.385.
This is the fair \rho function, which guarantees a unique solution and has continuous derivatives to three orders. The weight function is given by
w(e) = 1 / (1 + |e|)
and the default tuning constant is t = 1.400.
This specifies Huber’s \rho function, which is a parabola in the vicinity of zero and increases linearly for a given threshold |e| > t. This function is also considered an excellent general purpose robust estimator, however, occasional difficulties can be encountered due to the discontinuous first derivative of the \psi function. The weight function is given by
w(e) = 1/max(1,|e|)
and the default tuning constant is t = 1.345.
This specifies the ordinary least squares solution, which can be useful for quickly checking the difference between the various robust and OLS solutions. The weight function is given by
w(e) = 1
and the default tuning constant is t = 1.
This specifies the Welsch function which can perform well in cases where the residuals have an exponential distribution. The weight function is given by
w(e) = \exp(-e^2)
and the default tuning constant is t = 2.985.
This function frees the memory associated with the workspace w.
This function returns the name of the robust type T specified to gsl_multifit_robust_alloc
.
This function sets the tuning constant t used to adjust the residuals at each iteration to tune. Decreasing the tuning constant increases the downweight assigned to large residuals, while increasing the tuning constant decreases the downweight assigned to large residuals.
This function sets the maximum number of iterations in the iteratively
reweighted least squares algorithm to maxiter. By default,
this value is set to 100 by gsl_multifit_robust_alloc
.
This function assigns weights to the vector wts using the residual vector r and
previously specified weighting function. The output weights are given by wts_i = w(r_i / (t \sigma)),
where the weighting functions w are detailed in gsl_multifit_robust_alloc
. \sigma
is an estimate of the residual standard deviation based on the Median-Absolute-Deviation and t
is the tuning constant. This
function is useful if the user wishes to implement their own robust regression rather than using
the supplied gsl_multifit_robust
routine below.
This function computes the best-fit parameters c of the model
y = X c for the observations y and the matrix of
predictor variables X, attemping to reduce the influence
of outliers using the algorithm outlined above.
The p-by-p variance-covariance matrix of the model parameters
cov is estimated as \sigma^2 (X^T X)^{-1}, where \sigma is
an approximation of the residual standard deviation using the theory of robust
regression. Special care must be taken when estimating \sigma and
other statistics such as R^2, and so these
are computed internally and are available by calling the function
gsl_multifit_robust_statistics
.
If the coefficients do not converge within the maximum iteration
limit, the function returns GSL_EMAXITER
. In this case,
the current estimates of the coefficients and covariance matrix
are returned in c and cov and the internal fit statistics
are computed with these estimates.
This function uses the best-fit robust regression coefficients c and their covariance matrix cov to compute the fitted function value y and its standard deviation y_err for the model y = x.c at the point x.
This function computes the vector of studentized residuals
r_i = {y_i - (X c)_i \over \sigma \sqrt{1 - h_i}} for
the observations y, coefficients c and matrix of predictor
variables X. The routine gsl_multifit_robust
must
first be called to compute the statisical leverages h_i of
the matrix X and residual standard deviation estimate \sigma.
This function returns a structure containing relevant statistics from a robust regression. The function
gsl_multifit_robust
must be called first to perform the regression and calculate these statistics.
The returned gsl_multifit_robust_stats
structure contains the following fields.
sigma_ols
This contains the standard deviation of the residuals as computed from ordinary least squares (OLS).
sigma_mad
This contains an estimate of the standard deviation of the final residuals using the Median-Absolute-Deviation statistic
sigma_rob
This contains an estimate of the standard deviation of the final residuals from the theory of robust regression (see Street et al, 1988).
sigma
This contains an estimate of the standard deviation of the final residuals by attemping to reconcile sigma_rob
and sigma_ols
in a reasonable way.
Rsq
This contains the R^2 coefficient of determination statistic using the estimate sigma
.
adj_Rsq
This contains the adjusted R^2 coefficient of determination statistic using the estimate sigma
.
rmse
This contains the root mean squared error of the final residuals
sse
This contains the residual sum of squares taking into account the robust covariance matrix.
dof
This contains the number of degrees of freedom n - p
numit
Upon successful convergence, this contains the number of iterations performed
weights
This contains the final weight vector of length n
r
This contains the final residual vector of length n, r = y - X c
Next: Large Dense Linear Systems, Previous: Regularized regression, Up: Least-Squares Fitting [Index]