BACON: blocked adaptive computationally efficient outlier nominators

https://doi.org/10.1016/S0167-9473(99)00101-2

Abstract

Although it is customary to assume that data are homogeneous, in fact they often contain outliers or subgroups. Methods for identifying multiple outliers and subgroups must measure how extraordinary each data point is against a metric that is not itself contaminated by the inhomogeneities. In the past, all multiple outlier detection methods have suffered from a computational cost that escalates rapidly with the sample size, which makes them unattractive for precisely the samples large enough to support sophisticated methods. We propose a new general approach, based on the methods of Hadi (1992a, 1994) and Hadi and Simonoff (1993), that can be computed quickly, often requiring fewer than five evaluations of the model being fit to the data, regardless of the sample size. Two cases of this approach are presented in this paper: algorithms for the detection of outliers in multivariate data and in regression data. The algorithms, however, can be applied more broadly than to these two cases. We show that the proposed methods match the performance of more computationally expensive methods on standard test problems and demonstrate their superior performance on large simulated challenges.

Introduction

Data often contain outliers. Most statistics methods assume homogeneous data in which all data points satisfy the same model. However, as the aphorism above illustrates, scientists and philosophers have recognized for at least 380 years that real data are not homogeneous and that the identification of outliers is an important step in the progress of scientific understanding.

Robust methods relax the homogeneity assumption, but they have not been widely adopted: partly because they hide the identification of outliers within the black box of the estimation method, and mainly because they are often computationally infeasible for moderate to large data sets. Several books have been devoted either entirely or in large part to robust methods; see, for example, Huber (1981), Hampel et al. (1986), Rousseeuw and Leroy (1987), and Staudte and Sheather (1990).

Outlier detection methods provide the analyst with a set of proposed outliers. These can then be corrected (if identifiable errors are the cause) or separated from the body of the data for separate analysis. The remaining data then more nearly satisfy homogeneity assumptions and can be safely analyzed with standard methods. There is a large literature on outlier detection; see, for example, the books by Hawkins (1980), Belsley et al. (1980), Cook and Weisberg (1982), Atkinson (1985), Chatterjee and Hadi (1988), and Barnett and Lewis (1994), and the articles by Gray and Ling (1984), Gray (1986), Kianifard and Swallow (1989), Rousseeuw and van Zomeren (1990), Paul and Fung (1991), Simonoff (1991), Hadi (1992b), Hadi and Simonoff (1993, 1994), Atkinson (1994), Woodruff and Rocke (1994), Rocke and Woodruff (1996), Barrett and Gray (1997), and Mayo and Gray (1997).

A good outlier detection method defines a robust method that works simply by omitting identified outliers and computing a standard nonrobust measure on the remaining points. Conversely, each robust method defines an outlier detection method by looking at the deviation from the robust fit (robust residuals or robust distances). Often outlier detection and robust estimation are discussed together, as we do here.

Although the detection of a single outlier is now relatively standard, the more realistic situation in which there may be multiple outliers poses greater challenges. Indeed, a number of leading researchers have opined that outlier detection is inherently computationally expensive.

Outlier detection requires a metric with which to measure the “outlyingness” of a data point. Typically, the metric arises from some model for the data (for example, a center or a fitted equation) and some measure of discrepancy from that model. Multiple outliers raise the possibility that the metric itself may be contaminated by unidentified outliers. The breakdown point of an estimator is commonly defined as the smallest fraction of the data whose arbitrary modification can carry the estimator beyond all bounds (Donoho and Huber, 1983). Contamination of the outlier metric breaks down an outlier detector and, of course, any robust estimator based on that outlier detector. Attempts in the literature to solve this problem are summarized in Section 2.
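To make the contamination problem concrete, the following sketch (ours, not from the paper; the dimensions, contamination fraction, and shift are illustrative) shows how a cluster of outliers can inflate the classical mean and covariance enough that the outliers' own Mahalanobis distances no longer look extreme, while the same distances computed from the clean points alone flag them immediately.

```python
# Illustration (ours, not from the paper): a cluster of outliers inflates the
# classical mean and covariance so much that the outliers' own Mahalanobis
# distances no longer look extreme (the "masking" that breaks the metric).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p, n_clean, n_out = 2, 80, 20
clean = rng.multivariate_normal(np.zeros(p), np.eye(p), n_clean)
bad = rng.multivariate_normal(np.full(p, 5.0), 0.1 * np.eye(p), n_out)
X = np.vstack([clean, bad])

def mahalanobis(X, center, cov):
    diff = X - center
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff))

cutoff = np.sqrt(chi2.ppf(0.975, df=p))          # usual chi-square yardstick
d_all = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
d_clean = mahalanobis(X, clean.mean(axis=0), np.cov(clean, rowvar=False))

print("outliers flagged, contaminated metric:", int(np.sum(d_all[n_clean:] > cutoff)))
print("outliers flagged, clean metric:       ", int(np.sum(d_clean[n_clean:] > cutoff)))
```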

Section snippets

Optimality, breakdown, equivariance, and cost of outlier detection

Suppose that the data set at hand consists of n observations on p variables and contains k<n/2 outliers. In practice, the number k and the outliers themselves are usually unknown. One method for the detection of these outliers is the brute force search. This method checks all possible subsets of size k=1,…,n/2 and for each subset determines whether the subset is outlying relative to the remaining observations in the data. The number of all possible subsets, ∑_{k=1}^{n/2} C(n, k), is so huge that brute
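The growth of this count is easy to check numerically; a small script (ours) suffices:

```python
# Quick numeric check (not from the paper) of how fast the brute-force
# subset count sum_{k=1}^{n/2} C(n, k) grows with the sample size n.
from math import comb

for n in (20, 50, 100):
    total = sum(comb(n, k) for k in range(1, n // 2 + 1))
    print(f"n = {n:3d}: {total:.3e} candidate subsets")
```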

The general BACON algorithm

To obtain computationally efficient robust point estimators and multiple outlier detection methods, we propose to abandon optimality conditions and work with iterative estimates. Experiments and experience have shown that the results of the iteration are relatively insensitive to the starting point. Nevertheless, a robust starting point offers greater assurance of high breakdown and, in simulation trials, yields a breakdown point in excess of 40%. However, the robust starting point is not affine
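In outline, the iteration has the following shape. This is a schematic sketch of our own; the fit, discrepancy, and cutoff functions are placeholders, since the paper's Algorithm 1 specifies them separately for the multivariate and regression settings.

```python
# Schematic of the general iteration: grow a "basic subset" of presumed-clean
# observations until it stops changing; everything left outside is nominated.
import numpy as np

def bacon_outline(X, initial_subset, fit, discrepancy, cutoff, max_iter=20):
    n, p = X.shape
    basic = np.asarray(initial_subset)
    for _ in range(max_iter):
        model = fit(X[basic])                  # fit using only the current basic subset
        d = discrepancy(X, model)              # discrepancy of every observation from that fit
        new_basic = np.flatnonzero(d < cutoff(n, p, basic.size))
        if np.array_equal(new_basic, basic):   # basic subset stable: stop iterating
            break
        basic = new_basic
    outliers = np.setdiff1d(np.arange(n), basic)
    return basic, outliers
```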

BACON algorithm for multivariate data

Given a matrix X of n rows (observations) and of p columns (variables), Step 1 of Algorithm 1 requires finding an initial basic subset of size m>p. This subset can either be specified by the data analyst or obtained by an algorithm. The analyst may have reasons to believe that a certain subset of observations is “clean”. In this case, the number m and/or the observations themselves can be chosen by the analyst. There is some tension between the assurance that a small initial basic subset will
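One algorithmic start, sketched here under our reading of the method (the multiplier c = 4 and the use of Euclidean distances from the coordinate-wise medians are illustrative choices, not the paper's exact prescription), simply takes the m observations closest to a robust center:

```python
# Hypothetical Step 1 sketch: pick the m = c*p observations nearest the
# coordinate-wise medians as the initial basic subset (a robust, though not
# affine-equivariant, starting point).
import numpy as np

def initial_basic_subset(X, c=4):
    n, p = X.shape
    m = min(c * p, n)                        # initial basic subset size, m > p
    med = np.median(X, axis=0)               # coordinate-wise medians
    dist = np.linalg.norm(X - med, axis=1)   # distance of each observation from the medians
    return np.argsort(dist)[:m]              # indices of the m closest observations
```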

BACON algorithm for regression data

Consider the standard linear model y=Xβ+ε, where y is an n-vector of responses, X is an n×p matrix representing p explanatory variables with rank p<n, β is a p-vector of unknown parameters, and ε is an n-vector of random disturbances (errors) whose conditional mean and variance are given by E(ε|X)=0 and Var(ε|X)=σ2In, where σ2 is an unknown parameter and In is the identity matrix of order n.

The least-squares estimates of β and σ2 are given by β̂=(XTX)−1XTy and the residual mean square σ̂2=(y−Xβ̂)T(y−Xβ̂)/(n−p), respectively.
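For reference, both estimates can be computed directly; this is standard least squares, nothing specific to BACON:

```python
# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T y and the residual
# mean square sigma2_hat = ||y - X beta_hat||^2 / (n - p).
import numpy as np

def ols(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    n, p = X.shape
    sigma2_hat = resid @ resid / (n - p)
    return beta_hat, sigma2_hat
```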

Assumptions and the role of the data analyst

All outlier nomination and robust methods must assume some simple structure for the non-outlying points — otherwise one cannot know what it means for an observation to be discrepant. The BACON algorithms assume that the model used to define the basic subsets is a good description of the non-outlying data. In the regression version, there must in fact be an underlying linear model for the non-outlying data. In the multivariate outlier nominator, the non-outlying data should be roughly

Simulations

Hadi (1994) showed that his method matched or surpassed the performance of other published methods. Therefore, we performed simulation experiments to (a) compare the BACON method with Hadi's (1994) method (H94) with regard to both performance and computational expense and (b) assess the performance of the BACON method for large data sets. The experiment considers outlier detection in multivariate data.
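The paper details the exact simulation design; as a rough stand-in (the contamination fraction, shift, and dimensions below are our illustrative choices, not the paper's), contaminated multivariate data of the kind such an experiment needs can be generated as follows:

```python
# Illustrative data generator: clean observations from N(0, I_p) plus a
# fraction of mean-shifted outliers, with labels for checking a nominator.
import numpy as np

def contaminated_sample(n=10_000, p=5, frac_out=0.1, shift=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n_out = int(frac_out * n)
    clean = rng.multivariate_normal(np.zeros(p), np.eye(p), n - n_out)
    bad = rng.multivariate_normal(np.full(p, shift), np.eye(p), n_out)
    X = np.vstack([clean, bad])
    is_outlier = np.r_[np.zeros(n - n_out, bool), np.ones(n_out, bool)]
    return X, is_outlier
```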

The H94 method is computationally expensive for large data sets. Therefore for comparison

Examples

We illustrate the computational efficiency of the proposed methods using two data sets: the Wood Gravity data and the Philips data. Rousseeuw and Leroy (1987) use the Wood Gravity data (originally given by Draper and Smith, 1966) to illustrate the performance of LMS (least median of squares). The data consist of 20 observations on six variables. Observations 4, 6, and 19 are known to be outliers. Cook and Hawkins (1990) apply MVE (minimum volume ellipsoid) with sampling to these data and report that they needed over 57,000 samples to find

Large data sets

Remarkably, the computing cost of the BACON algorithms for multivariate outliers and for regression is low. The major costs are the computing of a covariance matrix and the computing of the distances themselves. Because the number of iterations is small, none of these costs grows out of bounds. In practical terms, current desktop computers can find Mahalanobis distances for a million cases in about ten seconds. It is thus practical to apply BACON algorithms to data sets of millions of cases on
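That claim is easy to check in outline: with a vectorized implementation, the distances require one p×p covariance matrix, one linear solve, and a single pass over the n cases. The sketch below is ours; actual timings depend on the hardware (the ten-second figure refers to desktop machines of the time).

```python
# Mahalanobis distances for n cases in p dimensions: O(n p^2) work after
# the covariance matrix, so n in the millions stays cheap for small p.
import numpy as np

def mahalanobis_distances(X):
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - center
    z = np.linalg.solve(cov, diff.T).T      # solve instead of inverting cov
    return np.sqrt(np.einsum("ij,ij->i", diff, z))

X = np.random.default_rng(1).normal(size=(1_000_000, 5))
d = mahalanobis_distances(X)
print(d.shape, float(d.max()))
```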

Summary and recommendations

Outlier detection methods have suffered in the past from a lack of generality and a computational cost that escalated rapidly with the sample size. Small samples provide too small a base for reliable detection of multiple outliers, so suitable graphics are often the detection method of choice. Samples of a size sufficient to support sophisticated methods rapidly grow too large for previously published outlier detection methods to be practical. The BACON algorithms given here reliably detect

References (47)

  • Hadi, A.S., 1992. A new measure of overall potential influence in linear regression. Computational Statistics and Data Analysis.
  • Atkinson, A.C., 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic...
  • Atkinson, A.C., 1994. Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association.
  • Bacon, F., 1620. Urbach, P., Gibson, J. (Translators, Eds.), Novum Organum. Open Court Publishing Co, Chicago,...
  • Barrett, B.E., Gray, J.B., 1997. On the use of robust diagnostics in least squares regression analysis. Proceedings of...
  • Barnett, V., Lewis, T., 1994. Outliers in Statistical Data. Wiley, New...
  • Belsley, D.A., Kuh, E., Welsch, R.E., 1980. Regression Diagnostics: Identifying Influential Data and Sources of...
  • Chatterjee, S., Hadi, A.S., 1988. Sensitivity Analysis in Linear Regression.
  • Cook, R.D., Hawkins, D.M., 1990. Comment on unmasking multivariate outliers and leverage points. Journal of the...
  • Cook, R.D., Weisberg, S., 1982. Residuals and Influence in Regression. Chapman & Hall,...
  • Donoho, D.L., 1982. Breakdown properties of multivariate location estimators. Qualifying Paper. Harvard University,...
  • Donoho, D.L., Huber, P.J., 1983. The notion of breakdown point. In: Bickel, P., Doksum, K., Hodges, J.L. Jr. (Eds.), a...
  • Draper, N., Smith, H., 1966. Applied Regression Analysis. John Wiley and Sons, New...
  • Friedman, J.H., Stuetzle, W., 1981. Projection pursuit regression. Journal of the American Statistical Association 76,...
  • Glymour, C., Madigan, D., Pregibon, D., Smyth, P., 1997. Statistical themes and lessons for data mining. Data Mining...
  • Gould, W., et al., 1993. Identifying multivariate outliers. Stata Technical Bulletin.
  • Gray, J.B., 1986. A simple graphic for assessing influence in regression. Journal of Statistical Computation and Simulation.
  • Gray, J.B., Ling, R.F., 1984. K-clustering as a detection tool for influential subsets in regression. Technometrics.
  • Hadi, A.S., 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B.
  • Hadi, A.S., 1994. A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B.
  • Hadi, A.S., Simonoff, J.S., 1993. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association.
  • Hadi, A.S., Simonoff, J.S., 1994. Improving the estimation and outlier identification properties of the least median of squares and minimum volume ellipsoid estimators. Parisankhyan Sammikkha.
  • Hadi, A.S., Simonoff, J.S., 1997. A more robust outlier identifier for regression data. Bulletin of the International...