BACON: blocked adaptive computationally efficient outlier nominators
Introduction
Data often contain outliers. Most statistical methods assume homogeneous data in which all data points satisfy the same model. However, scientists and philosophers have recognized for at least 380 years that real data are not homogeneous and that the identification of outliers is an important step in the progress of scientific understanding.
Robust methods relax the homogeneity assumption, but they have not been widely adopted, partly because they hide the identification of outliers within the black box of the estimation method, but mainly because they are often computationally infeasible for moderate to large data sets. Several books have been devoted either entirely or in large part to robust methods; see, for example, Huber (1981), Hampel et al. (1986), Rousseeuw and Leroy (1987), and Staudte and Sheather (1990).
Outlier detection methods provide the analyst with a set of proposed outliers. These can then be corrected (if identifiable errors are the cause) or separated from the body of the data for separate analysis. The remaining data then more nearly satisfy homogeneity assumptions and can be safely analyzed with standard methods. There is a large literature on outlier detection; see, for example, the books by Hawkins (1980), Belsley et al. (1980), Cook and Weisberg (1982), Atkinson (1985), Chatterjee and Hadi (1988), and Barnett and Lewis (1994), and the articles by Gray and Ling (1984), Gray (1986), Kianifard and Swallow (1989), Rousseeuw and van Zomeren (1990), Paul and Fung (1991), Simonoff (1991), Hadi (1992b), Hadi and Simonoff (1993, 1994), Atkinson (1994), Woodruff and Rocke (1994), Rocke and Woodruff (1996), Barrett and Gray (1997), and Mayo and Gray (1997).
A good outlier detection method defines a robust method that works simply by omitting identified outliers and computing a standard nonrobust measure on the remaining points. Conversely, each robust method defines an outlier detection method by looking at the deviation from the robust fit (robust residuals or robust distances). Often outlier detection and robust estimation are discussed together, as we do here.
Although the detection of a single outlier is now relatively standard, the more realistic situation in which there may be multiple outliers poses greater challenges. Indeed, a number of leading researchers have opined that outlier detection is inherently computationally expensive.
Outlier detection requires a metric with which to measure the “outlyingness” of a data point. Typically, the metric arises from some model for the data (for example, a center or a fitted equation) and some measure of discrepancy from that model. Multiple outliers raise the possibility that the metric itself may be contaminated by an unidentified outlier. The breakdown point of an estimator is commonly defined as the smallest fraction of the data whose arbitrary modification can carry the estimator beyond all bounds (Donoho and Huber, 1983). Contamination of the outlier metric breaks down an outlier detector and, of course, any robust estimator based on that outlier detector. Attempts in the literature to solve this problem are summarized in Section 2.
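The breakdown-point idea can be made concrete with a small numeric illustration (ours, not from the paper): a single corrupted value carries the sample mean arbitrarily far, while the median barely moves.

```python
import numpy as np

# A single corrupted value can carry the sample mean beyond any bound,
# while the median stays put: the mean has breakdown point 1/n -> 0,
# the median roughly 50%.
rng = np.random.default_rng(0)
x = rng.normal(size=99)  # 99 "clean" standard-normal points

for bad in (1e3, 1e6, 1e9):
    xc = np.append(x, bad)  # contaminate one point out of 100
    print(f"outlier={bad:>9.0e}  mean={xc.mean():14.2f}  median={np.median(xc):6.3f}")
```

The printed means grow without bound as the contaminating value grows; the median remains near zero throughout.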
Optimality, breakdown, equivariance, and cost of outlier detection
Suppose that the data set at hand consists of n observations on p variables and contains k < n/2 outliers. In practice, the number k and the outliers themselves are usually unknown. One method for the detection of these outliers is the brute-force search. This method checks all possible subsets of size k = 1, …, n/2 and for each subset determines whether the subset is outlying relative to the remaining observations in the data. The number of all possible subsets, ∑_{k=1}^{n/2} (n choose k), is so huge that brute-force search is computationally impractical for all but the smallest data sets.
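The growth of this search space is easy to check directly. A short sketch (the function name is ours):

```python
from math import comb

def brute_force_subsets(n: int) -> int:
    """Number of candidate outlier subsets the brute-force search must
    examine: all subsets of size k = 1, ..., n//2."""
    return sum(comb(n, k) for k in range(1, n // 2 + 1))

for n in (10, 20, 40, 100):
    print(n, brute_force_subsets(n))
```

Already at n = 40 the count exceeds 10^11, and at n = 100 it is astronomically large, which is why the paper abandons exhaustive search in favor of iteration.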
The general BACON algorithm
To obtain computationally efficient robust point estimators and multiple outlier detection methods, we propose to abandon optimality conditions and work with iterative estimates. Experiments and experience have shown that the results of the iteration are relatively insensitive to the starting point. Nevertheless, a robust starting point offers greater assurance of high breakdown and, in simulation trials, a breakdown point in excess of 40%. However, the robust starting point is not affine equivariant.
BACON algorithm for multivariate data
Given a data matrix of n rows (observations) and p columns (variables), Step 1 of Algorithm 1 requires finding an initial basic subset of size m > p. This subset can either be specified by the data analyst or obtained by an algorithm. The analyst may have reasons to believe that a certain subset of observations is “clean”. In this case, the number m and/or the observations themselves can be chosen by the analyst. There is some tension between the assurance that a small initial basic subset will
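The multivariate iteration can be sketched in a few lines. This is a simplified illustration under our own assumptions: the start uses distances from the coordinate-wise median (one of the paper's two starting rules), the chi-squared cutoff omits the paper's small-sample correction factor, and the function name and defaults are ours.

```python
import numpy as np
from scipy.stats import chi2

def bacon_multivariate(X, m=None, alpha=0.05, max_iter=50):
    """Simplified sketch of the multivariate BACON iteration.

    Start from the m points closest to the coordinate-wise median, then
    repeatedly recompute the basic subset as all points whose Mahalanobis
    distance from the subset's mean/covariance falls below a chi-squared
    cutoff, until the subset stops changing.
    """
    n, p = X.shape
    m = m or 5 * p  # heuristic initial subset size; assumes n >> 5p
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    basic = np.sort(np.argsort(d0)[:m])
    cutoff = np.sqrt(chi2.ppf(1 - alpha / n, p))  # Bonferroni-style level
    for _ in range(max_iter):
        mu = X[basic].mean(axis=0)
        cov = np.cov(X[basic], rowvar=False)
        diff = X - mu
        d = np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff))
        new_basic = np.flatnonzero(d < cutoff)
        if np.array_equal(new_basic, basic):
            break
        basic = new_basic
    outliers = np.setdiff1d(np.arange(n), basic)
    return basic, outliers
```

Because each pass costs only a covariance matrix and a set of distances, and the number of passes is small, the whole procedure stays cheap even for large n.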
BACON algorithm for regression data
Consider the standard linear model y = Xβ + ε, where y is an n-vector of responses, X is an n×p matrix representing p explanatory variables with rank p < n, β is a p-vector of unknown parameters, and ε is an n-vector of random disturbances (errors) whose conditional mean and variance are given by E(ε|X) = 0 and Var(ε|X) = σ²Iₙ, where σ² is an unknown parameter and Iₙ is the identity matrix of order n.
The least-squares estimates of β and σ² are given by β̂ = (XᵀX)⁻¹Xᵀy and the residual mean square, σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂)/(n − p).
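In this notation, the two estimates are a few lines of numpy (a minimal sketch; the function name is ours, and `lstsq` is used instead of forming (XᵀX)⁻¹ explicitly for numerical stability):

```python
import numpy as np

def ols(X, y):
    """Least-squares estimates: beta-hat solving min ||y - X beta||^2,
    and the residual mean square sigma^2-hat = e'e / (n - p)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    return beta_hat, sigma2_hat
```

These nonrobust estimates are what the regression BACON algorithm computes on each basic subset, so their low cost is what keeps the iteration cheap.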
Assumptions and the role of the data analyst
All outlier nomination and robust methods must assume some simple structure for the non-outlying points — otherwise one cannot know what it means for an observation to be discrepant. The BACON algorithms assume that the model used to define the basic subsets is a good description of the non-outlying data. In the regression version, there must in fact be an underlying linear model for the non-outlying data. In the multivariate outlier nominator, the non-outlying data should be roughly
Simulations
Hadi (1994) showed that his method matched or surpassed the performance of other published methods. Therefore, we performed simulation experiments to (a) compare the BACON method with Hadi's (1994) method (H94) with regard to both performance and computational expense and (b) assess the performance of the BACON method on large data sets. The experiment considers outlier detection in multivariate data.
The H94 method is computationally expensive for large data sets. Therefore, for comparison
Examples
We illustrate the computational efficiency of the proposed methods using two data sets: The Wood Gravity data and the Philips data. Rousseeuw and Leroy (1987) use the wood gravity data (originally given by Draper and Smith, 1966) to illustrate the performance of LMS. The data consist of 20 observations on six variables. Observations 4, 6, and 19 are known to be outliers. Cook and Hawkins (1990) apply MVE with sampling to these data and they report that they needed over 57,000 samples to find
Large data sets
Remarkably, the computing cost of the BACON algorithms for multivariate outliers and for regression is low. The major costs are the computing of a covariance matrix and the computing of the distances themselves. Because the number of iterations is small, none of these costs grows out of bounds. In practical terms, current desktop computers can find Mahalanobis distances for a million cases in about ten seconds. It is thus practical to apply BACON algorithms to data sets of millions of cases on
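The dominant costs named above, one covariance matrix and one set of distances, can be seen in a short vectorized sketch (ours; timings are machine-dependent, so the roughly-ten-seconds figure is not asserted here):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000_000, 5
X = rng.normal(size=(n, p))

t0 = time.perf_counter()
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)  # one O(n p^2) pass over the data
diff = X - mu
# Mahalanobis distances via a linear solve rather than an explicit inverse.
d = np.sqrt(np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T))
print(f"{n} Mahalanobis distances in {time.perf_counter() - t0:.2f} s")
```

On current hardware this runs in seconds, which is consistent with the claim that BACON scales to data sets of millions of cases.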
Summary and recommendations
Outlier detection methods have suffered in the past from a lack of generality and a computational cost that escalated rapidly with the sample size. Small samples provide too small a base for reliable detection of multiple outliers, so suitable graphics are often the detection method of choice. Samples of a size sufficient to support sophisticated methods rapidly grow too large for previously published outlier detection methods to be practical. The BACON algorithms given here reliably detect
References
- Atkinson, A.C., 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic...
- Atkinson, A.C., 1994. Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association.
- Bacon, F., 1620. Novum Organum. Urbach, P., Gibson, J. (Translators, Eds.). Open Court Publishing Co, Chicago,...
- Barrett, B.E., Gray, J.B., 1997. On the use of robust diagnostics in least squares regression analysis. Proceedings of...
- Barnett, V., Lewis, T., 1994. Outliers in Statistical Data. Wiley, New...
- Belsley, D.A., Kuh, E., Welsch, R.E., 1980. Regression Diagnostics: Identifying Influential Data and Sources of...
- Chatterjee, S., Hadi, A.S., 1988. Sensitivity Analysis in Linear Regression...
- Cook, R.D., Hawkins, D.M., 1990. Comment on “Unmasking multivariate outliers and leverage points”. Journal of the...
- Cook, R.D., Weisberg, S., 1982. Residuals and Influence in Regression. Chapman & Hall,...
- Hadi, A.S., 1992. A new measure of overall potential influence in linear regression. Computational Statistics and Data Analysis.