Summary
Classical and robust/resistant procedures for estimating population parameters and identifying multiple outliers in univariate and multivariate populations are reviewed. The successful identification of anomalous observations depends on the statistical procedures employed. Commercial industries, local communities, and government agencies such as the United States Environmental Protection Agency (U.S. EPA) often need to assess the extent of contamination at polluted sites. Identifying contaminants that have potentially adverse effects on human health is especially important in ecological and environmental applications. Environmental scientists typically generate and analyze large amounts of multidimensional data, and they often need to identify experimental conditions and results that look suspicious and differ significantly from the rest of the data. The classical Mahalanobis distance (MD) and its variants (e.g., multivariate kurtosis) are routinely used to identify these anomalies. These test statistics depend upon estimates of population location and scale. The presence of anomalous observations usually distorts the maximum likelihood estimates (MLEs) and ordinary least-squares (OLS) estimates of the population parameters and makes them unreliable. The distorted estimates in turn deflate the classical MDs and lead to masking effects, in which outliers go undetected. As a result, statistical tests and inference based upon these classical estimates may be misleading. For example, in an environmental monitoring application, a classification procedure based upon the distorted estimates may classify a contaminated sample as coming from the clean population and a clean sample as coming from the contaminated part of the site. This in turn can lead to incorrect remediation decisions.
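The masking effect described above can be illustrated with a minimal sketch. The data, the helper names `mean_and_cov` and `mahalanobis_sq`, and the chi-square cutoff are all assumptions made for this illustration and are not taken from the chapter. Eight hypothetical "clean" points and three clustered outliers are scored twice: once with classical estimates computed from all eleven points, and once with reference estimates computed from the clean points only.

```python
import math

def mean_and_cov(points):
    """Classical sample mean and covariance (divisor n - 1) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, location, scatter):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    (mx, my), (sxx, sxy, syy) = location, scatter
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical data: 8 clean points around the origin, 3 clustered outliers.
clean = [(1, 1), (1, -1), (-1, 1), (-1, -1), (2, 0), (-2, 0), (0, 2), (0, -2)]
outliers = [(8, 8), (8.5, 8), (8, 8.5)]
data = clean + outliers

# Assumed cutoff: 95% quantile of chi-square with 2 DF, which equals -2*ln(0.05).
cutoff = -2.0 * math.log(0.05)

loc_all, cov_all = mean_and_cov(data)       # distorted by the outliers
loc_ref, cov_ref = mean_and_cov(clean)      # what a robust fit should approach

d2_classical = [mahalanobis_sq(pt, loc_all, cov_all) for pt in data]
d2_reference = [mahalanobis_sq(pt, loc_ref, cov_ref) for pt in data]

flagged_classical = [pt for pt, d2 in zip(data, d2_classical) if d2 > cutoff]
flagged_reference = [pt for pt, d2 in zip(data, d2_reference) if d2 > cutoff]

print("flagged with classical estimates:", flagged_classical)
print("flagged with clean-data estimates:", flagged_reference)
```

With these data the classical distances flag nothing, because the outliers pull the mean toward themselves and inflate the covariance along their direction, while the clean-data estimates flag exactly the three outliers. The reference fit uses knowledge of which points are clean, which is not available in practice; it only shows what a robust procedure is meant to recover.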
It is well established among practitioners that, for the identification of multiple outliers, one should use robust procedures with a high breakdown point. The estimates obtained using the robust procedures should be in close agreement with the corresponding classical OLS estimates and MLEs when no discordant observations (from different populations) are present. Robust procedures for the identification of outliers and the estimation of population parameters of location and scale typically use an influence function. The robust procedure based upon a recently developed “proposed” influence function, called the PROP function, works quite effectively in estimating population parameters accurately and in correctly identifying multiple outliers in univariate and multivariate populations. The control-chart-type quantile-quantile (Q-Q) graphical display of multivariate data combines the effect of a formal test procedure and an informal graphical display into one powerful multiple outlier identification procedure. The scatter plot of the robustified square root leverage distances versus the residuals identifies all regression outliers and distinguishes between significant and insignificant leverage points. The procedures discussed here unmask multiple anomalies and provide reliable estimates of the population parameters in several areas of interest, including linear regression models, discriminant and principal component analyses, and variogram modeling in geostatistical applications. The U.S. EPA, through the Office of Research and Development (ORD), has research interests in optimizing its quality assurance program by developing statistical procedures that are insensitive to outliers (resistant) and to the underlying assumptions (robust).
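How an influence function enters a robust estimate can be sketched in the univariate case. The summary does not give the form of the PROP function, so the sketch below substitutes Huber's influence function, which is a different and older choice; the tuning constant `c = 1.5`, the cutoff of 3 standardized units, the data, and the function name `huber_location` are likewise assumptions made for illustration. The location is found by iteratively reweighted averaging, with the scale held fixed at the MAD-based estimate.

```python
import statistics

def huber_location(values, c=1.5, tol=1e-10, max_iter=100):
    """Huber M-estimate of location by iterative reweighting.

    The scale is fixed at MAD / 0.6745, the usual MAD-based estimate of sigma.
    """
    mu = statistics.median(values)                      # robust starting value
    mad = statistics.median([abs(v - mu) for v in values])
    sigma = mad / 0.6745
    if sigma == 0:
        return mu, sigma        # at least half of the values coincide with the median
    for _ in range(max_iter):
        weights = []
        for v in values:
            u = abs(v - mu) / sigma                     # standardized distance
            weights.append(1.0 if u <= c else c / u)    # w(u) = psi_H(u) / u
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu, sigma

# Hypothetical measurements: 8 clean values near 10 and 2 discordant values.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean + [15.0, 16.0]

mu_classical = statistics.mean(contaminated)
sd_classical = statistics.stdev(contaminated)
mu_robust, sd_robust = huber_location(contaminated)

# Standardized distances of every observation under each pair of estimates.
z_classical = [abs(v - mu_classical) / sd_classical for v in contaminated]
z_robust = [abs(v - mu_robust) / sd_robust for v in contaminated]

print("classical:", round(mu_classical, 3), "robust:", round(mu_robust, 3))
```

On the contaminated sample the classical mean is pulled to 11.1 and no standardized distance exceeds 3, whereas the Huber estimate stays near 10 and both discordant values stand out clearly. On the clean sample alone every weight equals one, so `huber_location(clean)` returns the ordinary sample mean, which is the agreement with the classical estimate that the text asks of a robust procedure.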
Abbreviations
- ANOVA: analysis of variance
- CC: confidence coefficient
- CI: confidence interval
- CLP: Contract Laboratory Program
- \(R^2\): coefficient of determination
- DF, v: degrees of freedom
- Huber: Huber’s influence function
- Biweight: Tukey’s Biweight influence function
- IRLS: iteratively reweighted least squares
- LCL: lower confidence limit for the population mean
- UCL: upper confidence limit for the population mean
- LPL: lower limit for the prediction interval
- UPL: upper limit for the prediction interval
- LSL: lower limit for the simultaneous confidence interval
- USL: upper limit for the simultaneous confidence interval
- LMS: least median squares
- M: median
- M-estimator: generalized maximum likelihood estimator
- MAD: median absolute deviation
- \(\hat{\sigma}_{MAD}\): estimate of σ based upon MAD
- Max: maximum
- MD: Mahalanobis distance
- Max(MDs): largest Mahalanobis distance
- MLE: maximum likelihood estimation
- MS: mean square
- MVE: minimum volume ellipsoid
- MVT: multivariate trimming
- OLS: ordinary least squares
- PCA: principal component analysis
- PE: performance evaluation
- PLS: partial least squares
- PROP: proposed influence function
- QA/QC: quality assurance/quality control
- Q-Q: quantile-quantile
- sd: standard deviation
- SEDOP: statistical experimental design and optimization
- sgn: the signum function
- SIMCA: Soft Independent Modelling of Class Analogy
- SS: sum of squares
- TSP: three step procedure
- n: sample size
- p: dimension of the data set
- μ: univariate population mean
- σ: univariate population standard deviation
- \(\bar{x}\): sample mean
- s: sample sd, and min(g − 1, p)
- \(\bar{x}^*\): robust estimator of population mean, μ
- s*: robust estimator of population sd, σ
- k: number of outliers, cutoff constant from the Gaussian distribution, and number of populations (groups)
- x: p-dimensional random vector representing an observation
- f(x): the density function of the vector, x
- Σ: summation sign
- μ: p-dimensional population mean vector (location)
- Σ: p × p population dispersion matrix (scale)
- μ*: robust estimator of population location
- Σ*: robust estimator of population scale
- h: a spherically symmetric density in p-dimensional space
- \(d_i^2\): Mahalanobis distance for the i-th observation
- \(d_\alpha^b\): α100% critical value of the test statistic Max(MDs)
- \(d_0^2\), \(d_{ind}^2\): α100% critical value from the distribution of \(d_i^2\)
- \(d_{m,\alpha}^2\): α100% critical value from the distribution of Max(\(d_i^2\))
- \(\psi(d_i)\): the PROP influence function
- \(w(d_i)\): the weight function
- wsum: sum of the weights, \(w(d_i)\)
- wsum2: sum of the squared weights, \(w^2(d_i)\)
- \(\bar{x}^*_{bi}\): Biweight estimator of μ
- \(\bar{x}^*_{H}\): Huber estimator of μ
- \(\psi_{bi}(u)\): Biweight influence function
- \(\psi_{H}(u)\): Huber influence function
- \(u_i\): i-th standardized observation
- c: tuning constant
- α: level of significance
- \(t_v\): Student’s t-value with v DF
- \(t_{bi}\): t-value associated with the Biweight function
- \(t_{0.7(n-1)}\) (α: 1): Student’s t-value with 0.7(n − 1) DF
- \(t_{\alpha/2,v}\): (α/2)100% Student’s t-value with v DF
- \(t_c\): classical critical value of Student’s t distribution
- \(t_r\): robust critical value of Student’s t distribution
- β(a, b): beta distribution with parameters a and b
- Γ: gamma function
- \(\chi^2\): chi-square distribution
- \(Q_1\): Dixon’s statistic for finding a single upper outlier
- \(Q_2\): Dixon’s statistic for finding two upper outliers
- \(R_{p,n}\): the correlation coefficient
- y: observed response variable
- \(e_i\): normally distributed error term in regression model
- \(\hat{\sigma}\): estimate of the sd of the error term
- \(r_i^2\): the residual sum of squares
- β: vector of regression coefficients
- \(\hat{\beta}_{OLS}\): ordinary least squares estimate of β
- \(\hat{\beta}_{R}\): robust estimator of β
- \(\hat{\beta}_{LMS}\): least median squares estimate of β
- \(\hat{\beta}_{PROP}\): estimate of β
- \(Ld_i^2\): MDs using the x-explanatory variables
- \(Ld_\alpha^b\): α100% critical value from the distribution of \(Ld_i^2\)
- \(w(x_i, d_i)\): weight function used in robust regression
- \(P_i\): eigenvector corresponding to the i-th eigenvalue
- q(k): normal quantile
- g: number of distinct populations (groups)
- \(\pi_i\): the i-th population
- \(\mu_i\): mean vector of the i-th population
- \(\Sigma_i\): dispersion matrix of the i-th population
- \(\hat{B}^*\): between-groups matrix
- W*: within-groups matrix
- \(S^*_{pooled}\): pooled estimate of the common dispersion matrix, Σ
- \(\hat{\lambda}_i\): an eigenvalue of \(W^{*-1}\hat{B}^*\)
- \(l_i\): normalized eigenvector corresponding to \(\hat{\lambda}_i\)
- \(y_i\): i-th discriminant function
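For orientation, several of the quantities named in this list have standard textbook forms, collected below as a reference sketch. These are the commonly used definitions and are an assumption here: the chapter's own normalizations and its choice of tuning constant c may differ. The PROP function is not written out because its form is not given in this summary.

```latex
% Squared Mahalanobis distance of the i-th observation from a location
% estimate \mu^* and a scale (dispersion) estimate \Sigma^*
\[ d_i^2 = (x_i - \mu^*)^{\mathsf{T}} (\Sigma^*)^{-1} (x_i - \mu^*) \]

% MAD-based estimate of the univariate sd, where M is the sample median
\[ \mathrm{MAD} = \operatorname{median}_i \lvert x_i - M \rvert ,
   \qquad \hat{\sigma}_{MAD} = \frac{\mathrm{MAD}}{0.6745} \]

% Huber influence function with tuning constant c
\[ \psi_{H}(u) =
   \begin{cases}
     u, & \lvert u \rvert \le c \\
     c \,\operatorname{sgn}(u), & \lvert u \rvert > c
   \end{cases} \]

% Tukey Biweight influence function with tuning constant c
\[ \psi_{bi}(u) =
   \begin{cases}
     u \left[ 1 - (u/c)^2 \right]^2, & \lvert u \rvert \le c \\
     0, & \lvert u \rvert > c
   \end{cases} \]

% Weight function implied by an influence function \psi, as used in IRLS
\[ w(u) = \frac{\psi(u)}{u} \]
```

The Huber weights never reach zero, so distant observations are downweighted but not rejected, while the Biweight function redescends to zero beyond c and discards them entirely.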
© 1995 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Singh, A., Nocerino, J.M. (1995). Robust Procedures for the Identification of Multiple Outliers. In: Einax, J. (eds) Chemometrics in Environmental Chemistry - Statistical Methods. The Handbook of Environmental Chemistry, vol 2 / 2G. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49148-4_8
DOI: https://doi.org/10.1007/978-3-540-49148-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-14885-3
Online ISBN: 978-3-540-49148-4
eBook Packages: Springer Book Archive