Robust Procedures for the Identification of Multiple Outliers

Singh, Anita; Nocerino, John M.

doi:10.1007/978-3-540-49148-4_8

Part of the book series: The Handbook of Environmental Chemistry (HEC2, volume 2 / 2G)

Summary

Classical and robust/resistant procedures for the estimation of population parameters and the identification of multiple outliers in univariate and multivariate populations are reviewed. The successful identification of anomalous observations depends on the statistical procedures employed. Commercial industries, local communities, and government agencies, such as the United States Environmental Protection Agency (U.S. EPA), often need to assess the extent of contamination at polluted sites. Identifying contaminants that may have adverse effects on human health is especially important in various ecological and environmental applications. An environmental scientist typically generates and analyzes large amounts of multidimensional data. These practitioners often need to identify experimental conditions and results that look suspicious and differ significantly from the rest of the data. The classical Mahalanobis distance (MD) and its variants (e.g., multivariate kurtosis) are routinely used to identify these anomalies. These test statistics depend upon the estimates of population location and scale. The presence of anomalous observations usually results in distorted and unreliable maximum likelihood estimates (MLEs) and ordinary least-squares (OLS) estimates of the population parameters. These in turn result in deflated and distorted classical MDs and lead to masking effects. As a consequence, the results of statistical tests and inference based upon these classical estimates may be misleading. For example, in an environmental monitoring application, a classification procedure based upon the distorted estimates may classify a contaminated sample as coming from the clean population and a clean sample as coming from the contaminated part of the site. This in turn can lead to incorrect remediation decisions.
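
The masking effect described above can be illustrated with a small numerical sketch. The data are made up for illustration (nine clean bivariate points and three clustered outliers), and the function names `mean_and_cov` and `md2` are hypothetical. As a simplifying assumption, the cutoff is the upper 5% point of the chi-square distribution with 2 DF, not necessarily the critical value used in the chapter.

```python
import math

def mean_and_cov(points):
    """Sample mean and sample covariance (divisor n - 1) for bivariate data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def md2(point, center, cov):
    """Squared Mahalanobis distance of one point (closed-form 2 x 2 inverse)."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - center[0], point[1] - center[1]
    det = sxx * syy - sxy * sxy
    return (dx * dx * syy - 2.0 * dx * dy * sxy + dy * dy * sxx) / det

# Illustrative data only: a clean cluster near the origin plus three outliers.
clean = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
         (1, 1), (-1, -1), (1, -1), (-1, 1)]
outliers = [(10, 10), (10, 11), (11, 10)]

# Assumed cutoff: upper 5% point of chi-square with 2 DF, equal to -2*ln(0.05).
cutoff = -2.0 * math.log(0.05)

# Classical estimates from all 12 points: the outliers inflate the covariance,
# so their own distances are deflated (masking).
center_all, cov_all = mean_and_cov(clean + outliers)
masked = [md2(p, center_all, cov_all) for p in outliers]

# Estimates from the clean points only, as a stand-in for a robust fit.
center_clean, cov_clean = mean_and_cov(clean)
unmasked = [md2(p, center_clean, cov_clean) for p in outliers]

print("cutoff:", round(cutoff, 2))
print("classical MD^2 of the outliers:", [round(d, 2) for d in masked])
print("MD^2 with clean-data estimates:", [round(d, 2) for d in unmasked])
```

With these data, every outlier's classical distance falls below the cutoff, while the same points lie far beyond it once location and scale are estimated from the clean points alone. A robust procedure aims to reach the second situation without knowing in advance which points are clean.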

It is well established among practitioners that, for the identification of multiple outliers, one should use robust procedures with a high breakdown point. The estimates obtained using the robust procedures should be in close agreement with the corresponding classical OLS estimates and MLEs when no discordant observations (from different population(s)) are present. Robust procedures for the identification of outliers and the estimation of population parameters of location and scale typically use an influence function. The robust procedure based upon a recently developed “proposed” influence function, called the PROP function, works quite effectively in estimating population parameters accurately and in correctly identifying multiple outliers in univariate and multivariate populations. The control-chart-type quantile-quantile (Q-Q) graphical display of multivariate data combines the effect of a formal test procedure and an informal graphical display into one powerful multiple outlier identification procedure. The scatter plot of the robustified square root leverage distances versus the residuals identifies all regression outliers and distinguishes between significant and insignificant leverage points. The procedures discussed here unmask multiple anomalies and provide reliable estimates of the population parameters in several areas of interest, including linear regression models, discriminant and principal component analyses, and variogram modeling in geostatistical applications. The U.S. EPA, through the Office of Research and Development (ORD), has research interests in optimizing its quality assurance program by developing statistical procedures that are insensitive to outliers (resistant) and the underlying assumptions (robust).
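
How an influence function enters a robust estimate can be sketched with an iteratively reweighted (IRLS-style) location estimate. The summary does not give the form of the PROP function, so this sketch substitutes Huber's influence function. The tuning constant c = 1.345, the fixed MAD-based scale, the function names, and the data are all assumptions made for illustration and are not taken from the chapter.

```python
from statistics import mean, median

def huber_weight(u, c=1.345):
    """Weight w(u) = psi(u) / u for Huber's psi: 1 inside [-c, c], c/|u| outside."""
    a = abs(u)
    return 1.0 if a <= c else c / a

def huber_location(data, c=1.345, tol=1e-8, max_iter=100):
    """Iteratively reweighted estimate of location with a fixed MAD-based scale."""
    m = median(data)
    mad = median([abs(x - m) for x in data])
    sigma = mad / 0.6745  # MAD-based estimate of sigma (assumes MAD > 0)
    mu = m                # start from the median, a resistant initial value
    for _ in range(max_iter):
        weights = [huber_weight((x - mu) / sigma, c) for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu, sigma

# Illustrative data only: six measurements near 10 and two discordant values.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0, 30.0]

classical = mean(data)
mu_robust, sigma_robust = huber_location(data)

print("classical mean:", round(classical, 3))
print("Huber location:", round(mu_robust, 3))
```

The two discordant values pull the classical mean far from the bulk of the data, while the reweighted estimate stays close to 10 because observations with large standardized residuals receive small weights. When no discordant observations are present, all weights equal 1 and the estimate coincides with the sample mean, which matches the agreement property stated above.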

Abbreviations

ANOVA:: analysis of variance
CC:: confidence coefficient
CI:: confidence interval
CLP:: Contract Laboratory Program
R ² :: coefficient of determination
DF, v :: degrees of freedom
Huber:: Huber’s influence function
Biweight:: Tukey’s Biweight influence function
IRLS:: iteratively reweighted least squares
LCL:: lower confidence limit for the population mean
UCL:: upper confidence limit for the population mean
LPL:: lower limit for the prediction interval
UPL:: upper limit for the prediction interval
LSL:: lower limit for the simultaneous confidence interval
USL:: upper limit for the simultaneous confidence interval
LMS:: least median squares
M:: median
M-estimator:: generalized maximum likelihood estimator
MAD:: median absolute deviation
\({\hat \sigma _{MAD}}\) :: estimate of σ based upon MAD
Max:: maximum
MD:: Mahalanobis distance
Max(MDs):: largest Mahalanobis distance
MLE:: maximum likelihood estimation
MS:: mean square
MVE:: minimum volume ellipsoid
MVT:: multivariate trimming
OLS:: ordinary least squares
PCA:: principal component analysis
PE:: performance evaluation
PLS:: partial least squares
PROP:: proposed influence function
QA/QC:: quality assurance/quality control
Q-Q:: Quantile-Quantile
sd:: standard deviation
SEDOP:: statistical experimental design and optimization
sgn:: the signum function
SIMCA:: Soft Independent Modelling of Class Analogy
SS:: sum of squares
TSP:: three step procedure
n :: sample size
p :: dimension of the data set
μ :: univariate population mean
σ :: univariate population standard deviation
\(\bar x\) :: sample mean
s :: sample sd, and min(g−1,p)
\(\bar x^*\) :: robust estimator of population mean, μ
s*:: robust estimator of population sd, σ
k :: number of outliers, cutoff constant from the Gaussian distribution, and number of populations (groups)
x :: p-dimensional random vector representing an observation
f(x):: the density function of the vector, x
Σ:: summation sign
μ :: p-dimensional population mean vector (location)
Σ:: p × p population dispersion matrix (scale)
μ*:: robust estimator of population location
Σ*:: robust estimator of population scale
h :: a spherically symmetric density in p-dimensional space
d ²_i :: Mahalanobis distance for the i-th observation
d ^b_α :: α100% critical value of the test statistic Max(MDs)
d ²₀ , d ²_ind :: α100% critical value from the distribution of d ²_i
d ²_{m, α} :: α 100% critical value from the distribution of the Max( d ²_i )
ψ(d _i):: the PROP influence function
w(d _i):: the weight function
wsum :: sum of the weights, w(d _i)
wsum2:: sum of the squared weights, w ²(d _i)
\(\bar x^*_{bi}\) :: Biweight estimator of μ
\(\bar x^*_{H}\) :: Huber estimator of μ
ψ _bi(u):: Biweight influence function
ψ _H(u):: Huber influence function
u _i :: i-th standardized observation
c :: tuning constant
α:: level of significance
t _v :: Student’s t-value with v DF
t _bi :: t-value associated with the Biweight function
t _0.7(n−1)(α):: Student’s t-value with 0.7(n − 1) DF
t _α/2,v :: (α/2)100% Student’s t-value with v DF
t _c :: classical critical value of Student’s t distribution
t _r :: robust critical value of Student’s t distribution
β(a, b):: beta distribution with parameters a and b
Γ :: gamma function
χ ² :: chi-square distribution
Q ₁ :: Dixon’s statistic for finding a single upper outlier
Q ₂ :: Dixon’s statistic for finding two upper outliers
R _p,n :: the correlation coefficient
y :: observed response variable
e _i :: normally distributed error term in regression model
\(\hat \sigma \) :: estimate of the sd of the error term
r ²_i :: the residual sum of squares
β :: vector of regression coefficients
\({\hat \beta _{OLS}}\) :: ordinary least squares estimate of β
\({\hat \beta _{R}}\) :: robust estimator of β
\({\hat \beta _{LMS}}\) :: least median squares estimate of β
\({\hat \beta _{PROP}}\) :: estimate of β based upon the PROP influence function
Ld ²_i :: MDs using the x-explanatory variables
Ld ^b_α :: α100% critical value from the distribution of Ld ²_i
w(x _i, d _i):: weight function used in robust regression
P _i :: eigenvector corresponding to the i-th eigenvalue
q(k):: normal quantile
g :: number of distinct populations (groups)
π _i :: the i-th population
μ _i :: mean vector of the i-th population
Σ_i :: dispersion matrix of the i-th population
\({\hat B^*}\) :: between-groups matrix
W*:: within-groups matrix
S ^*_pooled :: pooled estimate of the common dispersion matrix, Σ
\({\hat \lambda _i}\) :: an eigenvalue of \(W^{*-1}\hat B^*\)
l _i :: normalized eigenvector corresponding to \({\hat \lambda _i}\)
y _i :: i-th discriminant function
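
For orientation, several of the symbols listed above have standard textbook definitions, sketched here. These are the commonly used forms and may differ in detail from those in the chapter. The symbol S for the sample dispersion matrix is an assumption, since it does not appear in the list, and the weight function is assumed to be w(u) = ψ(u)/u.

```latex
% Classical squared Mahalanobis distance and MAD-based scale estimate
\[ d_i^2 = (x_i - \bar x)^{\top} S^{-1} (x_i - \bar x), \qquad
   \hat\sigma_{MAD} = \frac{MAD}{0.6745} \]

% Huber and Tukey Biweight influence functions with tuning constant c
\[ \psi_H(u) = \begin{cases} u, & |u| \le c \\ c\,\mathrm{sgn}(u), & |u| > c \end{cases}
   \qquad
   \psi_{bi}(u) = \begin{cases} u\,(1 - u^2/c^2)^2, & |u| \le c \\ 0, & |u| > c \end{cases} \]
```

The robust distances are obtained in the same way by replacing the classical estimates with μ* and Σ*.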

Author information

Authors and Affiliations

Anita Singh: Lockheed Environmental Systems & Technologies Company, 980 Kelly Johnson Drive, Las Vegas, NV, 89119, USA
John M. Nocerino: Environmental Monitoring Systems Laboratory-Las Vegas, United States Environmental Protection Agency, P.O. Box 93478, Las Vegas, NV, 89193-3478, USA

Editor information

Editors and Affiliations

Jürgen Einax: Institute of Inorganic and Analytical Chemistry, Friedrich Schiller University, Lessingstraße 8, D-07743, Jena, Germany

Cite this chapter

Singh, A., Nocerino, J.M. (1995). Robust Procedures for the Identification of Multiple Outliers. In: Einax, J. (eds) Chemometrics in Environmental Chemistry - Statistical Methods. The Handbook of Environmental Chemistry, vol 2 / 2G. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49148-4_8

DOI: https://doi.org/10.1007/978-3-540-49148-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-14885-3
Online ISBN: 978-3-540-49148-4
eBook Packages: Springer Book Archive
