Part of the book series: The Handbook of Environmental Chemistry (volume 2/2G)

Summary

Classical and robust/resistant procedures for the estimation of population parameters and the identification of multiple outliers in univariate and multivariate populations are reviewed. The successful identification of anomalous observations depends on the statistical procedures employed. Commercial industries, local communities, and government agencies such as the United States Environmental Protection Agency (U.S. EPA) often need to assess the extent of contamination at polluted sites. Identifying contaminants with potentially adverse effects on human health is especially important in many ecological and environmental applications. An environmental scientist typically generates and analyzes large amounts of multidimensional data, and these practitioners often need to identify experimental conditions and results that look suspicious and differ significantly from the rest of the data. The classical Mahalanobis distance (MD) and its variants (e.g., multivariate kurtosis) are routinely used to identify such anomalies. These test statistics depend upon estimates of the population location and scale. The presence of anomalous observations usually results in distorted and unreliable maximum likelihood estimates (MLEs) and ordinary least-squares (OLS) estimates of the population parameters, which in turn yield deflated and distorted classical MDs and lead to masking effects. This means that the results of statistical tests and inference based upon these classical estimates may be misleading. For example, in an environmental monitoring application, a classification procedure based upon the distorted estimates may classify a contaminated sample as coming from the clean population and a clean sample as coming from the contaminated part of the site, which can lead to incorrect remediation decisions.
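
To make the masking effect concrete, here is a minimal numerical sketch, not taken from the chapter: classical Mahalanobis distances are computed from the MLE mean and covariance of a contaminated sample, and, because the outlying observations inflate the covariance estimate, their own distances come out deflated relative to distances computed from the clean observations alone. The simulated data, the 10% contamination level, and the helper name mahalanobis_sq are illustrative assumptions.

```python
# Minimal sketch (illustrative only): masking of outliers by classical
# Mahalanobis distances computed from non-robust MLE-type estimates.
import numpy as np

rng = np.random.default_rng(0)

# 45 "clean" bivariate observations plus 5 contaminated observations far away.
clean = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=45)
contaminated = rng.multivariate_normal([8.0, 8.0], [[1.0, 0.0], [0.0, 1.0]], size=5)
x = np.vstack([clean, contaminated])

def mahalanobis_sq(data, mean, cov):
    """Squared Mahalanobis distances d_i^2 = (x_i - mu)' Sigma^{-1} (x_i - mu)."""
    diff = data - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Classical estimates use every observation, so the 5 outliers distort them.
d2_classical = mahalanobis_sq(x, x.mean(axis=0), np.cov(x, rowvar=False))

# "Oracle" estimates from the clean part only, shown for comparison.
d2_oracle = mahalanobis_sq(x, clean.mean(axis=0), np.cov(clean, rowvar=False))

print("smallest d^2 among the 5 outliers (classical):", d2_classical[45:].min().round(1))
print("smallest d^2 among the 5 outliers (clean-only):", d2_oracle[45:].min().round(1))
# The classical distances of the outliers are far smaller, so a chi-square
# cutoff applied to them can fail to flag the contamination (masking).
```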

It is well established among practitioners that, for the identification of multiple outliers, one should use robust procedures with a high breakdown point. The estimates obtained using robust procedures should be in close agreement with the corresponding classical OLS estimates and MLEs when no discordant observations (from different populations) are present. Robust procedures for the identification of outliers and the estimation of the population location and scale parameters typically use an influence function. The robust procedure based upon a recently developed “proposed” influence function, called the PROP function, works quite effectively in estimating population parameters accurately and in correctly identifying multiple outliers in univariate and multivariate populations. The control-chart-type quantile-quantile (Q-Q) graphical display of multivariate data combines a formal test procedure and an informal graphical display into one powerful multiple-outlier identification procedure. The scatter plot of the robustified square root leverage distances versus the residuals identifies all regression outliers and distinguishes between significant and insignificant leverage points. The procedures discussed here unmask multiple anomalies and provide reliable estimates of the population parameters in several areas of interest, including linear regression models, discriminant and principal component analyses, and variogram modeling in geostatistical applications. The U.S. EPA, through the Office of Research and Development (ORD), has research interests in optimizing its quality assurance program by developing statistical procedures that are insensitive to outliers (resistant) and to departures from the underlying assumptions (robust).
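
The PROP procedure itself is not reproduced in this summary, so the sketch below only illustrates the general influence-function/IRLS idea it builds on, using Huber's influence function (listed in the abbreviations) as a stand-in for a univariate location estimate, with resistant starting values taken from the median and the MAD-based scale estimate \({\hat \sigma _{MAD}}\). The tuning constant c = 1.345, the function name huber_irls, and the toy data are assumptions, not the chapter's implementation.

```python
# Sketch only: iteratively reweighted least squares (IRLS) with Huber weights
# for a robust univariate location estimate. This stands in for the general
# influence-function approach; it is NOT the chapter's PROP procedure.
import numpy as np

def huber_irls(x, c=1.345, tol=1e-8, max_iter=100):
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                                  # resistant start for location
    s = 1.4826 * np.median(np.abs(x - mu))             # sigma_MAD, kept fixed here for simplicity
    for _ in range(max_iter):
        u = (x - mu) / s                               # standardized observations u_i
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))  # Huber weights w(u_i)
        mu_new = np.sum(w * x) / np.sum(w)             # weighted mean = one IRLS step
        if abs(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu, s, w

# Ten well-behaved measurements plus two gross outliers.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.3, 10.0, 9.9, 10.1, 25.0, 24.0]
mu_star, s_mad, weights = huber_irls(data)
print("robust location estimate:", round(mu_star, 2))               # near 10; the mean is ~12.4
print("weights given to the two outliers:", weights[-2:].round(3))  # close to 0, heavily downweighted
```

Replacing the Huber weight line with the biweight or the PROP weight function changes only the downweighting scheme; the surrounding IRLS loop stays the same.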

Abbreviations

ANOVA:

analysis of variance

CC:

confidence coefficient

CI:

confidence interval

CLP:

Contract Laboratory Program

\(R^2\) :

coefficient of determination

DF, v :

degrees of freedom

Huber:

Huber’s influence function

Biweight:

Tukey’s Biweight influence function

IRLS:

iteratively reweighted least squares

LCL:

lower confidence limit for the population mean

UCL:

upper confidence limit for the population mean

LPL:

lower limit for the prediction interval

UPL:

upper limit for the prediction interval

LSL:

lower limit for the simultaneous confidence interval

USL:

upper limit for the simultaneous confidence interval

LMS:

least median squares

M:

median

M-estimator:

generalized maximum likelihood estimator

MAD:

median absolute deviation

\({\hat \sigma _{MAD}}\) :

estimate of σ based upon MAD

Max:

maximum

MD:

Mahalanobis distance

Max(MDs):

largest Mahalanobis distance

MLE:

maximum likelihood estimation

MS:

mean square

MVE:

minimum variance ellipsoid

MVT:

multivariate trimming

OLS:

ordinary least squares

PCA:

principal component analysis

PE:

performance evaluation

PLS:

partial least squares

PROP:

proposed influence function

QA/QC:

quality assurance/quality control

Q-Q:

Quantile-Quantile

sd:

standard deviation

SEDOP:

statistical experimental design and optimization

sgn:

the signum function

SIMCA:

Soft Independent Modelling of Class Analogy

SS:

sum of squares

TSP:

three step procedure

n :

sample size

p :

dimension of the data set

μ :

univariate population mean

σ :

univariate population standard deviation

\(\bar x\) :

sample mean

s :

sample sd, and min(g−1,p)

\(\mathop {x*}\limits^ - \) :

robust estimator of population mean, μ

s*:

robust estimator of population sd, σ

k :

number of outliers, cutoff constant from the Gaussian distribution, and number of populations (groups)

x :

p-dimensional random vector representing an observation

f(x):

the density function of the vector, x

Σ:

summation sign

μ :

p-dimensional population mean vector (location)

Σ:

p × p population dispersion matrix (scale)

μ*:

robust estimator of population location

Σ*:

robust estimator of population scale

h :

a spherically symmetric density in p-dimensional space

\(d_i^2\) :

Mahalanobis distance for the i-th observation

\(d_{b\alpha}\) :

α100% critical value of the test statistic Max(MDs)

\(d_0^2\), \(d_{ind}^2\) :

α100% critical value from the distribution of \(d_i^2\)

\(d_{m,\alpha}^2\) :

α100% critical value from the distribution of the Max(\(d_i^2\))

\(\psi(d_i)\) :

the PROP influence function

\(w(d_i)\) :

the weight function

wsum:

sum of the weights, \(w(d_i)\)

wsum2:

sum of the squared weights, \(w^2(d_i)\)

\(\bar x{*_{bi}}\) :

Biweight estimator of μ

\(\bar x{*_{H}}\) :

Huber estimator of μ

\(\psi_{bi}(u)\) :

Biweight influence function

\(\psi_H(u)\) :

Huber influence function

\(u_i\) :

i-th standardized observation

c :

tuning constant

α:

level of significance

\(t_v\) :

Student’s t-value with v DF

\(t_{bi}\) :

t-value associated with the Biweight function

\(t_{0.7(n-1)}(\alpha)\) :

Student’s t-value with 0.7(n − 1) DF

\(t_{\alpha/2,v}\) :

(α/2)100% Student’s t-value with v DF

\(t_c\) :

classical critical value of Student’s t distribution

\(t_r\) :

robust critical value of Student’s t distribution

β(a, b):

beta distribution with parameters a and b

Г :

gamma function

\(\chi^2\) :

chi-square distribution

\(Q_1\) :

Dixon’s statistic for finding a single upper outlier

\(Q_2\) :

Dixon’s statistic for finding two upper outliers

\(R_{p,n}\) :

the correlation coefficient

y :

observed response variable

\(e_i\) :

normally distributed error term in regression model

\(\hat \sigma \) :

estimate of the sd of the error term

\(r_i^2\) :

the residual sum of squares

β :

vector of regression coefficients

\({\hat \beta _{OLS}}\) :

ordinary least squares estimate of β

\({\hat \beta _{R}}\) :

robust estimator of β

\({\hat \beta _{LMS}}\) :

least median squares estimate of β

\({\hat \beta _{PROP}}\) :

PROP estimate of β

\(Ld_i^2\) :

MDs using the x-explanatory variables

\(Ld_{b\alpha}\) :

α100% critical value from the distribution of \(Ld_i^2\)

\(w(x_i, d_i)\) :

weight function used in robust regression

\(P_i\) :

eigenvector corresponding to the i-th eigenvalue

q(k):

normal quantile

g :

number of distinct populations (groups)

\(\pi_i\) :

the i-th population

\(\mu_i\) :

mean vector of the i-th population

\(\Sigma_i\) :

dispersion matrix of the i-th population

\({\hat B^*}\) :

between-groups matrix

W*:

within-groups matrix

\(S^*_{pooled}\) :

pooled estimate of the common dispersion matrix, Σ

\({\hat \lambda _i}\) :

an eigenvalue of \({W^{{*^{ - 1}}}}{\hat B^*}\)

\(l_i\) :

normalized eigenvector corresponding to \({\hat \lambda _i}\)

\(y_i\) :

i-th discriminant function

Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Singh, A., Nocerino, J.M. (1995). Robust Procedures for the Identification of Multiple Outliers. In: Einax, J. (eds) Chemometrics in Environmental Chemistry - Statistical Methods. The Handbook of Environmental Chemistry, vol 2 / 2G. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-49148-4_8

  • DOI: https://doi.org/10.1007/978-3-540-49148-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-14885-3

  • Online ISBN: 978-3-540-49148-4
