Estimation of influential points in any data set from coefficient of determination and its leave-one-out cross-validated counterpart

Journal of Computer-Aided Molecular Design

Abstract

The coefficient of determination (R²) and its leave-one-out cross-validated analogue (denoted Q² or R²cv) are the most frequently published values for characterizing the predictive performance of models. In this article we use R² and Q² in a reversed sense, to detect unusual points, i.e. influential points, in any data set. The term (1 − Q²)/(1 − R²) corresponds to the ratio of the predictive residual sum of squares to the residual sum of squares. This ratio correlates with the number of influential points in experimental and random data sets. We propose an (approximate) F test on the (1 − Q²)/(1 − R²) term to quickly pre-estimate the presence of influential points in the training sets of models. The test is based on the routinely calculated Q² and R² values and warns model builders to verify the training set, to perform influence analysis, or even to switch to robust modeling.
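The ratio described above can be illustrated with a minimal Python sketch. It assumes the conventional definitions R² = 1 − RSS/TSS and Q² = 1 − PRESS/TSS, both taken over the same total sum of squares, so that (1 − Q²)/(1 − R²) reduces to PRESS/RSS. The one-descriptor linear model, the function names, and the toy data are hypothetical choices made for illustration and are not taken from the article. The abstract does not state the degrees of freedom of the approximate F test, so the sketch stops at the ratio itself.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r2_q2_ratio(x, y):
    """Return (R2, Q2, PRESS/RSS) for a one-descriptor linear model."""
    my = sum(y) / len(y)
    tss = sum((yi - my) ** 2 for yi in y)  # total sum of squares

    # Residual sum of squares from the fit on all points.
    a, b = fit_line(x, y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    # Leave-one-out: refit without point i, then predict point i.
    press = 0.0
    for i in range(len(y)):
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a_i + b_i * x[i])) ** 2

    r2 = 1.0 - rss / tss
    q2 = 1.0 - press / tss  # assumed definition: same TSS as R2
    return r2, q2, press / rss


# Hypothetical toy data: a nearly linear set, then the same set
# with one high-leverage point lying far off the trend.
x_clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_clean = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x_infl = x_clean + [20.0]
y_infl = y_clean + [10.0]

r2_c, q2_c, ratio_c = r2_q2_ratio(x_clean, y_clean)
r2_i, q2_i, ratio_i = r2_q2_ratio(x_infl, y_infl)
print(f"clean:       R2={r2_c:.4f} Q2={q2_c:.4f} PRESS/RSS={ratio_c:.2f}")
print(f"influential: R2={r2_i:.4f} Q2={q2_i:.4f} PRESS/RSS={ratio_i:.2f}")
```

In this toy case the ratio is larger for the set containing the off-trend, high-leverage point, which is the behavior the abstract relies on. Whether a given ratio is significant would have to be judged with the F test as specified in the full article.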

Author information

Correspondence to Károly Héberger.

About this article

Cite this article

Tóth, G., Bodai, Z. & Héberger, K. Estimation of influential points in any data set from coefficient of determination and its leave-one-out cross-validated counterpart. J Comput Aided Mol Des 27, 837–844 (2013). https://doi.org/10.1007/s10822-013-9680-4

  • DOI: https://doi.org/10.1007/s10822-013-9680-4
