research-article

Leakage in data mining: Formulation, detection, and avoidance

Published: 18 December 2012

Abstract

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
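As a rough illustration of the kind of leakage the abstract describes, the sketch below assumes that each recorded observation carries the date on which it became known, and that a prediction made on a given date may only use observations recorded before it. The record layout, the field names (`name`, `value`, `observed_on`), the helper `legitimate_features`, and the strict cutoff rule are all hypothetical choices made for this example; they are not taken from the paper and should not be read as the authors' methodology.

```python
from datetime import date


def legitimate_features(events, prediction_date):
    """Return only the events observed strictly before the prediction date.

    Assumption (not from the paper): an event is "legitimately available"
    if and only if it was recorded before the moment of prediction.
    """
    return [e for e in events if e["observed_on"] < prediction_date]


# Hypothetical customer history; the target is decided on 1 April 2011.
events = [
    {"name": "site_visits", "value": 12, "observed_on": date(2011, 3, 1)},
    {"name": "purchase_total", "value": 250.0, "observed_on": date(2011, 3, 20)},
    # Recorded after the prediction date, so using it would leak
    # information from the future into the model.
    {"name": "refund_issued", "value": 1, "observed_on": date(2011, 4, 5)},
]

cutoff = date(2011, 4, 1)
usable = legitimate_features(events, cutoff)
leaky = [e["name"] for e in events if e not in usable]

print([e["name"] for e in usable])  # ['site_visits', 'purchase_total']
print(leaky)                        # ['refund_issued']
```

Applying the same filter when building both the training set and the inputs used at prediction time is one simple way to keep the two stages consistent, under the assumptions stated above.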



    • Published in

      ACM Transactions on Knowledge Discovery from Data  Volume 6, Issue 4
      Special Issue on the Best of SIGKDD 2011
      December 2012
      141 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/2382577

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 December 2012
      • Accepted: 1 May 2012
      • Revised: 1 March 2012
      • Received: 1 November 2011


      Qualifiers

      • research-article
      • Research
      • Refereed
