Leakage in data mining: Formulation, detection, and avoidance

Authors:
Shachar Kaufman (Tel Aviv University, Israel)
Saharon Rosset (Tel Aviv University, Israel)
Claudia Perlich (m6d)
Ori Stitelman (m6d)

ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 4, Article No. 15, pp. 1–21. https://doi.org/10.1145/2382577.2382579

Published: 18 December 2012

Abstract

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several recent major public data mining competitions, such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge, are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover the more complex cases of leakage that have recently been documented, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive a general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple, specific approach to data management followed by what we call a learn-predict separation, and we present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
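
As a rough illustration of the kind of data management the abstract alludes to, the sketch below tags each observation with the date it was recorded and admits only observations recorded before the moment a prediction is made. This is a minimal sketch under stated assumptions, not the authors' procedure: the `Observation` schema, the `legitimate_features` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One recorded fact about an entity (hypothetical schema)."""
    entity_id: str
    name: str
    value: float
    recorded_on: date


def legitimate_features(observations, entity_id, prediction_date):
    """Keep only observations recorded strictly before prediction_date.

    Anything recorded on or after the prediction date could carry
    information about the target that would not be available in real use.
    """
    return {
        obs.name: obs.value
        for obs in observations
        if obs.entity_id == entity_id and obs.recorded_on < prediction_date
    }


# Hypothetical sample data: the second record postdates the prediction date.
observations = [
    Observation("c1", "visits_last_month", 4.0, date(2012, 1, 10)),
    Observation("c1", "refund_issued", 1.0, date(2012, 3, 2)),
]

features = legitimate_features(observations, "c1", date(2012, 2, 1))
print(features)
```

Applying the same filter when building both training and evaluation examples is one plausible way to keep a model from being evaluated on information it could not have at prediction time.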

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in

ACM Transactions on Knowledge Discovery from Data Volume 6, Issue 4
Special Issue on the Best of SIGKDD 2011
December 2012
141 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2382577

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 December 2012
- Accepted: 1 May 2012
- Revised: 1 March 2012
- Received: 1 November 2011
Author Tags
Data mining
leakage
predictive modeling
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 247
- Total Downloads: 2,184
- Downloads (Last 12 months): 330
- Downloads (Last 6 weeks): 41