DOI: 10.1145/1273496.1273614
Article

Experimental perspectives on learning from imbalanced data

Published: 20 June 2007

ABSTRACT

We present a comprehensive suite of experiments on learning from imbalanced data. When classes are imbalanced, many learning algorithms suffer reduced performance. Can data sampling improve the performance of learners built from imbalanced data? Is the effectiveness of sampling related to the type of learner? Do the results change if the objective is to optimize different performance metrics? We address these and other questions in this work, showing that in many cases sampling improves classifier performance.
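Among the data sampling techniques such studies evaluate, the simplest is random oversampling of the minority class until the class distribution is balanced. A minimal, self-contained sketch of that idea (illustrative only; not the authors' implementation, and the function name is hypothetical):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Randomly duplicate minority-class examples until the classes balance.

    X: list of feature vectors; y: parallel list of binary class labels.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    # indices of minority-class examples to sample (with replacement) from
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    deficit = counts[majority] - counts[minority]
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    X_bal = list(X) + [X[i] for i in extra]
    y_bal = list(y) + [y[i] for i in extra]
    return X_bal, y_bal

# Toy imbalanced data set: six negatives, two positives.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have six examples
```

Random undersampling, another technique commonly compared in this literature, works analogously by discarding majority-class examples instead of duplicating minority-class ones.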


    • Published in

      ICML '07: Proceedings of the 24th international conference on Machine learning
      June 2007
      1233 pages
      ISBN: 9781595937933
      DOI: 10.1145/1273496

      Copyright © 2007 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

      Overall acceptance rate: 140 of 548 submissions (26%)
