ABSTRACT
We present a comprehensive suite of experiments on learning from imbalanced data. When classes are imbalanced, many learning algorithms exhibit reduced performance. Can data sampling be used to improve the performance of learners built from imbalanced data? Is the effectiveness of sampling related to the type of learner? Do the results change if the objective is to optimize different performance metrics? We address these and other questions in this work, showing that sampling in many cases will improve classifier performance.
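As a minimal illustration of the kind of data sampling the abstract refers to, the sketch below implements random oversampling, i.e. duplicating minority-class examples until the class counts are equal. This is an assumed, simplified example for exposition (the function name and data are illustrative), not the experimental setup used in the paper.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until all classes
    reach the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # indices of examples belonging to this class
        idx = [i for i, lab in enumerate(y) if lab == label]
        # append random duplicates until the class reaches `target`
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data: four majority examples, one minority example.
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 4 examples
```

Random undersampling is the mirror image (discarding majority-class examples), and methods such as SMOTE instead synthesize new minority examples rather than duplicating existing ones.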
Experimental perspectives on learning from imbalanced data