
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

  • Conference paper
Advances in Intelligent Computing (ICIC 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3644)


Abstract

In recent years, mining imbalanced data sets has received more and more attention in both theory and practice. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods for evaluating and solving the imbalance problem. The synthetic minority over-sampling technique (SMOTE) is one of the over-sampling methods addressing this problem. Building on SMOTE, this paper presents two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. Experiments show that, for the minority class, our approaches achieve a better TP rate and F-value than SMOTE and random over-sampling.
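The idea sketched in the abstract can be illustrated in code: a minority example is in "danger" (near the borderline) when at least half, but not all, of its m nearest neighbours belong to the majority class; borderline-SMOTE1 then interpolates new synthetic examples between each such point and its k nearest minority neighbours. The sketch below is a minimal NumPy rendition under those assumptions (Euclidean distance, small dense arrays; the function name and parameters are mine), not the authors' implementation.

```python
import numpy as np

def borderline_smote1(X_min, X_maj, m=5, k=5, n_new=2, rng=None):
    """Minimal sketch of borderline-SMOTE1 (illustrative, not the paper's code).

    X_min: minority-class samples, shape (n_min, d)
    X_maj: majority-class samples, shape (n_maj, d)
    m: neighbours used to decide whether a minority point is in DANGER
    k: minority neighbours used for interpolation
    n_new: synthetic samples generated per DANGER point
    """
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_min, X_maj])   # minority rows first
    n_min = len(X_min)
    synthetic = []
    for i, p in enumerate(X_min):
        # m nearest neighbours among ALL training points, excluding p itself
        d = np.linalg.norm(X_all - p, axis=1)
        d[i] = np.inf
        nn = np.argsort(d)[:m]
        n_maj_nn = np.sum(nn >= n_min)  # neighbours drawn from the majority class
        # DANGER: at least half, but not all, of the neighbours are majority
        if m / 2 <= n_maj_nn < m:
            # interpolate toward k nearest *minority* neighbours
            dm = np.linalg.norm(X_min - p, axis=1)
            dm[i] = np.inf
            min_nn = np.argsort(dm)[:k]
            for j in rng.choice(min_nn, size=n_new):
                gap = rng.random()      # random point on the segment p -> neighbour
                synthetic.append(p + gap * (X_min[j] - p))
    if synthetic:
        return np.array(synthetic)
    return np.empty((0, X_min.shape[1]))
```

Borderline-SMOTE2 differs only in that it also interpolates toward majority-class neighbours, with the interpolation factor restricted to (0, 0.5) so the new point stays closer to the minority example.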



References

  1. Chawla, N.V., Japkowicz, N., Kolcz, A.: Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations 6(1), 1–6 (2004)

  2. Weiss, G.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1), 7–19 (2004)

  3. Ezawa, K.J., Singh, M., Norton, S.W.: Learning Goal Oriented Bayesian Networks for Telecommunications Management. In: Proceedings of the International Conference on Machine Learning, ICML 1996, Bari, Italy, pp. 139–147. Morgan Kaufmann, San Francisco (1996)

  4. Kubat, M., Holte, R., Matwin, S.: Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30, 195–215 (1998)

  5. van den Bosch, A., Weijters, T., van den Herik, H.J., Daelemans, W.: When Small Disjuncts Abound, Try Lazy Learning: A Case Study. In: Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning, pp. 109–118 (1997)

  6. Zheng, Z., Wu, X., Srihari, R.: Feature Selection for Text Categorization on Imbalanced Data. SIGKDD Explorations 6(1), 80–89 (2004)

  7. Fawcett, T., Provost, F.: Combining Data Mining and Machine Learning for Effective User Profiling. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 8–13. AAAI Press, Menlo Park (1996)

  8. Lewis, D., Catlett, J.: Uncertainty Sampling for Supervised Learning. In: Proceedings of the 11th International Conference on Machine Learning, ICML 1994, pp. 148–156 (1994)

  9. Bradley, A.: The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition 30(7), 1145–1159 (1997)

  10. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)

  11. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: ICML 1997, pp. 179–186 (1997)

  12. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

  13. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003)

  14. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explorations 6(1), 20–29 (2004)

  15. Estabrooks, A., Jo, T., Japkowicz, N.: A Multiple Resampling Method for Learning from Imbalanced Data Sets. Computational Intelligence 20(1), 18–36 (2004)

  16. Jo, T., Japkowicz, N.: Class Imbalances versus Small Disjuncts. SIGKDD Explorations 6(1), 40–49 (2004)

  17. Guo, H., Viktor, H.L.: Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. SIGKDD Explorations 6(1), 30–39 (2004)

  18. Freund, Y., Schapire, R.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)

  19. Joshi, M., Kumar, V., Agarwal, R.: Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. In: First IEEE International Conference on Data Mining, San Jose, CA (2001)

  20. Wu, G., Chang, E.Y.: Class-Boundary Alignment for Imbalanced Dataset Learning. In: Workshop on Learning from Imbalanced Datasets II, ICML, Washington, DC (2003)

  21. Huang, K., Yang, H., King, I., Lyu, M.R.: Learning Classifiers from Imbalanced Data Based on Biased Minimax Probability Machine. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2004)

  22. Dietterich, T., Margineantu, D., Provost, F., Turney, P. (eds.): Proceedings of the ICML 2000 Workshop on Cost-Sensitive Learning (2000)

  23. Manevitz, L.M., Yousef, M.: One-Class SVMs for Document Classification. Journal of Machine Learning Research 2, 139–154 (2001)

  24. Blake, C., Merz, C.: UCI Repository of Machine Learning Databases. Department of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/~MLRepository.html

  25. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1992)



Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, H., Wang, WY., Mao, BH. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Huang, DS., Zhang, XP., Huang, GB. (eds) Advances in Intelligent Computing. ICIC 2005. Lecture Notes in Computer Science, vol 3644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11538059_91


  • DOI: https://doi.org/10.1007/11538059_91

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28226-6

  • Online ISBN: 978-3-540-31902-3

  • eBook Packages: Computer Science (R0)
