Review
Learning from class-imbalanced data: Review of methods and applications
Introduction
Rare events, unusual patterns and abnormal behavior are difficult to detect, but often require responses from various management functions in a timely manner. By definition, rare events refer to events that occur much less frequently than commonly occurring events (Maalouf and Trafalis, 2011). Examples of rare events include software defects (Rodriguez et al., 2014), natural disasters (Maalouf and Trafalis, 2011), cancer gene expressions (Yu et al., 2012), fraudulent credit card transactions (Panigrahi et al., 2009), and telecommunications fraud (Olszewski, 2012).
In the field of data mining, detecting events is a prediction problem, or, typically, a data classification problem. Rare events are difficult to detect because of their infrequency and irregularity; however, misclassifying rare events can result in heavy costs. In financial fraud detection, only a few invalid transactions may emerge out of hundreds of thousands of transaction records, but failing to identify a serious fraudulent transaction can cause enormous losses. The scarcity of rare events turns the detection task into an imbalanced data classification problem. Imbalanced data refers to a dataset in which one or some of the classes have a much greater number of examples than the others. The most prevalent class is called the majority class, while the rarest class is called the minority class (Li et al., 2016c). Although data mining approaches have been widely used to build classification models to guide commercial and managerial decision-making, classifying imbalanced data significantly challenges these traditional classification models. As discussed in existing surveys, the reasons are fivefold:
- (1) Standard classifiers such as logistic regression, Support Vector Machines (SVMs) and decision trees are suitable for balanced training sets. When facing imbalanced scenarios, these models often provide suboptimal classification results, i.e. a good coverage of the majority examples, whereas the minority examples are distorted (López et al., 2013).
- (2) A learning process guided by global performance metrics such as prediction accuracy induces a bias towards the majority class, while the rare episodes remain undetected even if the prediction model produces a high overall accuracy (Loyola-González et al., 2016). Some original discussion can be found in Weiss and Hirsh (2000) and Weiss (2004).
- (3) Rare minority examples may be treated as noise by the learning model. Conversely, noise may be wrongly identified as minority examples, since both are rare patterns in the data space (Beyan and Fisher, 2015).
- (4) Even though skewed sample distributions are not always difficult to learn (such as when the classes are separable), minority examples usually overlap with other regions where the prior probabilities of both classes are almost equal. Denil and Trappenberg (2010) discuss the overlapping problem under imbalanced scenarios.
- (5) In addition, small disjuncts (Jo and Japkowicz, 2004), a lack of density, and small sample size combined with high feature dimensionality (Wasikowski and Chen, 2010) are challenges to imbalanced learning, which often cause learning models to fail in detecting rare patterns (Branco et al., 2016; López et al., 2013).
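Reason (2) can be made concrete with a small numerical sketch. The class counts below (990 majority and 10 minority examples) and the helper names are hypothetical and chosen only for illustration; they do not come from the studies reviewed here. A trivial model that always predicts the majority class scores well on accuracy while detecting no rare events at all.

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) examples.
y_true = [0] * 990 + [1] * 10

# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class examples that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Ratio of majority to minority examples in the toy data.
imbalance_ratio = y_true.count(0) / y_true.count(1)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)

print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
print(f"accuracy:        {acc:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Under these assumed counts the accuracy is 0.99 even though the minority recall is 0, which is why a global metric alone can hide a complete failure on the rare class.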
Many machine learning approaches have been developed in the past decade to cope with imbalanced data classification, most of which are based on sampling techniques, cost-sensitive learning and ensemble methods (Galar et al., 2012; Krawczyk et al., 2014; Loyola-González et al., 2016). There is also one book in this area (He and Ma, 2013). Although several surveys related to imbalanced learning have been published (Branco et al., 2016; Fernández et al., 2013; Galar et al., 2012; He and Garcia, 2009; Lerner et al., 2007; Sun et al., 2009), all of them focus on detailed techniques, while the application literature is neglected. Researchers from management, biology or other domains may be less concerned with sophisticated algorithms than with which problems can be solved using imbalanced learning techniques and how to build imbalanced learning systems with mature yet effective methods.
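Two of the families named above can be sketched in a few lines. The code below is a minimal illustration, not any specific algorithm from the cited works: random oversampling stands in for sampling techniques, and inverse-frequency class weights stand in for one simple way of setting misclassification costs in cost-sensitive learning. The function names and the toy data are assumptions introduced for this example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of each smaller class until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original examples belonging to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Weight each class by n_samples / (n_classes * class_count), so that
    errors on rarer classes count more in a weighted loss."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical toy data: 8 majority (0) and 2 minority (1) examples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
weights = inverse_frequency_weights(y)

print(Counter(y_bal))
print(weights)
```

Oversampling changes the training data and leaves the learner untouched, whereas class weights leave the data untouched and change the loss; which is preferable depends on the learner and the application.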
In this paper, we aim to provide a thorough overview of the classification of imbalanced data that includes both techniques and applications. At the technical level, we introduce common approaches to deal with imbalanced learning and propose a general framework within which each algorithm can be placed. This framework is a unified data mining model and includes preprocessing, classification and evaluation. At the practical level, we review 162 papers that tried to build specific systems to tackle rare pattern detection problems and develop a taxonomy of existing imbalanced learning application domains. The existing application literature covers most research fields, from medicine to industry to management.
The rest of this paper is organized as follows. Section 2 describes the research methodology for this study, along with the initial statistics regarding recent trends in imbalanced learning. Section 3 presents approaches to address both binary and multiple class imbalanced data. In Section 4, we first categorize the existing imbalanced learning application literature into 13 domains and then introduce the respective research frameworks. In Section 5, we discuss our thoughts on future imbalanced learning research directions from both a technical and practical point of view. Finally, Section 6 presents the conclusions of this paper.
Research methodology
For this study, based on the research methodology of Govindan and Jepsen (2016), a two-stage search procedure was conducted to compile relevant papers published from 2006 to October 2016. In the initial phase, seven library databases which covered most natural science and social science research fields were used to search for and collect literature: Elsevier, IEEExplore, Springer, ACM, Cambridge, Wiley and Sage. Full text search was used and the search term was designed following the search
Imbalanced data classification approaches
Hundreds of algorithms have been proposed in the past decade to address imbalanced data classification problems. In this section, we give an overview of the state-of-the-art imbalanced learning techniques. These techniques are discussed in line with the basic machine learning model framework. In Section 3.1, two basic strategies for addressing imbalanced learning are introduced, namely preprocessing and cost-sensitive learning. Preprocessing approaches include resampling methods conducted
Imbalanced data classification application domains
There is currently a great deal of interest in utilizing automated methods, particularly data mining and machine learning methods, to analyze the enormous amount of data that is routinely being collected. An important class of problems involves predicting future events based on past events. Event prediction often involves predicting rare events (Weiss and Hirsh, 2000). Rare events are events that occur with low frequency but may have far-reaching impact and disrupt society (King and Zeng,
Future research directions of imbalanced learning
In this section, we propose possible research directions for imbalanced learning based on our survey. In particular, Section 5.1 proposes imbalanced learning techniques that we think still need further study. Section 5.2 points out application domains in which imbalanced data are frequently observed but not well studied.
Conclusions
In this paper, we attempted to provide a thorough review of rare event detection techniques and their applications. In particular, a data mining and machine learning perspective was taken to view rare event detection as a class imbalanced data classification problem. We collected 527 papers that are related to imbalanced learning and rare event detection for this study. Unlike other surveys that have been published in the imbalanced learning field, we reviewed all papers from both a technical
Acknowledgements
This research has been supported by National Natural Science Foundation of China under Grant No.71103163, No.71573237; New Century Excellent Talents in University of China under Grant No. NCET-13-1012; Research Foundation of Humanities and Social Sciences of Ministry of Education of China No.15YJA630019; Special Funding for Basic Scientific Research of Chinese Central University under Grant No.CUG120111, CUG110411, G2012002A, CUG140604, CUG160605; Open Foundation for the Research Center of
References (300)
- Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks. Decision Support Systems (2008)
- Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data. Computers in Biology and Medicine (2016)
- DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets. Data & Knowledge Engineering (2012)
- A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowledge-Based Systems (2015)
- Governing events and life: 'Emergency' in UK Civil Contingencies. Political Geography (2012)
- Boosted Near-miss Under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing (2016)
- Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition (2015)
- An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications (2012)
- Projective ART for clustering data sets in high dimensional spaces. Neural Networks (2002)
- A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing (2014)
- Parameter-free classification in multi-class imbalanced data sets. Data & Knowledge Engineering
- A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. European Journal of Operational Research
- Cost-Sensitive Large margin Distribution Machine for classification of imbalanced data. Pattern Recognition Letters
- Affective detection based on an imbalanced fuzzy support vector machine. Biomedical Signal Processing and Control
- Ensemble aggregation methods for relocating models of rare events. Engineering Applications of Artificial Intelligence
- Parallel selective sampling method for imbalanced and large data classification. Pattern Recognition Letters
- Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks
- On the use of MapReduce for imbalanced big data using random forest. Information Sciences
- Random balance: ensembles of variable priors classifiers for imbalanced data. Knowledge-Based Systems
- Diversity techniques improve the performance of the best imbalance learning ensembles. Information Sciences
- Semi-supervised classification method through oversampling and common hidden space. Information Sciences
- A new support vector data description method for machinery fault diagnosis with unbalanced datasets. Expert Systems with Applications
- Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. NeuroImage
- Quantitative models for managing supply chain risks: A review. European Journal of Operational Research
- A data mining framework for detecting subscription fraud in telecommunication. Engineering Applications of Artificial Intelligence
- On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Information Sciences
- Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems
- A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Networks
- A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences
- Certainty-based active learning for sampling imbalanced datasets. Neurocomputing
- EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition
- Adaptive weighted imbalance learning with application to abnormal activity recognition. Neurocomputing
- Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing
- A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction. Expert Systems with Applications
- ELECTRE: A comprehensive literature review on methodologies and applications. European Journal of Operational Research
- An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Analytica Chimica Acta
- A comparison of fraud cues and classification methods for fake escrow website detection. Information Technology and Management
- A Classifier Hub for Imbalanced Financial Data
- A Study of Feature Selection of Magnetogram Complexity Features in an Imbalanced Solar Flare Prediction Data-set
- Identity verification based on haptic handwritten signatures: Genetic programming with unbalanced data
- An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids
- Classifying imbalanced data in distance-based feature space. Knowledge and Information Systems
- Application of fuzzy support vector machine for determining the health index of the insulation system of in-service power transformers. IEEE Transactions on Dielectrics and Electrical Insulation
- Behavioral Analysis of Insider Threat: A Survey and Bootstrapped Prediction in Imbalanced Data. IEEE Transactions on Computational Social Systems
- Polyp Detection via Imbalanced Learning and Discriminative Feature Learning. IEEE Transactions on Medical Imaging
- FIR as Classifier in the Presence of Imbalanced Data
- Cost sensitive credit card fraud detection using Bayes minimum risk
- ACID: association correction for imbalanced data in GWAS. IEEE/ACM Transactions on Computational Biology and Bioinformatics
- SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics
- Diversity Analysis on Imbalanced Data Using Neighbourhood and Roughly Balanced Bagging Ensembles
- 1 Guo Haixiang and Li Yijing contributed equally to this paper.