Review
Learning from class-imbalanced data: Review of methods and applications

https://doi.org/10.1016/j.eswa.2016.12.035

Highlights

  • 527 articles related to imbalanced data and rare events are reviewed.

  • The reviewed papers are examined from both technical and practical perspectives.

  • Existing methods and the corresponding statistics are summarized under a new taxonomy.

  • 162 application papers are categorized into 13 domains, and each domain is introduced.

  • Some open questions are discussed at the end of this manuscript.

Abstract

Rare events, especially those that could potentially negatively impact society, often require humans’ decision-making responses. Detecting rare events can be viewed as a prediction task in the data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in-depth review of rare event detection from an imbalanced learning perspective. Five hundred and twenty-seven related papers that have been published in the past decade were collected for the study. The initial statistics suggested that rare event detection and imbalanced learning are of concern across a wide range of research areas, from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then we detail the applications for each category. Finally, some suggestions from the reviewed papers are combined with our experiences and judgments to offer further research directions for the imbalanced learning and rare event detection fields.

Introduction

Rare events, unusual patterns and abnormal behavior are difficult to detect, but often require responses from various management functions in a timely manner. By definition, rare events refer to events that occur much less frequently than commonly occurring events (Maalouf and Trafalis, 2011). Examples of rare events include software defects (Rodriguez et al., 2014), natural disasters (Maalouf and Trafalis, 2011), cancer gene expressions (Yu et al., 2012), fraudulent credit card transactions (Panigrahi et al., 2009), and telecommunications fraud (Olszewski, 2012).

In the field of data mining, detecting events is a prediction problem, or, typically, a data classification problem. Rare events are difficult to detect because of their infrequency and irregularity; however, misclassifying rare events can result in heavy costs. In financial fraud detection, invalid transactions may emerge only rarely among hundreds of thousands of transaction records, but failing to identify a serious fraudulent transaction would cause enormous losses. The scarcity of rare events turns the detection task into an imbalanced data classification problem. Imbalanced data refers to a dataset within which one or some of the classes have a much greater number of examples than the others. The most prevalent class is called the majority class, while the rarest class is called the minority class (Li et al., 2016c). Although data mining approaches have been widely used to build classification models to guide commercial and managerial decision-making, classifying imbalanced data significantly challenges these traditional classification models. As discussed in existing surveys, the reasons are fivefold:

  • (1) Standard classifiers such as logistic regression, Support Vector Machine (SVM) and decision tree are suitable for balanced training sets. When facing imbalanced scenarios, these models often provide suboptimal classification results, i.e. good coverage of the majority examples, whereas the minority examples are distorted (López et al., 2013).

  • (2) A learning process guided by global performance metrics such as prediction accuracy induces a bias towards the majority class, while the rare episodes remain unknown even if the prediction model produces a high overall accuracy (Loyola-González et al., 2016). Some original discussion can be found in Weiss and Hirsh (2000) and Weiss (2004).

  • (3) Rare minority examples may be treated as noise by the learning model. Conversely, noise may be wrongly identified as minority examples, since both are rare patterns in the data space (Beyan and Fisher, 2015).

  • (4) Even though skewed sample distributions are not always difficult to learn (such as when the classes are separable), minority examples usually overlap with other regions where the prior probabilities of both classes are almost equal. Denil and Trappenberg (2010) have discussed the overlapping problem under imbalanced scenarios.

  • (5) In addition, small disjuncts (Jo and Japkowicz, 2004), a lack of density, and small sample size with high feature dimensionality (Wasikowski and Chen, 2010) are challenges to imbalanced learning, and often cause learning models to fail to detect rare patterns (Branco et al., 2016; López et al., 2013).
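
Reason (2) can be made concrete with a small worked example. The sketch below assumes a hypothetical dataset of 1,000 transactions, 10 of which are fraudulent (the numbers are illustrative and are not taken from the reviewed papers), and a trivial model that always predicts the majority class. It compares the global accuracy with the recall on the minority class.

```python
from collections import Counter

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# A trivial model that always predicts the majority class.
counts = Counter(y_true)
majority_label = counts.most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

# Imbalance ratio: majority count divided by minority count.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Global accuracy is dominated by the majority class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: the fraction of rare events that are actually detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / counts[1]

print(f"imbalance ratio: {imbalance_ratio:.0f}:1")   # 99:1
print(f"accuracy:        {accuracy:.2f}")            # 0.99
print(f"minority recall: {minority_recall:.2f}")     # 0.00
```

Under these assumptions the model reaches 99% accuracy while detecting none of the rare events, which is why a global metric alone can hide a complete failure on the minority class.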

Many machine learning approaches have been developed in the past decade to cope with imbalanced data classification, most of which have been based on sampling techniques, cost-sensitive learning and ensemble methods (Galar et al., 2012; Krawczyk et al., 2014; Loyola-González et al., 2016). There is also one book in this area; see He and Ma (2013). Although several surveys related to imbalanced learning have been published (Branco et al., 2016; Fernández et al., 2013; Galar et al., 2012; He and Garcia, 2009; Lerner et al., 2007; Sun et al., 2009), all of them focused on detailed techniques, while the application literature was neglected. Researchers from management, biology or other domains may be less concerned with sophisticated algorithms than with the problems that imbalanced learning techniques can solve and with how to build imbalanced learning systems from mature yet effective methods.
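
As a rough illustration of two of these families, the sketch below shows random oversampling (a basic sampling technique) and inverse-frequency class weights (one simple way to set the costs used in cost-sensitive learning). Both are generic baselines chosen for illustration and are not the specific methods of any cited paper; the function names `random_oversample` and `inverse_frequency_weights`, the toy data, and the weighting formula are assumptions.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def inverse_frequency_weights(y):
    """Per-class weight n_samples / (n_classes * class_count),
    an assumed heuristic that makes errors on rare classes cost more."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Toy data: 95 majority examples (0) and 5 minority examples (1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
weights = inverse_frequency_weights(y)

print(balanced_counts)  # both classes now have 95 examples
print(weights)          # the minority class receives the larger weight
```

Oversampling changes the training data and leaves the classifier untouched, whereas class weights leave the data untouched and change the loss the classifier minimizes; in practice the resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.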

In this paper, we aim to provide a thorough overview of the classification of imbalanced data that includes both techniques and applications. At the technical level, we introduce common approaches to deal with imbalanced learning and propose a general framework within which each algorithm can be placed. This framework is a unified data mining model and includes preprocessing, classification and evaluation. At the practical level, we review 162 papers that tried to build specific systems to tackle rare pattern detection problems and develop a taxonomy of existing imbalanced learning application domains. The existing application literature covers most research fields, from medicine to industry to management.

The rest of this paper is organized as follows. Section 2 describes the research methodology for this study, along with the initial statistics regarding recent trends in imbalanced learning. Section 3 presents approaches to address both binary and multiple class imbalanced data. In Section 4, we first categorize the existing imbalanced learning application literature into 13 domains and then introduce the respective research frameworks. In Section 5, we discuss our thoughts on future imbalanced learning research directions from both a technical and practical point of view. Finally, Section 6 presents the conclusions of this paper.

Section snippets

Research methodology

For this study, based on the research methodology of Govindan and Jepsen (2016), a two-stage search procedure was conducted to compile relevant papers published from 2006 to October 2016. In the initial phase, seven library databases which covered most natural science and social science research fields were used to search for and collect literature: Elsevier, IEEExplore, Springer, ACM, Cambridge, Wiley and Sage. Full text search was used and the search term was designed following the search

Imbalanced data classification approaches

Hundreds of algorithms have been proposed in the past decade to address imbalanced data classification problems. In this section, we give an overview of the state-of-the-art imbalanced learning techniques. These techniques are discussed in line with the basic machine learning model framework. In Section 3.1, two basic strategies for addressing imbalanced learning are introduced, namely preprocessing and cost-sensitive learning. Preprocessing approaches include resampling methods conducted

Imbalanced data classification application domains

There is currently a great deal of interest in utilizing automated methods—particularly data mining and machine learning methods—to analyze the enormous amount of data that is routinely being collected. An important class of problems involves predicting future events based on past events. Event prediction often involves predicting rare events (Weiss and Hirsh, 2000). Rare events are events that occur with low frequency but may cause far-reaching impact and disrupt the society (King and Zeng,

Future research directions of imbalanced learning

In this section, we propose possible research directions for imbalanced learning based on our survey. In particular, imbalanced learning techniques that we think still need to be considered are proposed in Section 5.1. Some application domains in which imbalanced data are frequently observed but not well studied are pointed out in Section 5.2.

Conclusions

In this paper, we attempted to provide a thorough review of rare event detection techniques and their applications. In particular, a data mining and machine learning perspective was taken to view rare event detection as a class-imbalanced data classification problem. We collected 527 papers that are related to imbalanced learning and rare event detection for this study. Unlike other surveys that have been published in the imbalanced learning field, we reviewed all papers from both a technical

Acknowledgements

This research has been supported by the National Natural Science Foundation of China under Grant No. 71103163 and No. 71573237; New Century Excellent Talents in University of China under Grant No. NCET-13-1012; Research Foundation of Humanities and Social Sciences of Ministry of Education of China No. 15YJA630019; Special Funding for Basic Scientific Research of Chinese Central University under Grant No. CUG120111, CUG110411, G2012002A, CUG140604, CUG160605; Open Foundation for the Research Center of

References (300)

  • L. Cerf et al.

    Parameter-free classification in multi-class imbalanced data sets

    Data & Knowledge Engineering

    (2013)
  • Z.-Y. Chen et al.

    A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data

    European Journal of Operational Research

    (2012)
  • F. Cheng et al.

    Cost-Sensitive Large margin Distribution Machine for classification of imbalanced data

    Pattern Recognition Letters

    (2016)
  • J. Cheng et al.

    Affective detection based on an imbalanced fuzzy support vector machine

    Biomedical Signal Processing and Control

    (2015)
  • C. D'Este et al.

    Ensemble aggregation methods for relocating models of rare events

    Engineering Applications of Artificial Intelligence

    (2014)
  • A. D'Addabbo et al.

    Parallel selective sampling method for imbalanced and large data classification

    Pattern Recognition Letters

    (2015)
  • S. Datta et al.

    Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs

    Neural Networks

    (2015)
  • S. del Río et al.

    On the use of MapReduce for imbalanced big data using random forest

    Information Sciences

    (2014)
  • J.F. Díez-Pastor et al.

    Random balance: ensembles of variable priors classifiers for imbalanced data

    Knowledge-Based Systems

    (2015)
  • J.F. Díez-Pastor et al.

    Diversity techniques improve the performance of the best imbalance learning ensembles

    Information Sciences

    (2015)
  • A. Dong et al.

    Semi-supervised classification method through oversampling and common hidden space

    Information Sciences

    (2016)
  • L. Duan et al.

    A new support vector data description method for machinery fault diagnosis with unbalanced datasets

    Expert Systems with Applications

    (2016)
  • R. Dubey et al.

    Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study

    NeuroImage

    (2014)
  • B. Fahimnia et al.

    Quantitative models for managing supply chain risks: A review

    European Journal of Operational Research

    (2015)
  • H. Farvaresh et al.

    A data mining framework for detecting subscription fraud in telecommunication

    Engineering Applications of Artificial Intelligence

    (2011)
  • A. Fernández et al.

    On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets

    Information Sciences

    (2010)
  • A. Fernández et al.

    Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches

    Knowledge-Based Systems

    (2013)
  • M. Frasca et al.

    A neural network algorithm for semi-supervised node label learning from unbalanced data

    Neural Networks

    (2013)
  • Y. Freund et al.

    A decision-theoretic generalization of on-line learning and an application to boosting

    Journal of Computer and System Sciences

    (1997)
  • J. Fu et al.

    Certainty-based active learning for sampling imbalanced datasets

    Neurocomputing

    (2013)
  • M. Galar et al.

    EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling

    Pattern Recognition

    (2013)
  • X. Gao et al.

    Adaptive weighted imbalance learning with application to abnormal activity recognition

    Neurocomputing

    (2016)
  • A. Ghazikhani et al.

    Ensemble of online neural networks for non-stationary and imbalanced data streams

    Neurocomputing

    (2013)
  • R. Gong et al.

    A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction

    Expert Systems with Applications

    (2012)
  • K. Govindan et al.

    ELECTRE: A comprehensive literature review on methodologies and applications

    European Journal of Operational Research

    (2016)
  • M. Hao et al.

    An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

    Analytica Chimica Acta

    (2014)
  • A. Abbasi et al.

    A comparison of fraud cues and classification methods for fake escrow website detection

    Information Technology and Management

    (2009)
  • C. Abeysinghe et al.

    A Classifier Hub for Imbalanced Financial Data

  • A. Al-Ghraibah et al.

    A Study of Feature Selection of Magnetogram Complexity Features in an Imbalanced Solar Flare Prediction Data-set

  • F.A. Alsulaiman et al.

    Identity verification based on haptic handwritten signatures: Genetic programming with unbalanced data

  • A. Anand et al.

    An approach for classification of highly imbalanced data using weighting and undersampling

    Amino Acids

    (2010)
  • S. Ando

    Classifying imbalanced data in distance-based feature space

    Knowledge and Information Systems

    (2015)
  • A.D. Ashkezari et al.

    Application of fuzzy support vector machine for determining the health index of the insulation system of in-service power transformers

    IEEE Transactions on Dielectrics and Electrical Insulation

    (2013)
  • A. Azaria et al.

    Behavioral Analysis of Insider Threat: A Survey and Bootstrapped Prediction in Imbalanced Data

    IEEE Transactions on Computational Social Systems

    (2014)
  • S.-H. Bae et al.

    Polyp Detection via Imbalanced Learning and Discriminative Feature Learning

    IEEE Transactions on Medical Imaging

    (2015)
  • S. Bagherpour et al.

    FIR as Classifier in the Presence of Imbalanced Data

  • A.C. Bahnsen et al.

    Cost sensitive credit card fraud detection using Bayes minimum risk

  • F. Bao et al.

    ACID: association correction for imbalanced data in GWAS

    IEEE/ACM Transactions on Computational Biology and Bioinformatics

    (2016)
  • R. Blagus et al.

    SMOTE for high-dimensional class-imbalanced data

    BMC Bioinformatics

    (2013)
  • J. Błaszczyński et al.

    Diversity Analysis on Imbalanced Data Using Neighbourhood and Roughly Balanced Bagging Ensembles

1 Guo Haixiang and Li Yijing contributed equally to this paper.
