An efficient and effective ensemble of support vector machines for anti-diabetic drug failure prediction

https://doi.org/10.1016/j.eswa.2015.01.042

Highlights

  • We propose a method called E3-SVM, efficient and effective ensemble of SVMs.

  • E3-SVM excludes superfluous data points when constructing an SVM ensemble.

  • E3-SVM was applied to the drug failure prediction problem for type 2 diabetes.

  • We confirmed the suitability of SVM with an accuracy of about 80%.

Abstract

The treatment of patients with type 2 diabetes is mostly based on drug therapies, aiming at managing glucose levels appropriately. As the number of patients with type 2 diabetes continually increases worldwide, predicting drug treatment failure becomes an important issue. Support vector machine (SVM) can be a good method for the anti-diabetic drug failure prediction problem; however, it is difficult to train SVM on large-scale medical datasets directly because of its high training time complexity of $O(N^3)$. To address this limitation, we propose an efficient and effective ensemble of SVMs, called E3-SVM. The proposed method excludes superfluous data points when constructing an SVM ensemble, thereby yielding a better classification performance. The proposed method consists of two phases. The first phase is to select the data points that are likely to be the support vectors by applying data selection methods. The second phase is to construct an SVM ensemble using the selected data points. We demonstrated the efficiency and effectiveness of the proposed method using the real-world dataset of the anti-diabetic drug failure prediction problem for type 2 diabetes. Experimental results show that the proposed method requires less training time to achieve comparable success, compared to the conventional SVM ensembles. Moreover, the proposed method obtains more reliable prediction results for each independent run of constructing an ensemble. In conclusion, firstly, the proposed method provides an efficient and effective way to use SVM for large-scale datasets. Secondly, we confirmed the suitability of SVM for the anti-diabetic drug failure prediction problem with an accuracy of about 80%.

Introduction

Diabetes is one of the most prevalent chronic diseases today. As the number of people with diabetes continually grows worldwide, the importance of research on the treatment of diabetes is progressively increasing. Particularly, type 2 diabetes is the most common, accounting for 85–90% of diabetes (Bennett, Guo, & Dharmage, 2007). Most patients with type 2 diabetes are under medical care with mono- or combination therapy of oral hypoglycemic agents, aiming at lowering glucose level. Glycated hemoglobin (HbA1c) is an effective and widely used measurement of glucose level for patients with type 2 diabetes (Bennett et al., 2007, Lu et al., 2010). According to the guideline of the American Diabetes Association (ADA) (American Diabetes Association, 2014), combination therapy is recommended for the patients with type 2 diabetes who cannot be controlled by mono-therapy. In addition, the ADA recommends an HbA1c level of 7% or lower as the reasonable glycemic goal for most individuals.

Unfortunately, the majority of patients fail to achieve their glycemic goals (Brown, Nichols, & Perry, 2004). This is because the outcome of diabetic treatment depends on many factors. The efficacy of anti-diabetic drugs can be affected by the characteristics of patients such as age, gender, obesity, and blood pressure. Moreover, the efficacy can also be affected by the interaction of various drugs. Therapies for type 2 diabetes are generally based on the combination of 2–3 oral hypoglycemic agents in order to obtain better and more reliable results (American Diabetes Association, 2014, Yki-Järvinen, 2001). In addition, because diabetes often leads to complications, drugs to treat complications are also administered to diabetic patients.

Predicting drug treatment failure is an important issue in the medical domain. Many studies have been conducted based on statistical analyses. However, it is difficult to predict the failure accurately using only statistical analyses because the failure is related to a variety of factors. The effectiveness of machine learning approaches for disease diagnosis has been reported by several studies (Hu et al., 2012, Sajda, 2006, Zeng and Liu, 2010), and more recently, some researchers have attempted to apply machine learning approaches to diabetes (Huang et al., 2007, Marinov et al., 2011). Most of them focus on prediction at the disease level. Machine learning approaches can also be effective for predicting drug treatment failure. However, to the best of our knowledge, there are relatively few efforts at predicting drug treatment failure using these approaches. The problem of anti-diabetic drug failure prediction can be defined as a classification problem, and it involves multivariate and complex relationships. Therefore, support vector machine (SVM) can be a good candidate as a classification algorithm.

SVM (Vapnik, 1995) is one of the most popular state-of-the-art classification algorithms, and shows superior generalization performance based on the structural risk minimization principle. The effectiveness of SVM has been verified in various applications such as text categorization, handwritten digit recognition, image segmentation, and financial forecasting (Burges, 1998). Moreover, SVM is also known to be very effective in the medical domain (Barakat et al., 2010, Yu et al., 2010).

However, training an SVM becomes difficult when the size N of a given dataset is very large, because the training time complexity of SVM is $O(N^3)$ (Kang and Cho, 2014): training involves solving a quadratic programming (QP) problem of $O(N^3)$ complexity. Therefore, it is impractical to train an SVM directly on a large-scale dataset. Two commonly used approaches to alleviate the complexity are improving the efficiency of the QP process (Fan et al., 2005, Joachims, 1999, Platt, 1999) and reducing the number of training data points by eliminating non-support vectors before the QP process (Li and Maguire, 2011, Shin and Cho, 2007).
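For reference, the QP mentioned above is commonly written in its dual form. The block below is the standard textbook statement rather than a formula taken from this paper, and it assumes notation that is not defined at this point in the text: $\alpha_i$ are Lagrange multipliers, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel, and $C > 0$ is the penalty parameter.

```latex
\max_{\alpha}\;
  \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
      \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
  \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad
  0 \le \alpha_i \le C, \quad i = 1, \dots, N.
```

The problem has N variables and a dense N×N kernel matrix, which is why its cost grows quickly with N. Only the points with $\alpha_i > 0$ (the support vectors) appear in the resulting decision function, which is what the data-reduction approaches exploit.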

When these approaches are insufficient, the training time of SVM can be further reduced by constructing an ensemble of SVMs that are trained with small bootstrap samples (Kim, Pang, Je, Kim, & Bang, 2003). The two typical ensemble methods, Bagging (Breiman, 1996) and Boosting (Freund & Schapire, 1997), can be employed to construct SVM ensembles (Kim et al., 2003, Wang et al., 2009). By aggregating the SVMs properly, we can obtain comparable classification accuracy even though the accuracy of each individual SVM is lower. One major concern is that a bootstrap sample might contain many superfluous data points when the sample size is set very small, thereby resulting in the training of an ill-formed SVM.
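As a rough illustration of why small bootstrap samples cut training time, the sketch below compares the cost of one SVM trained on all N points with the cost of an ensemble of smaller SVMs. It assumes an idealized cost of exactly N^3 operations per SVM with no constant factors, and the function name and the sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
def relative_training_cost(n_total, n_members, sample_size):
    """Cost of training n_members SVMs on sample_size points each, relative to
    one SVM on n_total points, under an idealized cost model of exactly N**3."""
    single = n_total ** 3
    ensemble = n_members * sample_size ** 3
    return ensemble / single


# Hypothetical sizes chosen for illustration only.
ratio = relative_training_cost(n_total=100_000, n_members=50, sample_size=1_000)
print(f"ensemble cost / single-SVM cost = {ratio:.6f}")
```

Under these assumptions the ensemble costs 50 × 1,000^3 = 5 × 10^10 operations against 10^15 for the single SVM, a ratio of 0.00005. Real solvers have different constants and often scale better than cubic, so this is an order-of-magnitude argument only.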

In this paper, we propose an efficient and effective ensemble of SVMs for large-scale datasets based on data selection methods, called E3-SVM. The proposed method is based on the fact that the SVM only uses support vectors to determine the decision boundary. In the proposed method, a reduced dataset is constructed by applying data selection methods (Shin and Cho, 2007, Li and Maguire, 2011) that select the data points most likely to be support vectors. That is, non-crucial data points of the original dataset are excluded from the reduced dataset. The ensemble of SVMs is constructed using bootstrap samples drawn from the reduced dataset. Consequently, the classification accuracy of the ensemble is improved by reducing the risk of using superfluous data points when training SVMs with small bootstrap samples. We investigated the efficiency and effectiveness of the proposed method through experiments on the anti-diabetic drug failure prediction problem.
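The two-phase idea can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: phase one uses a simple nearest-neighbour heuristic (keep a point if any of its k nearest neighbours belongs to the other class) as a stand-in for the data selection methods of Shin and Cho (2007) and Li and Maguire (2011). The function names, the parameters `k`, `n_members`, and `sample_size`, and the synthetic dataset are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC


def select_boundary_points(X, y, k=10):
    """Phase 1 (stand-in heuristic): keep points that have at least one
    other-class point among their k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is normally the point itself
    neighbour_labels = y[idx[:, 1:]]     # shape (N, k)
    near_boundary = (neighbour_labels != y[:, None]).any(axis=1)
    return X[near_boundary], y[near_boundary]


def fit_svm_ensemble(X, y, n_members=15, sample_size=150, seed=0):
    """Phase 2: bagged SVMs, each fit on a small bootstrap sample of (X, y)."""
    ensemble = BaggingClassifier(
        SVC(kernel="rbf", C=1.0, gamma="scale"),
        n_estimators=n_members,
        max_samples=min(sample_size, len(X)),  # bootstrap sample size per SVM
        bootstrap=True,
        random_state=seed,
    )
    return ensemble.fit(X, y)


# Synthetic stand-in data; the paper's clinical dataset is not reproduced here.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

X_reduced, y_reduced = select_boundary_points(X_train, y_train, k=10)
model = fit_svm_ensemble(X_reduced, y_reduced)
accuracy = model.score(X_test, y_test)
print(f"kept {len(X_reduced)} of {len(X_train)} training points")
print(f"test accuracy: {accuracy:.3f}")
```

Calling `fit_svm_ensemble(X_train, y_train)` directly gives a conventional bagged SVM ensemble for comparison. The only difference between the two is the pool from which the bootstrap samples are drawn.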

The rest of this paper is organized as follows. In section 2, we briefly review the related work. In section 3, we describe our proposed method and section 4 reports the experimental results on the anti-diabetic drug failure prediction problem. The conclusion and future work are given in section 5.

Section snippets

Support vector machines

SVM (Vapnik, 1995) seeks to find the maximum margin hyperplane $w^T\phi(x_i)+b$ that separates the positive data points from the negative data points. Given a training dataset $D=\{(x_i,y_i)\}_{i=1}^{N}$, where $N$ is the number of training data points, $x_i$ is an input feature vector and $y_i\in\{-1,1\}$ is the corresponding target class label, an SVM can be formulated as the following optimization problem:

$$\min_{w,b,\xi_i}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\left(w^T\phi(x_i)+b\right) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,N,$$

where $C>0$ is the parameter that controls the tradeoff between the

E3-SVM: efficient and effective ensemble of SVMs

In this section, we describe our proposed method E3-SVM, short for efficient and effective ensemble of SVMs. To construct an SVM ensemble, several SVMs are trained using bootstrap samples drawn from a given dataset. When the size of the bootstrap sample is much smaller than the dataset, each SVM is likely to be trained on mostly superfluous data points. To address the problem, the proposed method reduces the original dataset by selecting crucial data points and draws bootstrap samples from

Anti-diabetic drug failure prediction

We demonstrated the effectiveness of the proposed method for the anti-diabetic drug failure prediction problem through experiments using real-world data. In this section, data collection, preprocessing, experimental settings, and results are described.

Conclusions and future work

The treatment of patients with type 2 diabetes is mostly based on drug therapies and aims at managing glucose levels appropriately. Predicting drug treatment failure is a major issue, but is very difficult because of the influence of a wide variety of factors and because the relationship between such factors is also complicated. Due to the complexity, SVM can be a good candidate as a classification algorithm for the anti-diabetic drug failure prediction problem. The major drawback is high

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIP) (No. 2011–0030814), Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1004648), and the Brain Korea 21 PLUS Project in 2014. This work was also supported by the Engineering Research Institute of SNU.

References (27)

  • L. Breiman. Bagging predictors. Machine Learning (1996)

  • J.B. Brown et al. The burden of treatment failure in type 2 diabetes. Diabetes Care (2004)

  • C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery (1998)