Neurocomputing

Volume 70, Issues 1–3, December 2006, Pages 173–186

A meta-learning approach to automatic kernel selection for support vector machines

https://doi.org/10.1016/j.neucom.2006.03.004

Abstract

An appropriate choice of kernel is the most important ingredient of kernel-based learning methods such as the support vector machine (SVM). Automatic kernel selection is a key issue, given the number of kernels available and the current trial-and-error nature of selecting the best kernel for a given problem. This paper introduces a new method for automatic kernel selection, with empirical results based on classification. The empirical study covers five kernels and 112 classification problems, using the popular kernel-based statistical learning algorithm SVM. We evaluate the kernels' performance in terms of accuracy measures, and then focus on answering the question: which kernel is best suited to which type of classification problem? Our meta-learning methodology involves measuring the problem characteristics using classical, distance- and distribution-based statistical information. We then combine these measures with the empirical results to present a rule-based method for selecting the most appropriate kernel for a classification problem. The rules are generated by the decision tree algorithm C5.0 and evaluated with 10-fold cross-validation. All generated rules offer high accuracy ratings.

Introduction

Recently, researchers in the area of pattern recognition have given more attention to kernel-based learning algorithms due to their strong performance in bioinformatics [10], text mining [19], fraud detection [11], speaker identification [32] and database marketing [3], amongst many other areas. The support vector machine (SVM) [28], [6], [5] is one of the most popular kernel-based learning algorithms, first introduced by Vapnik and his group in the mid-1990s. SVM is an optimal hyperplane (OH)-based statistical learning method that solves classification as well as regression problems. The performance of the SVM, however, depends on the suitable selection of a kernel. The kernel is one of the most important components of the SVM algorithm: it implicitly computes dot products in a higher-dimensional feature space, which could theoretically be of infinite dimension, and in which linear discrimination is possible. Many kernels have been proposed to date, but little research has been conducted on how to choose an appropriate kernel for a given problem [22], [20]. Clearly, there is a need for automatic kernel selection methods.
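To make the kernel idea concrete, the short sketch below (our illustration, not code from the paper) verifies numerically that the degree-2 polynomial kernel $(x \cdot z + 1)^2$ equals an ordinary dot product under an explicit six-dimensional feature map:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D input x.
    By construction, phi(x) . phi(z) equals the kernel (x . z + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel evaluated directly in input space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])

# The kernel computes the feature-space dot product without ever
# forming the 6-dimensional vectors explicitly.
assert np.isclose(poly_kernel(x, z), np.dot(phi(x), phi(z)))
```

An RBF kernel plays the same role for an infinite-dimensional feature space, which is why its feature map can never be expanded explicitly.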

Our present research is a first step towards a solution to this issue. Our aim is to answer the following question: given the characteristics of a classification data set, and default parameter settings of the SVM, which kernel is likely to produce the most accurate results? Our methodology seeks to understand the characteristics of the data (classification problem), understand which kernels perform well on which types of problems, and generate rules to assist in the automatic selection of kernels for SVMs. This is a meta-learning approach [31]. First, we identify the data set characteristics matrix by statistical measures, as we have done in previous related work [24], [23]. All the statistical formulations are available in the Matlab Statistics Toolbox [25]. We then build models for 112 classification problems (see Appendix A) from the UCI Repository [4] and the Knowledge Discovery Central [13] database using SVM with five different kernels, employing a cross-validation testing methodology. Finally, we use the induction algorithm C5.0 (Windows version See5, http://www.rulequest.com/see5-info.html) to generate rules describing which of the five kernels is most suitable for which type of problem, given the data set characteristics and the performance of each kernel on each data set. We also evaluate the rules by their 10-fold cross-validation performance.
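A minimal sketch of this pipeline is given below, using scikit-learn's SVC and DecisionTreeClassifier as freely available stand-ins for the paper's SVM implementation and for C5.0. The meta-feature extractor here is a placeholder for the statistical measures of Section 5, and `datasets` is assumed to be a list of (X, y) classification problems such as those drawn from the UCI Repository:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# scikit-learn's built-in kernels; the paper itself compares five kernels.
KERNELS = ["linear", "poly", "rbf", "sigmoid"]

def meta_features(X):
    # Placeholder for the characteristics matrix of Section 5:
    # a handful of simple data set statistics.
    return np.array([X.shape[0], X.shape[1], X.mean(), X.std()])

def best_kernel(X, y):
    # Score each kernel at default parameters and keep the winner.
    scores = {k: cross_val_score(SVC(kernel=k), X, y, cv=10).mean()
              for k in KERNELS}
    return max(scores, key=scores.get)

def build_selector(datasets):
    # Meta-examples: data set characteristics -> best-performing kernel.
    M = np.array([meta_features(X) for X, _ in datasets])
    labels = [best_kernel(X, y) for X, y in datasets]
    return DecisionTreeClassifier().fit(M, labels)
```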

Our paper is organised as follows: in Section 2, we provide some theoretical background on SVMs. Section 3 focuses on a theoretical analysis of kernels, and draws a distinction between the five kernels considered in this paper. A comprehensive performance evaluation of all five kernels on the set of 112 classification problems is then presented in Section 4. Section 5 describes the statistical measures and methodology used to generate the data set characteristics matrix. The performance results and the data characteristics are then combined using the rule-based learning algorithm C5.0, and the rules for automatic kernel selection are presented and evaluated in Section 6. Finally, we conclude our research in Section 7.

Section snippets

Support vector machine

Let us consider a binary classification task with data matrix $D = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$, where $x_k \in \mathbb{R}^n$ and the corresponding targets are $y_k \in \{-1, +1\}$. Our aim is to find the OH in the feature space for this data matrix. The OH separates the classes of data points without error by maximising the distance between the closest vectors, as shown in Fig. 1.
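For reference, the hard-margin problem sketched in Fig. 1 is conventionally written as the quadratic programme (the standard formulation, see [30], [29]):

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_k\left(w^\top x_k + b\right) \ge 1, \qquad k = 1, \ldots, \ell,$$

whose solution maximises the margin $2/\|w\|$ between the two classes.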

We refer the interested reader to [30], [29] for a more comprehensive discussion of SVMs and their underlying mathematics. For the purposes of this

Kernel theory

Since the development of the SVM as an effective classifier for binary class problems, great interest has been generated in methods for generalising the linear decision rule to non-linear ones using kernel functions. A kernel function $K(x_i, x_j)$ is a transformation function that satisfies Mercer's theorem [29]. In essence, the theorem requires the kernel matrix to be positive semi-definite, i.e. to have only non-negative eigenvalues. Linear methods like principal component analysis (PCA) and Fisher
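The Mercer condition can be checked numerically on any finite sample: the Gram matrix must have no negative eigenvalues, up to round-off. A small sketch of such a check, using scikit-learn's rbf_kernel as an example kernel (our illustration, not the paper's code):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))     # 50 arbitrary points in R^4

K = rbf_kernel(X, gamma=0.5)     # Gram matrix: K[i, j] = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)  # eigensolver for symmetric matrices

# For a valid (Mercer) kernel this minimum is >= 0 up to round-off.
print(eigvals.min())
```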

A performance evaluation of kernel methods

To investigate the effectiveness of the kernels, we conducted a wide range of experiments on binary as well as multiclass data sets, as shown in Appendix A. We examined 112 classification problems from the UCI Repository [4] and the Knowledge Discovery Central [13] database. We use 10-fold cross-validation [27] for those data sets with fewer than 1000 samples (68% of the data sets); otherwise we use the hold-out method, with 70% of the data randomly extracted for training and the remainder reserved
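The following sketch re-creates this testing protocol under the stated assumptions, with default SVM parameters and scikit-learn's SVC as a stand-in for the paper's implementation:

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate(X, y, kernel):
    """Accuracy of an SVM with the given kernel under the paper's protocol:
    10-fold CV below 1000 samples, otherwise a 70/30 hold-out split."""
    clf = SVC(kernel=kernel)  # default parameters throughout
    if len(y) < 1000:
        return cross_val_score(clf, X, y, cv=10).mean()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.7, random_state=0)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)
```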

Data sets characteristics measurement

Each data set can be described by simple, distance- and distribution-based statistical measures [24], [23]. Let $X_{k,j}^{i}$ be the value of the jth variable (column) in the kth example (row) of data set i. These three types of measures characterise the data set matrix in different ways. Firstly, the simple classical statistical measures identify the data characteristics based on variable-to-variable comparisons (i.e. comparisons between columns of the data set). Then, the distance-based measures
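As an illustration, the sketch below computes a small subset of such measures for a data matrix X. The particular statistics are our own examples of the three families; the paper's full set of measures is considerably larger:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.spatial.distance import pdist

def characterise(X):
    """A few illustrative meta-features of a numeric data matrix X
    (rows = examples, columns = variables)."""
    return {
        # simple measure: variable-to-variable (column) comparisons
        "mean_correlation": np.mean(np.abs(np.corrcoef(X, rowvar=False))),
        # distance-based measure: average pairwise example distance
        "mean_distance": pdist(X).mean(),
        # distribution-based measures: shape of each variable
        "mean_skewness": np.mean(skew(X, axis=0)),
        "mean_kurtosis": np.mean(kurtosis(X, axis=0)),
    }
```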

Rule generation

Rule-based learning algorithms, especially decision trees (also called classification trees or hierarchical classifiers), follow a divide-and-conquer, top-down induction approach that has attracted considerable interest in the machine learning community. Quinlan [21] introduced the C4.5 and C5.0 algorithms to solve classification problems. C5.0 works in three main steps. First, the root node at the top of the tree considers all samples and passes them through to the second node, called
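Since C5.0 is a commercial product, the sketch below uses scikit-learn's CART-style DecisionTreeClassifier as a freely available stand-in, with randomly generated placeholder meta-data in place of the real characteristics matrix and kernel winners; export_text prints each root-to-leaf path, which reads as one "if ... then use kernel k" rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy meta-data: one row per data set, one column per assumed meta-feature.
rng = np.random.default_rng(1)
M = rng.normal(size=(112, 4))  # 112 problems, as in the paper
labels = rng.choice(["linear", "poly", "rbf"], size=112)  # placeholder winners

names = ["mean_correlation", "mean_distance",
         "mean_skewness", "mean_kurtosis"]  # assumed meta-feature names

tree = DecisionTreeClassifier(max_depth=3).fit(M, labels)
print(export_text(tree, feature_names=names))
# Each printed root-to-leaf path is one kernel selection rule.
```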

Conclusions

In this paper, we have presented a new rule-based method for automatic kernel selection based on statistical measures of the data sets and extensive empirical performance results. The method is simple and efficient, and reduces the computational cost of kernel selection. Empirical results on a wide range of problems indicate that all the generated rules are acceptable, given their high accuracy. Most of the descriptive statistical measures are more appropriate for

Acknowledgements

The authors are grateful to the suggestions of the two anonymous reviewers and editor which greatly improved the paper.


References (33)

  • S. Ali et al.

    On learning algorithm selection for classification

    Int. J. Appl. Soft Comput.

    (2006)
  • R.P.W. Duin

    A note on comparing classifiers

    Pattern Recogn. Lett.

    (1996)
  • E. Parrado-Hernandez et al.

    Growing support vector classifiers with controlled complexity

    Pattern Recogn. Lett.

    (2003)
  • S. Ali, Automated support vector learning algorithms, PhD Thesis, Monash University,...
  • K.P. Bennett et al.

    On support vector decision trees for database marketing

    IEEE International Joint Conference on Neural Networks (IJCNN ‘99)

    (1999)
  • C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, CA, 2002....
  • B.E. Boser et al.

    A training algorithm for optimal margin classifiers

  • C. Cortes et al.

    Support-vector networks

    Mach. Learning

    (1995)
  • T. Evgeniou et al.

    Regularization networks and support vector machines

    Adv. Comput. Math.

    (1999)
  • S.R. Gunn

    Support Vector Machines for Classification and Regression, ISIS Technical Report

    (1998)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machines

    Mach. Learning

    (2002)
  • K. Hyun-Chul, P. Shaoning, J. Hong-Mo, K. Daijin, B. Sung Yang, Pattern classification using support vector machine...
  • J.D. Jobson
    (1991)
  • T.-S. Lim, Knowledge Discovery Central, Data Sets, http://www.KDCentral.com/,...
  • W. Mendenhall et al.

    Statistics for Engineering and the Sciences

    (1995)
  • G. McLachlan

    Discriminant Analysis and Statistical Pattern Recognition

    (1992)

    Dr. Shawkat Ali is a lecturer in the School of Information Systems at Central Queensland University, Rockhampton, Australia. He holds a B.Sc. (Hons.) and M.Sc. in Applied Physics and Electronics, and an M.Phil. in Computer Science and Technology from the University of Rajshahi, Bangladesh, and a Ph.D. in Information Technology from Monash University, Australia. He was also an Assistant Professor at Islamic University, Bangladesh, where he worked for 4 years prior to joining Monash University in 2001. Dr. Ali has published a number of refereed journal and international conference papers in the areas of support vector machines, data mining and telecommunications.

    Prof. Kate Smith-Miles is a Professor and Head of the School of Engineering and Information Technology at Deakin University in Australia. She obtained a B.Sc.(Hons) in Mathematics and a Ph.D. in Electrical Engineering, both from the University of Melbourne, Australia. She was also a Professor at Monash University, Australia, and co-Director of the Monash Data Mining Centre where she worked for ten years prior to joining Deakin University in 2006. Kate has published 2 books on neural networks and data mining in business, and over 150 refereed journal and international conference papers in the areas of neural networks, combinatorial optimisation, intelligent techniques and data mining. She has been awarded over AUD$1.5 million in competitive grants, including 7 Australian Research Council grants and industry awards. She is on the editorial board of several international journals, and has been a member of the organising committee for over 40 international data mining and neural network conferences. In addition to her academic activities, she also regularly acts as a consultant to industry in the areas of data mining and intelligent techniques.
