A meta-learning approach to automatic kernel selection for support vector machines
Introduction
Recently, researchers in the area of pattern recognition have given increasing attention to kernel-based learning algorithms due to their strong performance in areas such as bioinformatics [10], text mining [19], fraud detection [11], speaker identification [32] and database marketing [3], amongst many others. Support vector machines (SVMs) [28], [6], [5], first introduced by Vapnik and his group in the mid-1990s, are one of the most popular kernel-based learning algorithms. The SVM is an optimal hyperplane (OH)-based statistical learning method that solves both classification and regression problems. The performance of the SVM, however, depends on the suitable selection of a kernel. The kernel is one of the most important features of the SVM algorithm: it generates the dot products in a higher-dimensional feature space, which could theoretically be of infinite dimension, where linear discrimination is possible. A good number of kernels have been proposed to date, but little research has been conducted on how to choose an appropriate kernel for a given problem [22], [20]. Clearly, there is a need for automatic kernel selection methods.
Our present research is a first step towards addressing this issue. Our aim is to answer the following question: given the characteristics of a classification data set, and default parameter settings of the SVM, which kernel is likely to produce the most accurate results? Our methodology seeks to understand the characteristics of the data (classification problem), understand which kernels perform well on which types of problems, and generate rules to assist in the automatic selection of kernels for SVMs. This is a meta-learning approach [31]. First, we construct the data set characteristics matrix using statistical measures, as we have done in previous related work [24], [23]. All the statistical formulations are available in the Matlab Statistics Toolbox [25]. We then build models for 112 classification problems (see Appendix A) from the UCI Repository [4] and the Knowledge Discovery Central [13] database using SVMs with five different kernels, employing a cross-validation testing methodology. Finally, we use the induction algorithm C5.0 (Windows version See5, http://www.rulequest.com/see5-info.html) to generate rules describing which of the five kernels is most suitable for which type of problem, given the data set characteristics and the performance of each kernel on each data set. We also evaluate the rules using 10-fold cross-validation.
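The overall pipeline can be sketched as follows. This is an illustrative sketch only, using scikit-learn's SVC in place of the authors' SVM implementation; the kernel list and default parameters are assumptions, not the paper's exact experimental setup.

```python
# Hedged sketch of the meta-learning pipeline: for one data set, evaluate
# each candidate kernel by cross-validation and record the winner. Repeated
# over many data sets, the (meta-features, winner) pairs become the training
# data for the rule learner.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # stand-in for one UCI problem

# Step 1: evaluate each candidate kernel with 10-fold cross-validation.
kernels = ["linear", "rbf", "poly", "sigmoid"]
scores = {k: cross_val_score(SVC(kernel=k), X, y, cv=10).mean() for k in kernels}

# Step 2: the best-performing kernel becomes this data set's class label
# in the meta-learning training set.
best_kernel = max(scores, key=scores.get)
print(best_kernel, round(scores[best_kernel], 3))
```

In practice the label for each of the 112 problems would be paired with that problem's statistical meta-features before rule induction.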
Our paper is organised as follows: in Section 2, we provide some theoretical background on SVMs. Section 3 focuses on a theoretical analysis of kernels, and draws a distinction between the five kernels considered in this paper. A comprehensive performance evaluation of all five kernels on the set of 112 classification problems is then presented in Section 4. Section 5 describes the statistical measures and methodology used to generate the data set characteristics matrix. The performance results and the data characteristics are then combined using the rule-based learning algorithm C5.0, and the rules for automatic kernel selection are presented and evaluated in Section 6. Finally, we conclude our research in Section 7.
Section snippets
Support vector machine
Let us consider a binary classification task with data matrix D = {(x1, y1), …, (xN, yN)}, xk ∈ Rⁿ, having corresponding targets yk ∈ {−1, +1}. Our aim is to find the OH in the feature space for this data matrix. The OH separates the classes of data points without error by maximising the distance between the closest vectors, as shown in Fig. 1.
We refer the interested reader to [30], [29] for a more comprehensive discussion of SVMs, and their underlying mathematics. For the purposes of this
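The optimal hyperplane can be illustrated with a minimal example: fit a (nearly) hard-margin linear SVM on a toy 2-D problem and inspect the support vectors and geometric margin. The toy data and the large C value are illustrative choices, not taken from the paper.

```python
# Minimal sketch of the optimal separating hyperplane. With a separating
# hyperplane w.x + b = 0 normalised so that |w.x + b| = 1 at the closest
# points, the geometric margin is 1 / ||w||, which the SVM maximises.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin
w, b = clf.coef_[0], clf.intercept_[0]

margin = 1.0 / np.linalg.norm(w)              # distance from OH to closest vectors
print(len(clf.support_vectors_), round(margin, 3))
```

Only the closest vectors (the support vectors) determine the hyperplane; the remaining points could be removed without changing the solution.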
Kernel theory
Since the development of the SVM as an effective classifier for binary class problems, great interest has been generated in generalising the linear decision rule to non-linear ones using kernel functions. A kernel function is a transformation function that satisfies Mercer's theorem [29], which requires the kernel matrix to be positive semi-definite, that is, to have only non-negative eigenvalues. Linear methods such as principal component analysis (PCA) and Fisher
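Mercer's condition can be checked numerically: a valid kernel yields a positive semi-definite Gram matrix, so all eigenvalues are non-negative up to floating-point error. The following sketch demonstrates this for the RBF kernel on random data (the data and bandwidth are arbitrary illustrative choices).

```python
# Verify positive semi-definiteness of an RBF Gram matrix numerically.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

def rbf_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-gamma * d2)

K = rbf_gram(X)
eigvals = np.linalg.eigvalsh(K)          # symmetric eigensolver
print(eigvals.min() >= -1e-10)           # no significantly negative eigenvalues
```

A function that fails this check (e.g. the sigmoid kernel for some parameter settings) is not a Mercer kernel, and the corresponding SVM optimisation problem is no longer guaranteed to be convex.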
A performance evaluation of kernel methods
To investigate the effectiveness of the kernels, we conducted a wide range of experiments on binary as well as multiclass data sets, as shown in Appendix A. We examined 112 classification problems from the UCI Repository [4] and the Knowledge Discovery Central [13] database. We use 10-fold cross-validation [27] for those data sets with fewer than 1000 samples (68% of the data sets); otherwise we use the hold-out method, with 70% of the data randomly extracted for training and the remainder reserved
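The evaluation protocol described above can be sketched directly: 10-fold cross-validation for data sets with fewer than 1000 samples, otherwise a 70/30 hold-out split. The example data set, the feature scaling, and the random seed are illustrative choices, not the paper's setup.

```python
# Hedged sketch of the paper's testing methodology, switching protocol on
# data set size.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X, y, clf):
    """Estimate accuracy with the protocol matching the data set size."""
    if len(X) < 1000:
        return cross_val_score(clf, X, y, cv=10).mean()        # 10-fold CV
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.7, random_state=0)                  # 70/30 hold-out
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

X, y = load_breast_cancer(return_X_y=True)   # 569 samples -> the CV branch
acc = evaluate(X, y, make_pipeline(StandardScaler(), SVC(kernel="linear")))
```

Hold-out is used for the larger data sets because repeated refitting in cross-validation becomes expensive as the sample count grows.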
Data sets characteristics measurement
Each data set can be described by simple, distance and distribution-based statistical measures [24], [23]. Let x(i)_kj be the value of the jth variable (column) in the kth example (row) of data set i. These three types of measures characterise the data set matrix in different ways. Firstly, the simple classical statistical measures identify the data characteristics based on variable-to-variable comparisons (i.e. comparisons between columns of the data set). Then, the distance based measures
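The "simple" statistical measures can be sketched as per-data-set summary statistics. The three measures below (mean absolute inter-variable correlation, mean skewness, mean kurtosis) are illustrative examples in the spirit of the paper's meta-features; the exact measure set used by the authors is larger and computed in the Matlab Statistics Toolbox.

```python
# Hedged sketch: three simple column-wise meta-features for one data set.
import numpy as np
from scipy import stats

def simple_meta_features(X):
    """Summary statistics over the variables (columns) of data set X."""
    corr = np.corrcoef(X, rowvar=False)                 # variable-to-variable
    off_diag = corr[~np.eye(len(corr), dtype=bool)]     # drop self-correlations
    return {
        "mean_abs_corr": float(np.mean(np.abs(off_diag))),
        "mean_skewness": float(np.mean(stats.skew(X, axis=0))),
        "mean_kurtosis": float(np.mean(stats.kurtosis(X, axis=0))),
    }

rng = np.random.default_rng(1)
feats = simple_meta_features(rng.normal(size=(200, 4)))  # synthetic data set
```

Computed once per data set, such vectors form the rows of the data set characteristics matrix that the rule learner consumes.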
Rule generation
Rule-based learning algorithms, especially decision trees (also called classification trees or hierarchical classifiers), follow a divide-and-conquer, top-down induction approach that has attracted considerable interest in the machine learning community. [21] introduced the C4.5 and C5.0 algorithms to solve classification problems. C5.0 works in three main steps. First, the root node at the top of the tree considers all samples and passes them through to the second node, called
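C5.0 itself is commercial software (See5); as a hedged stand-in, scikit-learn's CART decision tree illustrates the same idea of learning if-then rules that map data set meta-features to the best kernel. The meta-data set, feature names, and the embedded "true" rule below are all synthetic, for illustration only.

```python
# Hedged sketch of rule induction over a synthetic meta-data set:
# rows = data sets, columns = meta-features, label = best kernel.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
meta_X = rng.uniform(size=(100, 2))                       # 2 meta-features
meta_y = np.where(meta_X[:, 0] > 0.5, "rbf", "linear")    # synthetic rule

tree = DecisionTreeClassifier(max_depth=2).fit(meta_X, meta_y)
rules = export_text(tree, feature_names=["mean_abs_corr", "mean_skewness"])
print(rules)   # human-readable if-then rules, analogous in spirit to C5.0 output
```

The printed tree recovers a threshold rule of the form "if mean_abs_corr > 0.5 then rbf else linear", which is the kind of interpretable kernel-selection rule the paper derives from its 112 problems.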
Conclusions
In this paper, we have presented a new rule-based method for automatic kernel selection based on statistical measures of the data sets and extensive empirical performance results. This method is very simple and efficient, and reduces the computational complexity of kernel selection. Empirical results on a wide range of problems indicate that all rules are acceptable due to their high accuracy. Most of the descriptive statistical measures are more appropriate for
Acknowledgements
The authors are grateful for the suggestions of the two anonymous reviewers and the editor, which greatly improved the paper.
References (33)
- et al., On learning algorithm selection for classification, Int. J. Appl. Soft Comput. (2006)
- A note on comparing classifier, Pattern Recogn. Lett. (1996)
- et al., Growing support vector classifiers with controlled complexity, Pattern Recogn. Lett. (2003)
- S. Ali, Automated support vector learning algorithms, PhD Thesis, Monash University, ...
- et al., On support vector decision trees for database marketing, IEEE International Joint Conference on Neural Networks (IJCNN '99) (1999)
- C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, CA, 2002. ...
- et al., A training algorithm for optimal margin classifiers
- et al., Support vector networks, Mach. Learning (1995)
- et al., Regularization networks and support vector machines, Adv. Comput. Math. (1999)
- Support Vector Machines for Classification and Regression, ISIS Technical Report (1998)
- Gene selection for cancer classification using support vector machines, Mach. Learning
- Statistics for Engineering and the Sciences
- Discriminant Analysis and Statistical Pattern Recognition
Dr. Shawkat Ali is a lecturer in the School of Information Systems at Central Queensland University, Rockhampton, Australia. He holds a B.Sc. (Hons.) and M.Sc. in Applied Physics and Electronics, and an M.Phil. in Computer Science and Technology from the University of Rajshahi, Bangladesh, and a Ph.D. in Information Technology from Monash University, Australia. He was also an Assistant Professor at Islamic University, Bangladesh, where he worked for 4 years prior to joining Monash University in 2001. Dr. Ali has published numerous refereed journal and international conference papers in the areas of support vector machines, data mining and telecommunication.
Prof. Kate Smith-Miles is a Professor and Head of the School of Engineering and Information Technology at Deakin University in Australia. She obtained a B.Sc.(Hons) in Mathematics and a Ph.D. in Electrical Engineering, both from the University of Melbourne, Australia. She was also a Professor at Monash University, Australia, and co-Director of the Monash Data Mining Centre where she worked for ten years prior to joining Deakin University in 2006. Kate has published 2 books on neural networks and data mining in business, and over 150 refereed journal and international conference papers in the areas of neural networks, combinatorial optimisation, intelligent techniques and data mining. She has been awarded over AUD$1.5 million in competitive grants, including 7 Australian Research Council grants and industry awards. She is on the editorial board of several international journals, and has been a member of the organising committee for over 40 international data mining and neural network conferences. In addition to her academic activities, she also regularly acts as a consultant to industry in the areas of data mining and intelligent techniques.