An Investigation into the Interaction Between Feature Selection and Discretization: Learning How and When to Read Numbers

Ghodke, Sumukh; Baldwin, Timothy

doi:10.1007/978-3-540-76928-6_7

Sumukh Ghodke¹ & Timothy Baldwin¹,²

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4830)

Abstract

Pre-processing is an important part of machine learning, and has been shown to significantly improve the performance of classifiers. In this paper, we take a selection of pre-processing methods, focusing specifically on discretization and feature selection, and empirically examine their combined effect on classifier performance. In our experiments, we take 11 standard datasets and a selection of standard machine learning algorithms, namely one-R, ID3, naive Bayes, and IB1, and explore the impact of different forms of pre-processing on each combination of dataset and algorithm. We find that, in general, the combination of wrapper-based forward selection and naive supervised methods of discretization yields consistently above-baseline results.
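
To make the two pre-processing steps named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes equal-width binning as a simple stand-in discretizer (an unsupervised method, so not the "naive supervised" discretization the paper evaluates) and a 1-nearest-neighbour learner scored by leave-one-out accuracy as the wrapper's evaluation function. All function names and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def equal_width_bins(values: Sequence[float], n_bins: int = 2) -> List[int]:
    """Map each numeric value to a bin index in [0, n_bins - 1].

    Assumption: equal-width binning is used only as an illustrative,
    unsupervised discretizer; it is not the paper's method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def loo_accuracy_1nn(rows: Sequence[Sequence[int]],
                     labels: Sequence[str],
                     features: Sequence[int]) -> float:
    """Leave-one-out accuracy of 1-NN with Hamming distance on `features`."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum(row[f] != other[f] for f in features)
            # Ties keep the earliest neighbour encountered.
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def forward_select(rows: Sequence[Sequence[int]],
                   labels: Sequence[str],
                   score: Callable[..., float]) -> Tuple[List[int], float]:
    """Wrapper-based forward selection: greedily add the feature that most
    improves the learner's score, and stop when no addition improves it."""
    n_features = len(rows[0])
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(score(rows, labels, selected + [f]), f) for f in candidates]
        # Prefer the higher score; break ties by the lower feature index.
        top_score, top_f = max(scored, key=lambda t: (t[0], -t[1]))
        if top_score <= best:
            break
        selected.append(top_f)
        best = top_score
    return selected, best


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
col0 = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
col1 = [3.0, 7.0, 5.0, 3.1, 6.9, 5.2]
labels = ["a", "a", "a", "b", "b", "b"]

# Discretize each column, then assemble one row of bin indices per instance.
rows = list(zip(equal_width_bins(col0), equal_width_bins(col1)))
selected, best = forward_select(rows, labels, loo_accuracy_1nn)
print(selected, best)
```

On this toy data the search keeps only the first feature, because adding the second lowers leave-one-out accuracy. The wrapper design means the score function can be swapped for any learner's cross-validated accuracy without changing the search loop.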

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, University of Melbourne, VIC 3010, Australia
Sumukh Ghodke & Timothy Baldwin
NICTA Victoria Laboratories, University of Melbourne, VIC 3010, Australia
Timothy Baldwin

Editor information

Mehmet A. Orgun, John Thornton

About this paper

Cite this paper

Ghodke, S., Baldwin, T. (2007). An Investigation into the Interaction Between Feature Selection and Discretization: Learning How and When to Read Numbers. In: Orgun, M.A., Thornton, J. (eds) AI 2007: Advances in Artificial Intelligence. AI 2007. Lecture Notes in Computer Science, vol 4830. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76928-6_7

DOI: https://doi.org/10.1007/978-3-540-76928-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76926-2
Online ISBN: 978-3-540-76928-6
eBook Packages: Computer Science, Computer Science (R0)