Abstract
The choice of kernel function is crucial to most applications of support vector machines (SVMs). In this paper, however, we show that in text classification, term-frequency transformations have a larger impact on SVM performance than the kernel itself. We discuss the role of importance weights (e.g. document frequency and redundancy), which is not yet fully understood in the light of model complexity and calculation cost, and we show that time-consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language like German.
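To make the notion of a term-frequency transformation combined with an importance weight concrete, the following is a minimal sketch and not the specific scheme evaluated in the paper. It assumes a logarithmic damping of raw counts, an inverse-document-frequency weight, and L2 normalization, all of which are common textbook choices; the function name `tf_idf_vectors` and the toy German documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map tokenized documents to L2-normalized log-TF x IDF vectors (as dicts).

    Assumed forms (not taken from the paper): tf' = log(1 + tf),
    idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Damped term frequency multiplied by the importance weight.
        weights = {
            term: math.log(1 + freq) * math.log(n_docs / df[term])
            for term, freq in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Normalize to unit length; leave all-zero vectors unchanged.
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


# Toy corpus: inflected forms ("hund", "hunde") are kept as distinct
# features, i.e. no stemming or lemmatization is applied.
docs = [
    "der hund bellt".split(),
    "die hunde bellen laut".split(),
    "der kurs der aktie steigt".split(),
]
vectors = tf_idf_vectors(docs)
```

In this sketch a term that occurs in fewer documents receives a larger weight than a term of equal frequency that occurs in many, and the resulting sparse vectors could be passed to any SVM implementation that accepts a feature mapping.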
Cite this article
Leopold, E., Kindermann, J. Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? Machine Learning 46, 423–444 (2002). https://doi.org/10.1023/A:1012491419635
DOI: https://doi.org/10.1023/A:1012491419635