Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?

Published: January 2002

Machine Learning, volume 46, pages 423–444 (2002)

Edda Leopold¹ & Jörg Kindermann¹

Abstract

The choice of the kernel function is crucial to most applications of support vector machines (SVMs). In this paper, however, we show that in the case of text classification, term-frequency transformations have a larger impact on SVM performance than the kernel itself. We discuss the role of importance weights (e.g. document frequency and redundancy), which is not yet fully understood in the light of model complexity and calculation cost, and we show that time-consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language like German.
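As a rough illustration of what a term-frequency transformation combined with an importance weight can look like, the sketch below maps tokenized documents to unit-length vectors using a logarithmic frequency transformation and an inverse-document-frequency weight. The function name `log_tf_idf_vectors`, the specific formulas, and the toy German sentences are assumptions made for illustration only; they are not taken from the paper, whose exact transformations and redundancy weight are not reproduced here. No stemming or lemmatization is applied, so inflected word forms are treated as distinct terms.

```python
import math
from collections import Counter


def log_tf_idf_vectors(docs):
    """Map tokenized documents to L2-normalized log-TF x IDF vectors (as dicts).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Logarithmic term-frequency transformation, weighted by
        # inverse document frequency as the importance weight.
        weighted = {
            term: math.log(1 + freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        }
        # Scale to unit Euclidean length so long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        vectors.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in weighted.items()}
        )
    return vectors


# Toy corpus (assumed data): raw word forms, no lemmatization.
docs = [
    "der hund jagt die katze".split(),
    "die katze jagt die maus".split(),
    "der kurs der aktie steigt".split(),
]
vectors = log_tf_idf_vectors(docs)
```

Vectors of this kind could then be passed to any SVM implementation; terms that occur in fewer documents, such as `hund` in the first toy document, receive a larger weight than terms shared across documents.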

References

Altmann, G. (1988). Wiederholungen in Texten [Repetitions in texts]. Bochum, Germany: Brockmeyer.
Balasubrahmanyan, V. K. & Naranan, S. (1996). Quantitative linguistics and complex system studies. Journal of Quantitative Linguistics, 3:3, 177-228.
Bookstein, A. & Swanson, Don R. (1974). Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 25, 312-318.
Chitashvili, R. J. & Baayen, R. H. (1993). Word frequency distributions. In G. Altmann & L. Hřebíček (Eds.). Quantitative Text Analysis (pp. 46-135). Trier, Germany: wvt.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information Retrieval and Knowledge Management (ACM-CIKM-98) (pp. 148-155).
Grotjahn, R. (1982). Ein statistisches Modell für die Verteilung der Wortlänge [A statistical model for the distribution of word length]. Zeitschrift für Sprachwissenschaft, 1, 44-75.
Harter, S. P. (1975). A probabilistic approach to automatic keyword indexing, Part I. Journal of the American Society for Information Science, 26, 197-206.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the Tenth European Conference on Machine Learning (ECML '98), Lecture Notes in Computer Science, Number 1398 (pp. 137-142).
Králík, J. (1977). An application of exponential distribution law in quantitative linguistics. Prague Studies in Mathematical Linguistics, 5, 223-235.
Krylov, Ju. K. (1995). A stationary model of coherent text generation. Journal of Quantitative Linguistics, 2:2, 157-167.
Lezius, W., Rapp, R., & Wettler, M. (1998). A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In Proceedings of the COLING-ACL 1998 (pp. 743-747).
Mandelbrot, B. (1953). On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (Ed.), Structure of Language and its Mathematical Aspects, Proceedings of Symposia in Applied Mathematics (Vol. XII, pp. 190-210). Providence, RI: American Mathematical Society.
Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Margulis, E. L. (1993). Modelling documents with multiple Poisson distributions. Information Processing and Management, 29, 215-228.
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:1/2, 103-134.
Orlov, Ju. K. (1982). Linguostatistik: Aufstellung von Sprachnormen oder Analyse des Redeprozesses? (Die Antinomie 'Sprache-Rede' in der statistischen Linguistik) [Linguostatistics: Establishing language norms or analysis of the speech process? (The antinomy 'language-speech' in statistical linguistics)]. In Ju. K. Orlov, M. G. Boroda, & I. S. Nadarejšvili (Eds.). Sprache, Text, Kunst. Quantitative Analysen (pp. 1-55). Bochum, Germany: Brockmeyer.
Porter, M. F. (1980). An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14:3, 130-137.
Rieger, B. B. (1999). Semiotics and computational linguistics. On semiotic cognitive information processing. In Zadeh, L. A. & J. Kacprzyk (Eds.). Computing with words in information/intelligent systems I. foundations (pp. 93-118). Heidelberg, Germany: Physica.
Salton, G. & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw Hill.
Stricker, M., Vichot, F., Dreyfus, G., & Wolinski, F. (2000). Vers la conception automatique de filtres d'informations efficaces [Towards the automatic design of efficient custom filters]. In Reconnaissance des Formes et Intelligence Artificielle (RFIA '2000) (pp. 129-137).
Vapnik, Vladimir N. (1998). Statistical learning theory. New York: Wiley.
Wimmer, G., Köhler, R., Grotjahn, R., & Altmann, G. (1994). Towards a theory of word length distribution. Journal of Quantitative Linguistics, 1, 98-106.
Zipf, G. K. (1949). Human behavior and the principle of least effort. An introduction to human ecology. Cambridge, MA: Addison-Wesley.

Author information

Authors and Affiliations

GMD German National Research Center for Information Technology, Institute for Autonomous intelligent Systems, Schloss Birlinghoven, D-53754, Sankt Augustin, Germany
Edda Leopold & Jörg Kindermann

About this article

Cite this article

Leopold, E., Kindermann, J. Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? Machine Learning 46, 423–444 (2002). https://doi.org/10.1023/A:1012491419635

Issue Date: January 2002
DOI: https://doi.org/10.1023/A:1012491419635