ABSTRACT
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tf·idf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
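The core idea above (replacing idf with the term selection function, here information gain) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the corpus, the binary term-presence formulation of IG, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """IG of the binary presence/absence of `term` w.r.t. category labels."""
    def entropy(items):
        counts = Counter(items)
        return -sum((v / len(items)) * math.log2(v / len(items))
                    for v in counts.values())

    n = len(docs)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    # H(C) - H(C | term): entropy reduction from observing the term
    conditional = sum((len(part) / n) * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

def stw_weight(doc, term, docs, labels):
    """Supervised variant of tf-idf: tf(term, doc) * IG(term)."""
    return doc.count(term) * information_gain(docs, labels, term)

# Toy training corpus: two categories, two documents each (hypothetical data)
docs = [["wheat", "farm"], ["wheat", "export"],
        ["stock", "market"], ["stock", "farm"]]
labels = ["grain", "grain", "finance", "finance"]
```

On this toy corpus, "wheat" perfectly separates the categories and receives IG = 1, while "farm" occurs in both categories and receives IG = 0, so under tf·IG weighting the uninformative term vanishes from the document vectors; chi-square or gain ratio could be swapped in for `information_gain` in the same way.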