DOI: 10.1145/952532.952688
Article

Supervised term weighting for automated text categorization

Published: 09 March 2003

ABSTRACT

The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e., that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tf-idf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
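
The idea of replacing idf with a term selection function can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes chi-square as the selection function, a single binary category (roughly the "local" setting), and a log-scaled term frequency, none of which the abstract specifies in detail. The function names `chi_square` and `supervised_weights` are hypothetical.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term or category never varies
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def supervised_weights(docs, labels):
    """Weight each term as tf * chi-square instead of tf * idf.

    docs:   list of token lists (training documents)
    labels: list of booleans, True if the document belongs to the category
    Returns (score, weighted): per-term chi-square scores and one
    {term: weight} dict per document.
    """
    n_docs = len(docs)
    n_pos = sum(labels)
    df = Counter()      # document frequency over all training documents
    df_pos = Counter()  # document frequency within the category
    for tokens, is_pos in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if is_pos:
                df_pos[term] += 1

    score = {}
    for term in df:
        n11 = df_pos[term]
        n10 = df[term] - n11
        n01 = n_pos - n11
        n00 = n_docs - n_pos - n10
        score[term] = chi_square(n11, n10, n01, n00)

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        # Assumption: log-scaled tf; the category-aware score replaces idf.
        weighted.append(
            {t: (1 + math.log(c)) * score[t] for t, c in counts.items()}
        )
    return score, weighted
```

In this sketch a term that occurs equally often inside and outside the category receives a weight of zero, whereas plain idf would weight it by its overall rarity alone, without looking at the labels. Swapping `chi_square` for information gain or gain ratio would give the other variants the abstract mentions.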


  • Published in

    SAC '03: Proceedings of the 2003 ACM symposium on Applied computing
    March 2003
    1268 pages
    ISBN: 1581136242
    DOI: 10.1145/952532

    Copyright © 2003 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Qualifiers

    • Article

    Acceptance Rates

    Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
