ABSTRACT
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents in categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tf·idf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.
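The core idea above (replacing idf with the term selection function, here information gain) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the corpus, the binary term-presence formulation of IG, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """IG of the binary presence/absence of `term` w.r.t. category labels."""
    def entropy(items):
        counts = Counter(items)
        return -sum((v / len(items)) * math.log2(v / len(items))
                    for v in counts.values())

    n = len(docs)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    # H(C) - H(C | term): entropy reduction from observing the term
    conditional = sum((len(part) / n) * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

def stw_weight(doc, term, docs, labels):
    """Supervised variant of tf-idf: tf(term, doc) * IG(term)."""
    return doc.count(term) * information_gain(docs, labels, term)

# Toy training corpus: two categories, two documents each (hypothetical data)
docs = [["wheat", "farm"], ["wheat", "export"],
        ["stock", "market"], ["stock", "farm"]]
labels = ["grain", "grain", "finance", "finance"]
```

On this toy corpus, "wheat" perfectly separates the categories and receives IG = 1, while "farm" occurs in both categories and receives IG = 0, so under tf·IG weighting the uninformative term vanishes from the document vectors; chi-square or gain ratio could be swapped in for `information_gain` in the same way.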