Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

  • Conference paper
Advances in Computing and Network Communications

Abstract

Keyword extraction has received increasing attention as a research topic, since it can advance diverse applications such as document context categorization, text indexing, and document classification. In this paper we propose STF-IDF, a novel semantic method based on TF-IDF for scoring word importance in the informal documents of a corpus. A set of nearly four million documents from health-care social media was collected and used to train a semantic model and obtain word embeddings. The features of the semantic space were then used to rearrange the original TF-IDF scores through an iterative solution, in order to improve the moderate performance of TF-IDF on informal texts. Tested on 160 randomly chosen documents, the proposed method reduced the mean error rate of TF-IDF by about 50%, reaching a mean error of 13.7% as opposed to 27.2% for the original TF-IDF.
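
For readers unfamiliar with the baseline, the following is a minimal sketch of classic TF-IDF scoring, the starting point that the abstract says STF-IDF readjusts. It is not the authors' STF-IDF method: the semantic readjustment step is not described in the abstract and is not reproduced here. The function name `tf_idf`, the toy documents, and the choice of relative term frequency with an unsmoothed natural-log IDF are assumptions made for illustration only. The last line checks the relative error reduction implied by the two figures reported in the abstract.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (hypothetical, not from the paper's data set).
docs = [
    "the patient reported mild headache".split(),
    "the doctor prescribed rest".split(),
    "the headache returned after rest".split(),
]
scores = tf_idf(docs)

# Relative reduction implied by the reported mean errors (27.2% -> 13.7%).
relative_reduction = (27.2 - 13.7) / 27.2
```

In this sketch a term that occurs in every document, such as "the", receives a score of zero, while a term confined to one document scores highest. The reported figures correspond to a relative reduction of just under one half, which matches the "about 50%" stated in the abstract.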



Author information

Corresponding author

Correspondence to Vinicius Fernandes Caridá.

Ethics declarations

The current method was proposed and tested by a group of data scientists from Itaú Unibanco. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy or position of Itaú Unibanco.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Jalilifard, A., Caridá, V.F., Mansano, A.F., Cristo, R.S., da Fonseca, F.P.C. (2021). Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In: Thampi, S.M., Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, KC. (eds) Advances in Computing and Network Communications. Lecture Notes in Electrical Engineering, vol 736. Springer, Singapore. https://doi.org/10.1007/978-981-33-6987-0_27

  • DOI: https://doi.org/10.1007/978-981-33-6987-0_27

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6986-3

  • Online ISBN: 978-981-33-6987-0

  • eBook Packages: Engineering, Engineering (R0)
