Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Jalilifard, Amir; Caridá, Vinicius Fernandes; Mansano, Alex Fernandes; Cristo, Rogers S.; da Fonseca, Felipe Penhorate Carvalho

doi:10.1007/978-981-33-6987-0_27

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 736)

Abstract

Keyword extraction has received increasing attention as a research topic that can advance diverse applications such as document context categorization, text indexing, and document classification. In this paper we propose STF-IDF, a novel semantic method based on TF-IDF, for scoring the importance of words in informal documents in a corpus. A set of nearly four million documents from health-care social media was collected and used to train a semantic model and obtain word embeddings. The features of the semantic space were then used to rearrange the original TF-IDF scores through an iterative solution, so as to improve the moderate performance of this algorithm on informal texts. When tested on 160 randomly chosen documents, the proposed method roughly halved the TF-IDF mean error rate, reaching a mean error of 13.7% as opposed to 27.2% for the original TF-IDF.
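
For orientation, the baseline that STF-IDF starts from can be sketched as follows. This is a minimal illustration of classic TF-IDF only, assuming the common formulation tf(t, d) × log(N / df(t)); it is not the authors' STF-IDF, whose semantic rescoring step is not specified in this abstract. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document (classic TF-IDF).

    Assumed formulation: (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of informal, health-related posts (not the paper's data).
corpus = [
    "my head hurts and i feel dizzy".split(),
    "i feel fine today".split(),
    "the doctor said my head scan is fine".split(),
]

scores = tf_idf(corpus)
# Highest-scoring terms of the first document.
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Because these scores depend only on term counts, a rare but uninformative word can rank as high as a genuinely relevant one; that count-only behavior is what a semantic adjustment of the scores would aim to correct.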

Author information

Authors and Affiliations

Data Science Team—Digital Customer Service, Itaú Unibanco, São Paulo, Brazil
Vinicius Fernandes Caridá, Alex Fernandes Mansano, Rogers S. Cristo & Felipe Penhorate Carvalho da Fonseca
Federal University of Minas Gerais, Belo Horizonte, Brazil
Amir Jalilifard

Corresponding author

Correspondence to Vinicius Fernandes Caridá.

Editor information

Editors and Affiliations

School of CSE, Indian Institute of Information Technology and Management-Kerala, Trivandrum, Kerala, India
Sabu M. Thampi
Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Gliwice, Poland
Erol Gelenbe
School of Computer Science, University of Oklahoma, Norman, OK, USA
Mohammed Atiquzzaman
Department of Computer Science, University at Buffalo, State University, Buffalo, NY, USA
Vipin Chaudhary
Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
Kuan-Ching Li

Ethics declarations

The current method was proposed and tested by a group of data scientists from Itaú Unibanco. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy or position of Itaú Unibanco.

Cite this paper

Jalilifard, A., Caridá, V.F., Mansano, A.F., Cristo, R.S., da Fonseca, F.P.C. (2021). Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In: Thampi, S.M., Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, KC. (eds) Advances in Computing and Network Communications. Lecture Notes in Electrical Engineering, vol 736. Springer, Singapore. https://doi.org/10.1007/978-981-33-6987-0_27

DOI: https://doi.org/10.1007/978-981-33-6987-0_27
Published: 13 June 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6986-3
Online ISBN: 978-981-33-6987-0
eBook Packages: Engineering, Engineering (R0)