Abstract
Keyword extraction has received increasing attention as a research topic that can lead to advances in diverse applications such as document context categorization, text indexing, and document classification. In this paper we propose STF-IDF, a novel semantic method based on TF-IDF, for scoring the importance of words in informal documents in a corpus. A set of nearly four million documents from health-care social media was collected and used to train a semantic model and obtain word embeddings. The features of the semantic space were then used to adjust the original TF-IDF scores through an iterative solution, so as to improve the moderate performance of this algorithm on informal texts. In tests on 160 randomly chosen documents, the proposed method reduced the mean error rate of TF-IDF by about 50%, reaching a mean error of 13.7% as opposed to 27.2% for the original TF-IDF.
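For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency), not the authors' implementation and not STF-IDF itself; the function name `tf_idf` and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical informal posts, tokenized on whitespace.
docs = [
    "my doctor changed my insulin dose".split(),
    "the doctor said the dose is fine".split(),
    "anyone else get headaches from this med".split(),
]

scores = tf_idf(docs)
# Highest-scoring term of the first document.
top = max(scores[0], key=scores[0].get)
print(top, round(scores[0][top], 3))
```

In this toy corpus the top-ranked term of the first post is "my", simply because it occurs twice and in no other post, while "insulin" scores half as much. Purely frequency-based scores can rank uninformative words highly in short informal texts, which is the kind of behaviour a semantic adjustment of the scores is meant to address.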
Ethics declarations
The current method was proposed and tested by a group of data scientists from Itaú Unibanco. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy or position of Itaú Unibanco.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jalilifard, A., Caridá, V.F., Mansano, A.F., Cristo, R.S., da Fonseca, F.P.C. (2021). Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In: Thampi, S.M., Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, KC. (eds) Advances in Computing and Network Communications. Lecture Notes in Electrical Engineering, vol 736. Springer, Singapore. https://doi.org/10.1007/978-981-33-6987-0_27
DOI: https://doi.org/10.1007/978-981-33-6987-0_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6986-3
Online ISBN: 978-981-33-6987-0
eBook Packages: Engineering, Engineering (R0)