
The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media

  • Original Research
  • Published in SN Computer Science

Abstract

Stopwords are scattered throughout documents; they carry little semantic weight in a sentence yet account for a large fraction of the terms in a collection. Removing them from a document therefore tends to improve its language description. In this paper, we explore and evaluate the effect of stopword removal on information retrieval performance for code-mixed social media data in Indian language pairs such as Bengali–English. A considerable amount of research has been performed on sentiment analysis, language identification, and language generation for code-mixed languages; however, no prior work has addressed stopword removal for code-mixed documents, which motivates this work. We compare the impact of corpus-based stopword removal against non-corpus-based stopword removal on information retrieval for code-mixed data, and ask how to find the best stopword list for each constituent language of a code-mixed text. We observed that corpus-based stopword removal improved Mean Average Precision (MAP) by 16% over non-corpus-based stopword removal. For both languages, threshold values on the TF-IDF score were tuned jointly, yielding the optimal stopword list.
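The corpus-based approach summarized above can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: it uses a simple IDF-style score rather than the paper's exact TF-IDF formulation, and the threshold value, function name, and toy code-mixed snippets are all invented for the example.

```python
from collections import Counter
import math

def corpus_stopwords(docs, threshold):
    """Flag high-document-frequency (low-IDF) terms as candidate stopwords.

    `threshold` is a tunable cut-off, standing in for the per-language
    threshold tuning on TF-IDF scores described in the abstract.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # document frequency per term
    # Terms appearing in many documents get a low IDF; treat those
    # below the threshold as corpus-derived stopwords.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return {term for term, score in idf.items() if score < threshold}

# Toy romanized Bengali-English code-mixed snippets; illustrative only.
docs = [
    "ami today khub happy",
    "ami kal movie dekhbo",
    "ami the match dekhechi today",
]
print(sorted(corpus_stopwords(docs, threshold=0.5)))  # -> ['ami', 'today']
```

Here "ami" occurs in every document (IDF 0) and "today" in two of three (IDF ≈ 0.41), so both fall below the cut-off, while rarer content words survive.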



Data availability

Raw data for the dataset are not publicly available in order to preserve individuals’ privacy, as the data were collected from social media. The data can be shared under an agreement for research purposes only; requests should be sent by email.

Notes

  1. http://terrier.org/.

  2. http://www.amitavadas.com/code-mixing.html.

  3. http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop.

  4. https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt.

References

  1. Fox C. A stop list for general text. SIGIR Forum. 1989;24(1–2):19–21. https://doi.org/10.1145/378881.378888.


  2. Myers-Scotton C. Common and uncommon ground: social and structural factors in codeswitching. Lang Soc. 1993;22(4):475–503.


  3. Rowlett P. Franglais. Concise encyclopedia of languages of the world. Amsterdam: Elsevier; 2008.


  4. Barman U, Das A, Wagner J, Foster J. Code mixing: a challenge for language identification in the language of social media. In: Proceedings of the first workshop on computational approaches to code switching, 2014. p. 13–23.

  5. Ganguly D, Bandyopadhyay A, Mitra M, Jones GJF. Retrievability of code mixed microblogs. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval. SIGIR ’16. New York, NY, USA: Association for Computing Machinery; 2016. p. 973–976. https://doi.org/10.1145/2911451.2914727.

  6. Gupta P, Bali K, Banchs RE, Choudhury M, Rosso P. Query expansion for mixed-script information retrieval. In: Proceedings of the 37th international ACM SIGIR conference on research development in information retrieval. SIGIR ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 677–686. https://doi.org/10.1145/2600428.2609622.

  7. Chakma K, Das A. Cmir: a corpus for evaluation of code mixed information retrieval of Hindi–English tweets. Comput Sist. 2016;20:425–34.


  8. Khanuja S, Dandapat S, Srinivasan A, Sitaram S, Choudhury M. GLUECoS : an evaluation benchmark for code-switched NLP, In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020, Online, p. 3575–3585.

  9. Bhattu SN, Nunna SK, Somayajulu DVLN, Pradhan B. Improving code-mixed pos tagging using code-mixed embeddings. ACM Trans Asian Low-Resour Lang Inf Process. 2020. https://doi.org/10.1145/3380967.


  10. Fetahu B, Fang A, Rokhlenko O, Malmasi S. Gazetteer enhanced named entity recognition for code-mixed web queries. In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. SIGIR ’21. New York, NY, USA: Association for Computing Machinery; 2021. p. 1677–1681. https://doi.org/10.1145/3404835.3463102.

  11. Chanda S, Pal S. IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: sentiment analysis for Dravidian languages in code-mixed text. In: FIRE (working notes); 2020. p. 535–540.

  12. Chanda S, Singh RP, Pal S. Is meta embedding better than pre-trained word embedding to perform sentiment analysis for Dravidian languages in code-mixed text? In: FIRE (working notes); 2021. p. 1051–1060.

  13. Saroj A, Chanda S, Pal S. IRlab@IITV at SemEval-2020 task 12: multilingual offensive language identification in social media using SVM. In: Proceedings of the fourteenth workshop on semantic evaluation. Barcelona: International Committee for Computational Linguistics; 2020. p. 2012–2016. https://doi.org/10.18653/v1/2020.semeval-1.265. https://aclanthology.org/2020.semeval-1.265.

  14. Chanda S, Ujjwal S, Das S, Pal S. Fine-tuning pre-trained transformer based model for hate speech and offensive content identification in English, Indo-aryan and code-mixed (English-Hindi) languages. In: FIRE (working notes); 2021. p. 446–458.

  15. Chandu KR, Chinnakotla MK, Black AW, Shrivastava M. Webshodh: A code mixed factoid question answering system for web. In: Conference and labs of the evaluation forum; 2017.

  16. Gupta P, Rosso P, Banchs RE. Encoding transliteration variation through dimensionality reduction: fire shared task on transliterated search, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  17. Joshi H, Bhatt A, Patel H. Transliterated search using syllabification approach, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  18. Pakray P, Bhaskar P. Transliterated search system for Indian languages, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  19. Prakash A, Saha S. A relevance feedback based approach for mixed script transliterated text search: shared task report by BIT Mesra. In: Pre-proceedings of the 6th FIRE-2014 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2014.

  20. Mukherjee A, Ravi A, Datta K. Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 86–90. https://doi.org/10.1145/2824864.2824873.

  21. Ganguly D, Pal S, Jones GJF. Dcu@fire-2014: fuzzy queries with rule-based normalization for mixed script information retrieval. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 80–85.https://doi.org/10.1145/2824864.2824888.

  22. Bhat IA, Mujadia V, Tammewar A, Bhat RA, Shrivastava M. Iiit-h system submission for fire2014 shared task on transliterated search. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery;2014. p. 48–53. https://doi.org/10.1145/2824864.2824872.

  23. Londhe N, Srihari R. Exploiting named entity mentions towards code mixed IR: working notes for the UB system submission for MSIR@FIRE-2016. In: Proceedings of FIRE (working notes); 2016. p. 105–108.

  24. Barathi BG, Kumar MA, Soman KP. Distributional semantic representation for text classification and information retrieval. In: Proceedings of FIRE (working notes); 2016. p. 126–130.

  25. Singh S, Kumar MA, Soman KP. CEN@Amrita: information retrieval on code-mixed Hindi–English tweets using vector space models. In: Proceedings of FIRE (working notes); 2016. p. 131–134.

  26. Banerjee S, Chakma K, Naskar SK, Das A, Rosso P, Bandyopadhyay S, Choudhury M. Overview of the mixed script information retrieval (MSIR). In: Proceedings of FIRE 2016. FIRE; 2016. https://www.microsoft.com/en-us/research/publication/overview-mixed-script-information-retrieval-msir/.

  27. Lo RT-W, He B, Ounis I. Automatically building a stopword list for an information retrieval system. J Digit Inf Manage. 2005;3:3–8.


  28. Manning CD, Schütze H. Foundations of statistical natural language processing. Cambridge, MA: MIT Press; 1999.

  29. Konchady M. Text mining application programming. Boston, MA: Charles River Media; 2006.

  30. Ayral H, Yavuz S. An automated domain specific stop word generation method for natural language text classification. In: 2011 international symposium on innovations in intelligent systems and applications; 2011. p. 500–503.

  31. Khalifa C, Rayner A. An automatic construction of malay stop words based on aggregation method. In: International conference on soft computing in data science; 2016. p. 180–189.

  32. Sahu SS, Pal S. Effect of stopwords in Indian language IR. Sādhanā. 2022;47.

  33. Amati G, Van Rijsbergen CJ. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans Inf Syst. 2002;20(4):357–89. https://doi.org/10.1145/582415.582416.


  34. Lewis DD, Yang Y, Rose TG, Li F. RCV1: a new benchmark collection for text categorization research. J Mach Learn Res. 2004;5:361–97.



Funding

This study was funded by IIT (BHU), Varanasi, India.

Author information


Corresponding author

Correspondence to Supriya Chanda.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.

Appendix

The first stopword list is based on the SMART (System for the Mechanical Analysis and Retrieval of Text) IR system, developed at Cornell University in the 1960s. The English stopword list is taken from online appendix 11 of Lewis et al. [34] (see note 3).

Another stopword list is taken from the Terrier GitHub repository (see note 4) and is based on the Terrier IR system (Table 5).

Table 5 The corpus-based stopword list
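For the non-corpus-based baseline, lists such as the SMART and Terrier ones are simply loaded and applied at tokenization time. The Python sketch below is hypothetical (the helper names and the tiny stand-in lists are not from the paper); it only illustrates combining one list per constituent language before filtering code-mixed text:

```python
def load_stopwords(path):
    """Read a plain-text stopword list, one term per line (the format used
    by the SMART and Terrier lists referenced in notes 3 and 4)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(text, stopwords):
    """Drop stopword tokens before indexing or retrieval."""
    return " ".join(t for t in text.split() if t.lower() not in stopwords)

# Combine an English list with a romanized Bengali list so that both
# constituent languages of the code-mixed text are covered.
english_sw = {"the", "is", "a"}     # stand-in for the SMART/Terrier list
bengali_sw = {"ami", "tumi", "ki"}  # stand-in for a Bengali list
combined = english_sw | bengali_sw
print(remove_stopwords("ami the movie dekhbo", combined))  # -> "movie dekhbo"
```

Taking the union of the per-language lists is the simplest way to cover both languages at once; the corpus-based alternative evaluated in the paper derives a single list directly from the code-mixed collection instead.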

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chanda, S., Pal, S. The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media. SN COMPUT. SCI. 4, 494 (2023). https://doi.org/10.1007/s42979-023-01942-7

