
The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media

  • Original Research
  • Published in SN Computer Science

Abstract

Stopwords are scattered throughout documents; they carry little semantic weight in a sentence yet account for a large fraction of the terms in a collection. Removing them from a document therefore tends to improve its language description. In this paper, we explore and evaluate the effect of stopword removal on information retrieval performance for code-mixed social media data in Indian language pairs such as Bengali–English. A considerable amount of research has been performed on sentiment analysis, language identification, and language generation for code-mixed languages; however, no prior work has addressed stopword removal for code-mixed documents, which motivates this work. We compare the impact of corpus-based stopword removal against non-corpus-based stopword removal on information retrieval for code-mixed data, and ask how to find the best stopword list for each constituent language of a code-mixed text. We observed that corpus-based stopword removal improved Mean Average Precision (MAP) by 16% over non-corpus-based stopword removal. For both languages, threshold values on the TF-IDF score were tuned jointly, yielding the optimal stopword list.
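The corpus-based approach summarized above can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: it uses a simple IDF-style score rather than the paper's exact TF-IDF formulation, and the threshold value, function name, and toy code-mixed snippets are all invented for the example.

```python
from collections import Counter
import math

def corpus_stopwords(docs, threshold):
    """Flag high-document-frequency (low-IDF) terms as candidate stopwords.

    `threshold` is a tunable cut-off, standing in for the per-language
    threshold tuning on TF-IDF scores described in the abstract.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # document frequency per term
    # Terms appearing in many documents get a low IDF; treat those
    # below the threshold as corpus-derived stopwords.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return {term for term, score in idf.items() if score < threshold}

# Toy romanized Bengali-English code-mixed snippets; illustrative only.
docs = [
    "ami today khub happy",
    "ami kal movie dekhbo",
    "ami the match dekhechi today",
]
print(sorted(corpus_stopwords(docs, threshold=0.5)))  # -> ['ami', 'today']
```

Here "ami" occurs in every document (IDF 0) and "today" in two of three (IDF ≈ 0.41), so both fall below the cut-off, while rarer content words survive.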



Data availability

Raw data for the dataset are not publicly available in order to preserve individuals’ privacy, as the data were collected from social media. The data can be shared under an agreement for research purposes only; requests should be sent by email.

Notes

  1. http://terrier.org/.

  2. http://www.amitavadas.com/code-mixing.html.

  3. http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop.

  4. https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt.

References

  1. Fox C. A stop list for general text. SIGIR Forum. 1989;24(1–2):19–21. https://doi.org/10.1145/378881.378888.


  2. Myers-Scotton C. Common and uncommon ground: social and structural factors in codeswitching. Lang Soc. 1993;22(4):475–503.


  3. Rowlett P. Franglais. Concise encyclopedia of languages of the world. Amsterdam: Elsevier; 2008.


  4. Barman U, Das A, Wagner J, Foster J. Code mixing: a challenge for language identification in the language of social media. In: Proceedings of the first workshop on computational approaches to code switching, 2014. p. 13–23.

  5. Ganguly D, Bandyopadhyay A, Mitra M, Jones GJF. Retrievability of code mixed microblogs. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval. SIGIR ’16. New York, NY, USA: Association for Computing Machinery; 2016. p. 973–976. https://doi.org/10.1145/2911451.2914727.

  6. Gupta P, Bali K, Banchs RE, Choudhury M, Rosso P. Query expansion for mixed-script information retrieval. In: Proceedings of the 37th international ACM SIGIR conference on research development in information retrieval. SIGIR ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 677–686. https://doi.org/10.1145/2600428.2609622.

  7. Chakma K, Das A. Cmir: a corpus for evaluation of code mixed information retrieval of Hindi–English tweets. Comput Sist. 2016;20:425–34.


  8. Khanuja S, Dandapat S, Srinivasan A, Sitaram S, Choudhury M. GLUECoS : an evaluation benchmark for code-switched NLP, In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020, Online, p. 3575–3585.

  9. Bhattu SN, Nunna SK, Somayajulu DVLN, Pradhan B. Improving code-mixed pos tagging using code-mixed embeddings. ACM Trans Asian Low-Resour Lang Inf Process. 2020. https://doi.org/10.1145/3380967.


  10. Fetahu B, Fang A, Rokhlenko O, Malmasi S. Gazetteer enhanced named entity recognition for code-mixed web queries. In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. SIGIR ’21. New York, NY, USA: Association for Computing Machinery; 2021. p. 1677–1681. https://doi.org/10.1145/3404835.3463102.

  11. Chanda S, Pal S. IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: sentiment analysis for Dravidian languages in code-mixed text. In: FIRE (working notes); 2020. p. 535–540.

  12. Chanda S, Singh RP, Pal S. Is meta embedding better than pre-trained word embedding to perform sentiment analysis for Dravidian languages in code-mixed text? In: FIRE (working notes); 2021. p. 1051–1060.

  13. Saroj A, Chanda S, Pal S. IRlab@IITV at SemEval-2020 task 12: multilingual offensive language identification in social media using SVM. In: Proceedings of the fourteenth workshop on semantic evaluation. Barcelona: International Committee for Computational Linguistics; 2020. p. 2012–2016. https://doi.org/10.18653/v1/2020.semeval-1.265. https://aclanthology.org/2020.semeval-1.265.

  14. Chanda S, Ujjwal S, Das S, Pal S. Fine-tuning pre-trained transformer based model for hate speech and offensive content identification in English, Indo-aryan and code-mixed (English-Hindi) languages. In: FIRE (working notes); 2021. p. 446–458.

  15. Chandu KR, Chinnakotla MK, Black AW, Shrivastava M. Webshodh: A code mixed factoid question answering system for web. In: Conference and labs of the evaluation forum; 2017.

  16. Gupta P, Rosso P, Banchs RE. Encoding transliteration variation through dimensionality reduction: fire shared task on transliterated search, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  17. Joshi H, Bhatt A, Patel H. Transliterated search using syllabification approach, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  18. Pakray P, Bhaskar P. Transliterated search system for Indian languages, In: Pre-processing of the FIRE-2013 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2013.

  19. Prakash A, Saha S. A relevance feedback based approach for mixed script transliterated text search: shared task report by BIT Mesra. In: Pre-proceedings of the 6th FIRE-2014 Workshop, Forum for Information Retrieval Evaluation (FIRE); 2014.

  20. Mukherjee A, Ravi A, Datta K. Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 86–90. https://doi.org/10.1145/2824864.2824873.

  21. Ganguly D, Pal S, Jones GJF. Dcu@fire-2014: fuzzy queries with rule-based normalization for mixed script information retrieval. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery; 2014. p. 80–85.https://doi.org/10.1145/2824864.2824888.

  22. Bhat IA, Mujadia V, Tammewar A, Bhat RA, Shrivastava M. Iiit-h system submission for fire2014 shared task on transliterated search. In: Proceedings of the forum for information retrieval evaluation. FIRE ’14. New York, NY, USA: Association for Computing Machinery;2014. p. 48–53. https://doi.org/10.1145/2824864.2824872.

  23. Londhe N, Srihari R. Exploiting named entity mentions towards code mixed IR: working notes for the UB system submission for MSIR@FIRE-2016. In: Proceedings of FIRE (working notes); 2016. p. 105–108.

  24. Barathi BG, Kumar MA, Soman KP. Distributional semantic representation for text classification and information retrieval. In: Proceedings of FIRE (working notes); 2016. p. 126–130.

  25. Singh S, Kumar MA, Soman KP. CEN@Amrita: information retrieval on code-mixed Hindi–English tweets using vector space models. In: Proceedings of FIRE (working notes); 2016. p. 131–134.

  26. Banerjee S, Chakma K, Naskar SK, Das A, Rosso P, Bandyopadhyay S, Choudhury M. Overview of the mixed script information retrieval (MSIR). In: Proceedings of FIRE 2016. FIRE; 2016. https://www.microsoft.com/en-us/research/publication/overview-mixed-script-information-retrieval-msir/.

  27. Lo RT-W, He B, Ounis I. Automatically building a stopword list for an information retrieval system. J Digit Inf Manage. 2005;3:3–8.


  28. Manning CD, Schütze H. Foundations of statistical natural language processing. Cambridge, MA: MIT Press; 1999.

  29. Konchady M. Text mining application programming. Boston, MA: Charles River Media; 2006.

  30. Ayral H, Yavuz S. An automated domain specific stop word generation method for natural language text classification. In: 2011 international symposium on innovations in intelligent systems and applications; 2011. p. 500–503.

  31. Khalifa C, Rayner A. An automatic construction of malay stop words based on aggregation method. In: International conference on soft computing in data science; 2016. p. 180–189.

  32. Sahu SS, Pal S. Effect of stopwords in Indian language IR. Sādhanā. 2022;47.

  33. Amati G, Van Rijsbergen CJ. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans Inf Syst. 2002;20(4):357–89. https://doi.org/10.1145/582415.582416.


  34. Lewis DD, Yang Y, Rose TG, Li F. RCV1: a new benchmark collection for text categorization research. J Mach Learn Res. 2004;5:361–97.



Funding

This study was funded by IIT (BHU), Varanasi, India.

Author information


Corresponding author

Correspondence to Supriya Chanda.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.

Appendix

The first stopword list is based on the SMART (System for the Mechanical Analysis and Retrieval of Text) IR system, developed at Cornell University in the 1960s. The English stopword list is taken from online appendix 11 of Lewis et al. [34] (see note 3).

Another stopword list is taken from the Terrier GitHub repository (see note 4) and is based on the Terrier IR system (Table 5).

Table 5 The corpus-based stopword list
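For the non-corpus-based baseline, lists such as the SMART and Terrier ones are simply loaded and applied at tokenization time. The Python sketch below is hypothetical (the helper names and the tiny stand-in lists are not from the paper); it only illustrates combining one list per constituent language before filtering code-mixed text:

```python
def load_stopwords(path):
    """Read a plain-text stopword list, one term per line (the format used
    by the SMART and Terrier lists referenced in notes 3 and 4)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(text, stopwords):
    """Drop stopword tokens before indexing or retrieval."""
    return " ".join(t for t in text.split() if t.lower() not in stopwords)

# Combine an English list with a romanized Bengali list so that both
# constituent languages of the code-mixed text are covered.
english_sw = {"the", "is", "a"}     # stand-in for the SMART/Terrier list
bengali_sw = {"ami", "tumi", "ki"}  # stand-in for a Bengali list
combined = english_sw | bengali_sw
print(remove_stopwords("ami the movie dekhbo", combined))  # -> "movie dekhbo"
```

Taking the union of the per-language lists is the simplest way to cover both languages at once; the corpus-based alternative evaluated in the paper derives a single list directly from the code-mixed collection instead.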

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chanda, S., Pal, S. The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media. SN COMPUT. SCI. 4, 494 (2023). https://doi.org/10.1007/s42979-023-01942-7

