Abstract
Sarcasm detection has recently emerged as a highly successful tool in social media opinion mining, and machine learning has made accurate detection possible. However, the social media data used to train machine learning models is often ill-suited to the task because its classes are highly imbalanced. In the absence of any thorough study of the effect of imbalanced classes on sarcasm detection for social media opinion mining, this article proposes synthetic minority oversampling based methods to mitigate class imbalance, which can severely affect classifier performance. Five variants of the synthetic minority oversampling technique (SMOTE) are applied to two datasets of different sizes. The trustworthiness of the approach is judged by training and testing six well-known classifiers and measuring their performance with metrics derived from the test-phase confusion matrix. The experimental results indicate that SMOTE and BorderlineSMOTE-1 are extremely successful in improving classifier performance. A thorough analysis is performed to better understand the effect of imbalanced classes on social media sarcasm detection.
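To make the oversampling idea concrete, the following is a minimal sketch of the basic SMOTE interpolation step together with a few confusion-matrix metrics. It is not the authors' implementation: the function names `smote` and `confusion_metrics`, the parameter defaults, and the toy feature vectors are assumptions chosen for illustration, and only plain SMOTE is shown, not the Borderline variants.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: each synthetic point lies on the line segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def confusion_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy 2-D feature vectors standing in for vectorised sarcastic tweets.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=3)
scores = confusion_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

In practice the synthetic points would be added to the training split only, so that the test-phase confusion matrix still reflects the original class distribution.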
About this article
Cite this article
Banerjee, A., Bhattacharjee, M., Ghosh, K. et al. Synthetic minority oversampling in addressing imbalanced sarcasm detection in social media. Multimed Tools Appl 79, 35995–36031 (2020). https://doi.org/10.1007/s11042-020-09138-4
DOI: https://doi.org/10.1007/s11042-020-09138-4