Abstract
Social media is a valuable source of data for a wide range of applications, which creates a need for tools that can process such data. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, which are typically trained on standard-language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach that accounts for the diversity of normalization issues encountered in user-generated content (UGC). We consider three types of UGC written in Dutch: social networking site (SNS) posts, SMS messages, and tweets. We provide a detailed analysis of the performance of the individual modules and of the overall system. We also perform an extrinsic evaluation by measuring the performance of a part-of-speech tagger, a lemmatizer, and a named-entity recognizer before and after normalization.
Index Terms
- Multimodular Text Normalization of Dutch User-Generated Content