Multimodular Text Normalization of Dutch User-Generated Content

Published: 07 July 2016
Abstract

As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.


                    • Published in

                      ACM Transactions on Intelligent Systems and Technology  Volume 7, Issue 4
                      Special Issue on Crowd in Intelligent Systems, Research Note/Short Paper and Regular Papers
                      July 2016
                      498 pages
                      ISSN:2157-6904
                      EISSN:2157-6912
                      DOI:10.1145/2906145
                      • Editor: Yu Zheng

                      Copyright © 2016 ACM

                      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Publication History

                      • Published: 7 July 2016
                      • Accepted: 1 November 2015
                      • Revised: 1 August 2015
                      • Received: 1 November 2014

                      Qualifiers

                      • research-article
                      • Research
                      • Refereed