research-article

Multimodular Text Normalization of Dutch User-Generated Content

Authors:
Sarah Schulz, Ghent University, Gent, Belgium (ORCID: 0000-0001-9069-471X)
Guy De Pauw, University of Antwerp, Antwerp, Belgium
Orphée De Clercq, Ghent University, Gent, Belgium
Bart Desmet, Ghent University, Gent, Belgium
Véronique Hoste, Ghent University, Gent, Belgium
Walter Daelemans, University of Antwerp, Antwerp, Belgium
Lieve Macken, Ghent University, Gent, Belgium

ACM Transactions on Intelligent Systems and Technology, Volume 7, Issue 4, Article No. 61, pp. 1–22. https://doi.org/10.1145/2850422

Published: 07 July 2016

Abstract

Social media is a valuable data source for a wide range of applications, which creates a need for tools that can handle such data. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS posts, SMS messages, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also perform an extrinsic evaluation by measuring the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.
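
To make the idea of a multimodular normalizer concrete, the following is a minimal sketch and not the system described in the article. It assumes two hypothetical modules (an abbreviation lookup and a character-flooding reducer), a tiny illustrative Dutch lookup table, and a fixed priority order standing in for whatever decision step a real system would use to choose among module suggestions. All function and variable names are invented for illustration.

```python
import re
from typing import Callable, List, Optional

# Illustrative lookup table (assumption): a real system would use a much
# larger resource; these entries are not taken from the article.
ABBREVIATIONS = {"idd": "inderdaad", "wrm": "waarom"}


def expand_abbreviation(token: str) -> Optional[str]:
    """Hypothetical module 1: expand a known chat abbreviation."""
    return ABBREVIATIONS.get(token.lower())


def squeeze_flooding(token: str) -> Optional[str]:
    """Hypothetical module 2: reduce runs of 3+ identical characters to one.

    This is deliberately naive: it would turn "heeeel" into "hel" rather
    than "heel", which is one reason a single rule is not sufficient.
    """
    squeezed = re.sub(r"(.)\1{2,}", r"\1", token)
    return squeezed if squeezed != token else None


# Modules are tried in priority order (an assumption of this sketch).
MODULES: List[Callable[[str], Optional[str]]] = [
    expand_abbreviation,
    squeeze_flooding,
]


def normalize_token(token: str) -> str:
    """Return the first module suggestion, or the token unchanged."""
    for module in MODULES:
        suggestion = module(token)
        if suggestion is not None:
            return suggestion
    return token


def normalize(text: str) -> str:
    """Normalize a whitespace-tokenized message token by token."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("jaaaa idd wrm niet"))  # ja inderdaad waarom niet
```

Each module handles one kind of normalization issue and stays silent otherwise, so new modules can be added to the list without touching the existing ones. An extrinsic evaluation in the spirit of the abstract would then run a downstream tool on both the raw and the normalized text and compare its scores.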

Published in
ACM Transactions on Intelligent Systems and Technology Volume 7, Issue 4
Special Issue on Crowd in Intelligent Systems, Research Note/Short Paper and Regular Papers
July 2016
498 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2906145
Editor: Yu Zheng, Microsoft Research, China
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 July 2016
- Accepted: 1 November 2015
- Revised: 1 August 2015
- Received: 1 November 2014
Author Tags
Social media
text normalization
user-generated content
Qualifiers
- research-article
- Research
- Refereed