DOI: 10.1145/2484028.2484166
short-paper

Improving LDA topic models for microblogs via tweet pooling and automatic labeling

Published: 28 July 2013

ABSTRACT

Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, the topics they learn are often less coherent when the models are applied to microblog content such as tweets. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of other pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
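
The abstract describes hashtag pooling only at a high level, so the following Python sketch is one plausible reading of that preprocessing step, not the authors' implementation. The function name pool_by_hashtag, the hashtag pattern, and the handling of edge cases (lowercasing hashtags, adding a multi-hashtag tweet to every matching pool, keeping hashtag-free tweets as single-tweet documents) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Aggregate tweets into pseudo-documents, one per hashtag.

    Assumptions (not taken from the paper): hashtags are lowercased,
    a tweet with several hashtags joins every matching pool, and
    tweets without hashtags are kept as single-tweet documents.
    """
    pools = defaultdict(list)
    unpooled = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            for tag in tags:
                pools[tag].append(text)
        else:
            unpooled.append(text)
    # One pooled document per hashtag, in sorted hashtag order,
    # followed by the tweets that could not be pooled.
    pooled = [" ".join(texts) for _, texts in sorted(pools.items())]
    return pooled + unpooled


sample_tweets = [
    "Great talk on topic models #sigir",
    "LDA on short text #SIGIR #nlp",
    "coffee time",
]
for doc in pool_by_hashtag(sample_tweets):
    print(doc)
```

The resulting list of pooled documents would then replace the individual tweets as the input corpus to an unmodified LDA implementation, which matches the abstract's point that pooling is a preprocessing step and leaves the model itself untouched.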

        • Published in

          SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
          July 2013
          1188 pages
          ISBN: 9781450320344
          DOI: 10.1145/2484028

          Copyright © 2013 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          SIGIR '13 paper acceptance rate: 73 of 366 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.
