short-paper

Improving LDA topic models for microblogs via tweet pooling and automatic labeling

Authors:
Rishabh Mehrotra (BITS Pilani, Pilani, India)
Scott Sanner (NICTA & ANU, Canberra, Australia)
Wray Buntine (NICTA & ANU, Canberra, Australia)
Lexing Xie (ANU & NICTA, Canberra, Australia)

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013, Pages 889–892
https://doi.org/10.1145/2484028.2484166

Published: 28 July 2013

ABSTRACT

Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, the topics they learn are often less coherent when they are applied to microblog content such as tweets. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures of topic coherence across three diverse Twitter datasets, in comparison to both an unmodified LDA baseline and a variety of other pooling schemes. An additional contribution, automatic hashtag labeling, further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
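The hashtag pooling idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact procedure, so everything below is an assumption made for illustration: the function name `pool_by_hashtag`, the regular expression used to find hashtags, the choice to add a tweet with several hashtags to every matching pool, and the choice to leave tweets without hashtags as single-tweet documents. The LDA step itself is not shown; the pooled strings would simply be passed to an unmodified LDA implementation as its input documents.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern for this sketch: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Aggregate tweets into one pseudo-document per hashtag.

    Assumptions of this sketch (not taken from the paper):
    - hashtags are compared case-insensitively;
    - a tweet with several hashtags joins every matching pool;
    - a tweet with no hashtag is returned separately, unpooled.
    """
    pools = defaultdict(list)
    unpooled = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            for tag in tags:
                pools[tag].append(text)
        else:
            unpooled.append(text)
    # Each pool becomes a single long document for the topic model.
    pooled_docs = {tag: " ".join(texts) for tag, texts in pools.items()}
    return pooled_docs, unpooled


# Hypothetical example tweets, invented for illustration only.
tweets = [
    "Great keynote on topic models #sigir2013",
    "LDA on short text is hard #SIGIR2013 #nlp",
    "Coffee first, then code",
    "New tokenizer release #nlp",
]

pooled_docs, unpooled = pool_by_hashtag(tweets)
for tag in sorted(pooled_docs):
    print(tag, "->", pooled_docs[tag])
print("unpooled:", unpooled)
```

The point of the preprocessing step is that each pooled document is much longer than a single tweet, which gives LDA more word co-occurrence evidence per document while leaving the model itself untouched.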


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering


Published in
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028
General Chairs: Gareth J.F. Jones (Dublin City University, Ireland), Páraic Sheridan (Dublin City University, Ireland)
Program Chairs: Diane Kelly (University of North Carolina, Chapel Hill, USA), Maarten de Rijke (University of Amsterdam, The Netherlands), Tetsuya Sakai (Microsoft Research Asia, China)
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
lda
microblogs
topic modeling
Qualifiers
- short-paper
Acceptance Rates
SIGIR '13 Paper Acceptance Rate: 73 of 366 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%