research-article

Scalable text and link analysis with mixed-topic link models

Authors:
Yaojia Zhu, University of New Mexico, Albuquerque, NM, USA
Xiaoran Yan, University of New Mexico, Albuquerque, NM, USA
Lise Getoor, University of Maryland, College Park, MD, USA
Cristopher Moore, Santa Fe Institute, Santa Fe, NM, USA

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2013, Pages 473–481. https://doi.org/10.1145/2487575.2487693

Published: 11 August 2013

ABSTRACT

Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. To perform inference on such data sets, and to make predictions and recommendations, it is useful to have models that capture the processes that generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation: it analyzes a data set with 1.3 million words and 44 thousand links in a few minutes.
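The abstract does not spell out the model's equations, so the following is only a minimal sketch of the kind of expectation-maximization loop it alludes to, restricted to the text side. It uses probabilistic latent semantic analysis (PLSA), one classic topic model, as a stand-in; it does not include the link component and should not be read as the authors' algorithm. The function name `plsa_em` and the variables `theta` (per-document topic mixture) and `beta` (per-topic word distribution) are hypothetical names chosen for illustration.

```python
import math
import random


def plsa_em(counts, num_topics, iters=50, seed=0):
    """Fit a PLSA topic model by EM (illustrative sketch, text only).

    counts[d][w] is the number of times word w occurs in document d.
    Returns (theta, beta, history), where theta[d][z] is the topic mixture
    of document d, beta[z][w] is the word distribution of topic z, and
    history holds the log-likelihood at the start of each iteration.
    """
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])

    def normalized(row):
        total = sum(row)
        if total == 0:  # empty document or unused topic: fall back to uniform
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Strictly positive random initialization.
    theta = [normalized([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]
    beta = [normalized([rng.random() + 0.1 for _ in range(num_words)])
            for _ in range(num_topics)]

    history = []
    for _ in range(iters):
        new_theta = [[0.0] * num_topics for _ in range(num_docs)]
        new_beta = [[0.0] * num_words for _ in range(num_topics)]
        loglik = 0.0
        for d in range(num_docs):
            for w in range(num_words):
                c = counts[d][w]
                if c == 0:
                    continue
                # E-step: posterior over topics for this (document, word) pair.
                joint = [theta[d][z] * beta[z][w] for z in range(num_topics)]
                total = sum(joint)
                loglik += c * math.log(total)
                for z in range(num_topics):
                    resp = c * joint[z] / total
                    new_theta[d][z] += resp
                    new_beta[z][w] += resp
        # M-step: renormalize the expected counts.
        theta = [normalized(row) for row in new_theta]
        beta = [normalized(row) for row in new_beta]
        history.append(loglik)
    return theta, beta, history
```

Each iteration touches only the nonzero word counts, which is why updates of this kind scale with the size of the data rather than with the number of document pairs. A joint text-and-link model would add a second set of responsibilities for the observed links and fold them into the update for `theta`; the exact form of that update is not given in the abstract and is left out here.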

Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis

Published in
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575
Editors: Rayid Ghani, University of Chicago; Ted E. Senator, SAIC; Paul Bradley, MethodCare, Inc.; Rajesh Parekh, Groupon; Jingrui He, Stevens Institute of Technology
General Chairs: Robert L. Grossman, University of Chicago and Open Data Group; Ramasamy Uthurusamy, General Motors Corporation (retired)
Program Chairs: Inderjit S. Dhillon, University of Texas; Yehuda Koren, Google
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 August 2013
Author Tags
document classification
link prediction
stochastic block model
topic modeling
Qualifiers
- research-article
Acceptance Rates
KDD '13 paper acceptance rate: 125 of 726 submissions, 17%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.