Abstract
Due to the significant increase in communications between individuals via social media (Facebook, Twitter, LinkedIn) and electronic formats (email, web, e-publication) over the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking the contents into account. This paper introduces the stochastic topic block model, a probabilistic model for networks with textual edges. We address the problem of discovering meaningful clusters of vertices that are coherent with respect to both the network interactions and the text contents. A classification variational expectation-maximization (C-VEM) algorithm is proposed to perform inference. Simulated datasets are considered in order to assess the proposed approach and to highlight its main features. Finally, we demonstrate the effectiveness of our methodology on two real-world datasets: a directed communication network and an undirected co-authorship network.
References
Airoldi, E., Blei, D., Fienberg, S., Xing, E.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp. 267–281 (1973)
Ambroise, C., Grasseau, G., Hoebeke, M., Latouche, P., Miele, V., Picard, F.: The mixer R package (version 1.8) (2010). http://cran.r-project.org/web/packages/mixer/
Bickel, P., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl Acad. Sci. 106(50), 21068–21073 (2009)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intel. 7, 719–725 (2000)
Biernacki, C., Celeux, G., Govaert, G.: Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 41(3–4), 561–575 (2003)
Bilmes, J.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Int. Comput. Sci. Inst. 4, 126 (1998)
Blei, D., Lafferty, J.: Correlated topic models. Adv. Neural Inf. Process. Syst. 18, 147 (2006)
Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. 10, 10008–10020 (2008)
Bouveyron, C., Latouche, P., Zreik, R.: The dynamic random subgraph model for the clustering of evolving networks. Comput. Stat. (2016)
Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Q. 2(1), 73–82 (1991)
Chang, J., Blei, D.M.: Relational topic models for document networks. In: International Conference on Artificial Intelligence and Statistics, pp. 81–88 (2009)
Côme, E., Randriamanamihaga, A., Oukhellou, L., Aknin, P.: Spatio-temporal analysis of dynamic origin-destination data using latent Dirichlet allocation: application to the Vélib' bike sharing system of Paris. In: Proceedings of 93rd Annual Meeting of the Transportation Research Board (2014)
Côme, E., Latouche, P.: Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. Stat. Model. doi:10.1177/1471082X15577017 (2015)
Daudin, J.-J., Picard, F., Robin, S.: A mixture model for random graphs. Stat. Comput. 18(2), 173–183 (2008)
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391 (1990)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)
Fienberg, S., Wasserman, S.: Categorical data analysis of single sociometric relations. Sociol. Methodol. 12, 156–192 (1981)
Girvan, M., Newman, M.: Community structure in social and biological networks. Proc. Natl Acad. Sci. 99(12), 7821 (2002)
Gormley, I.C., Murphy, T.B.: A mixture of experts latent position cluster model for social network data. Stat. Methodol. 7(3), 385–405 (2010)
Grün, B., Hornik, K.: The topicmodels R package (version 0.2-3). http://cran.r-project.org/web/packages/topicmodels/ (2013)
Handcock, M., Raftery, A., Tantrum, J.: Model-based clustering for social networks. J. R. Stat. Soc. A 170(2), 301–354 (2007)
Hathaway, R.: Another interpretation of the EM algorithm for mixture distributions. Stat. Prob. Lett. 4(2), 53–56 (1986)
Hoff, P., Raftery, A., Handcock, M.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
Hofman, J., Wiggins, C.: Bayesian approach to network modularity. Phys. Rev. Lett. 100(25), 258701 (2008)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57. ACM, New York (1999)
Jernite, Y., Latouche, P., Bouveyron, C., Rivera, P., Jegou, L., Lamassé, S.: The random subgraph model for the analysis of an ecclesiastical network in Merovingian Gaul. Ann. Appl. Stat. 8(1), 55–74 (2014)
Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. Proc. Natl Conf. Artif. Intell. 21, 381–391 (2006)
Latouche, P., Birmelé, E., Ambroise, C.: Overlapping stochastic block models with application to the French political blogosphere. Ann. Appl. Stat. 5(1), 309–336 (2011)
Latouche, P., Birmelé, E., Ambroise, C.: Variational Bayesian inference and complexity control for stochastic block models. Stat. Model. 12(1), 93–115 (2012)
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178. IEEE, Piscataway (2006)
Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th annual international conference on machine learning, pp. 665–672. ACM, New York (2009)
Mariadassou, M., Robin, S., Vacher, C.: Uncovering latent structure in valued graphs: a variational approach. Ann. Appl. Stat. 4(2), 715–742 (2010)
Matias, C., Miele, V.: Statistical clustering of temporal networks through a dynamic stochastic block model. Preprint HAL. n.01167837 (2016)
Matias, C., Robin, S.: Modeling heterogeneity in random graphs through latent space models: a selective review. Esaim Proc. Surv. 47, 55–74 (2014)
McDaid, A., Murphy, T., Friel, N., Hurley, N.: Improved Bayesian inference for the stochastic block model with application to large networks. Comput. Stat. Data Anal. 60, 12–31 (2013)
McCallum, A., Corrada-Emmanuel, A., Wang, X.: The author-recipient-topic model for topic and role discovery in social networks, with application to Enron and academic email. In: Workshop on Link Analysis, Counterterrorism and Security, pp. 33–44 (2005)
Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004)
Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39(2–3), 103–134 (2000)
Nowicki, K., Snijders, T.: Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)
Papadimitriou, C., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing: a probabilistic analysis. In: Proceedings of the tenth ACM PODS, pp. 159–168. ACM, New York (1998)
Pathak, N., DeLong, C., Banerjee, A., Erickson, K.: Social topic models for community extraction. In: The 2nd SNA-KDD workshop, vol. 8. Citeseer (2008)
Rand, W.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487–494. AUAI Press, Arlington (2004)
Sachan, M., Contractor, D., Faruquie, T., Subramaniam, L.: Using content and interactions for discovering communities in social networks. In: Proceedings of the 21st international conference on World Wide Web, pp. 331–340. ACM, New York (2012)
Salter-Townshend, M., White, A., Gollini, I., Murphy, T.B.: Review of statistical network analysis: models, algorithms, and software. Stat. Anal. Data Min. 5(4), 243–264 (2012)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Steyvers, M., Smyth, P., Rosen-Zvi, M., Griffiths, T.: Probabilistic author-topic models for information discovery. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 306–315. ACM, New York (2004)
Sun, Y., Han, J., Gao, J., Yu, Y.: iTopicModel: information network-integrated topic modeling. In: Ninth IEEE International Conference on Data Mining, 2009. ICDM'09, pp. 493–502. IEEE, Piscataway (2009)
Teh, Y., Newman, D., Welling, M.: A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Adv. Neural Inf. Process. Syst. 18, 1353–1360 (2006)
Than, K., Ho, T.: Fully sparse topic models. Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science. vol. 7523, pp. 490–505. Springer, Berlin (2012)
Wang, Y., Wong, G.: Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 82, 8–19 (1987)
White, H., Boorman, S., Breiger, R.: Social structure from multiple networks. I. Blockmodels of roles and positions. Am. J. Sociol. 81, 730–780 (1976)
Xu, K., Hero III, A.: Dynamic stochastic blockmodels: statistical models for time-evolving networks. In: Social Computing, Behavioral-Cultural Modeling and Prediction, pp. 201–210. Springer, Berlin (2013)
Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks: a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)
Zanghi, H., Ambroise, C., Miele, V.: Fast online graph clustering via Erdős–Rényi mixture. Pattern Recognit. 41, 3592–3599 (2008)
Zanghi, H., Volant, S., Ambroise, C.: Clustering based on random graph model embedding vertex features. Pattern Recognit. Lett. 31(9), 830–836 (2010)
Zhou, D., Manavoglu, E., Li, J., Giles, C., Zha, H.: Probabilistic models for discovering e-communities. In: Proceedings of the 15th international conference on World Wide Web, pp. 173–182. ACM, New York (2006)
Acknowledgments
The authors thank the editor and the two reviewers for their helpful remarks on the first version of this paper, and Laurent Bergé for his kind suggestions and the development of visualization tools.
Appendix
1.1 Appendix 1: Optimization of R(Z)
The VEM update step for each distribution \(R(Z_{ij}^{dn})\), with \(A_{ij}=1\), is given by
where all terms that do not depend on \(Z_{ij}^{dn}\) have been put into the constant term \(\mathrm {const}\). Moreover, \(\psi (\cdot )\) denotes the digamma function. The functional form of a multinomial distribution is then recognized in (9)
where
\(\phi _{ij}^{dnk}\) is the (approximate) posterior probability of word \(W_{ij}^{dn}\) being drawn from topic k.
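The equation itself is not reproduced in this version of the text. As a hedged reconstruction, the standard LDA-style variational update, with topic proportions \(\theta_{qr}\) indexed by cluster pairs as in Sect. 2.4, would give a form such as (the exact indexing in the original derivation may differ):

```latex
\phi_{ij}^{dnk} \;\propto\;
\left( \prod_{v=1}^{V} \beta_{kv}^{\,W_{ij}^{dnv}} \right)
\exp\!\left( \sum_{q,r=1}^{Q} Y_{iq} Y_{jr}
\left[ \psi(\gamma_{qrk}) - \psi\!\Big( \sum_{l=1}^{K} \gamma_{qrl} \Big) \right] \right),
\qquad \sum_{k=1}^{K} \phi_{ij}^{dnk} = 1 .
```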
1.2 Appendix 2: Optimization of \(R(\theta )\)
The VEM update step for distribution \(R(\theta )\) is given by
We recognize the functional form of a product of Dirichlet distributions
where
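The Dirichlet parameters are not reproduced here; in the standard variational treatment they would take a form such as (with \(\alpha\) the Dirichlet prior parameter, as an assumed notation):

```latex
\gamma_{qrk} \;=\; \alpha_k \;+\; \sum_{i \neq j} A_{ij}\, Y_{iq} Y_{jr}
\sum_{d=1}^{D_{ij}} \sum_{n=1}^{N_{ij}^{d}} \phi_{ij}^{dnk} .
```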
1.3 Appendix 3: Derivation of the lower bound \(\tilde{{\mathcal {L}}}\left( R(\cdot ); Y, \beta \right) \)
The lower bound \(\tilde{{\mathcal {L}}}\left( R(\cdot ); Y, \beta \right) \) in (7) is given by
1.4 Appendix 4: Optimization of \(\beta \)
In order to maximize the lower bound \(\tilde{{\mathcal {L}}}\left( R(\cdot ); Y, \beta \right) \), we isolate the terms in (10) that depend on \(\beta \) and add Lagrange multipliers to satisfy the constraints \(\sum _{v=1}^{V}\beta _{kv}=1,\forall k\)
Setting the derivative with respect to \(\beta _{kv}\) to zero, we find
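The resulting estimator is not reproduced in this version; a plausible form, by analogy with the standard LDA M-step for \(\beta\), is:

```latex
\beta_{kv} \;\propto\; \sum_{i \neq j} A_{ij}
\sum_{d=1}^{D_{ij}} \sum_{n=1}^{N_{ij}^{d}} \phi_{ij}^{dnk}\, W_{ij}^{dnv},
\qquad \text{normalized so that } \sum_{v=1}^{V} \beta_{kv} = 1 .
```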
1.5 Appendix 5: Optimization of \(\rho \)
Only the distribution \(p(Y|\rho )\) in the complete data log-likelihood \(\log p(A, Y|\rho , \pi )\) depends on the parameter vector \(\rho \) of cluster proportions. Taking the log and adding a Lagrange multiplier to satisfy the constraint \(\sum _{q=1}^{Q}\rho _{q}=1\), we have
Setting the derivative with respect to \(\rho \) to zero, we find
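The estimator itself is not reproduced here; the familiar form for hard cluster assignments would be:

```latex
\hat{\rho}_q \;=\; \frac{1}{M} \sum_{i=1}^{M} Y_{iq} .
```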
1.6 Appendix 6: Optimization of \(\pi \)
Only the distribution \(p(A|Y, \pi )\) in the complete data log-likelihood \(\log p(A, Y|\rho , \pi )\) depends on the parameter matrix \(\pi \) of connection probabilities. Taking the log we have
Setting the derivative with respect to \(\pi _{qr}\) to zero, we obtain
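The resulting estimators are the familiar hard-assignment SBM M-step updates: \(\hat{\pi}_{qr}\) is the observed edge fraction between clusters q and r, and \(\hat{\rho}_q\) the cluster proportion. As an illustration only (not the authors' code), here is a minimal NumPy sketch computing both, assuming a directed binary network without self-loops and a one-hot membership matrix Y:

```python
import numpy as np

def sbm_m_step(A, Y):
    """M-step estimates for a directed SBM with hard assignments.

    A: (M, M) binary adjacency matrix, zero diagonal (no self-loops).
    Y: (M, Q) one-hot cluster-membership matrix.
    Returns (rho, pi): cluster proportions and connection probabilities.
    """
    M, Q = Y.shape
    rho = Y.sum(axis=0) / M                    # rho_q = (1/M) sum_i Y_iq
    edges = Y.T @ A @ Y                        # edges[q, r] = sum_{i != j} A_ij Y_iq Y_jr
    n_q = Y.sum(axis=0)
    pairs = np.outer(n_q, n_q) - np.diag(n_q)  # ordered pairs (i, j), i != j, per block pair
    with np.errstate(divide="ignore", invalid="ignore"):
        pi = np.where(pairs > 0, edges / pairs, 0.0)
    return rho, pi
```

For an undirected network, the edge and pair counts would each be halved, matching the remark below on Daudin et al. (2008).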
1.7 Appendix 7: Model selection
Assuming that the prior distribution over the model parameters \((\rho , \pi , \beta )\) can be factorized, the integrated complete data log-likelihood \(\log p(A, W, Y|K, Q)\) is given by
Note that the dependency on K and Q is made explicit here in all expressions; in the other sections of the paper we omitted these terms to keep the notation uncluttered. We find
Following the derivation of the ICL criterion, we apply a Laplace (BIC-like) approximation to the second term of Eq. (11). Moreover, considering a Jeffreys prior distribution for \(\rho \) and using Stirling's formula for large values of M, we obtain
as well as
For more details, we refer to Biernacki et al. (2000). Furthermore, we emphasize that adding these two approximations leads to the ICL criterion for the SBM model, as derived by Daudin et al. (2008)
In Daudin et al. (2008), \(M(M-1)\) is replaced by \(M(M-1)/2\) and \(Q^2\) by \(Q(Q+1)/2\) since they considered undirected networks.
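The criterion itself is not reproduced in this version of the text; combining the two BIC-like penalties above for a directed network, a plausible form (subject to the exact notation of the original) is:

```latex
ICL_{SBM} \;=\; \max_{\rho, \pi} \log p(A, Y \mid \rho, \pi, Q)
\;-\; \frac{Q^2}{2} \log\!\big( M(M-1) \big)
\;-\; \frac{Q-1}{2} \log M .
```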
Now, it is worth taking a closer look at the first term of Eq. (11). This term involves a marginalization over \(\beta \). Let us emphasize that \(p(W|A, Y, \beta , K, Q)\) is related to the LDA model and involves a marginalization over \(\theta \) (and Z). Because we aim at approximating the first term of Eq. (11) with a Laplace (BIC-like) approximation as well, it is crucial to identify the number of observations in the associated likelihood term \(p(W|A, Y, \beta , K, Q)\). As pointed out in Sect. 2.4, given Y (and \(\theta \)), it is possible to reorganize the documents in W as \(W=({\tilde{W}}_{qr})_{qr}\) in such a way that all words in \({\tilde{W}}_{qr}\) follow the same mixture distribution over topics. Each aggregated document \({\tilde{W}}_{qr}\) has its own vector \(\theta _{qr}\) of topic proportions and since the distribution over \(\theta \) factorizes (\(p(\theta )=\prod _{q,r=1}^{Q}p(\theta _{qr})\)), we find
where \(\ell ({\tilde{W}}_{qr}|\beta , K, Q)\) is exactly the likelihood term of the LDA model associated with document \({\tilde{W}}_{qr}\), as described in Blei et al. (2003). Thus
Applying a Laplace approximation on Eq. (12) is then equivalent to deriving a BIC-like criterion for the LDA model with documents in \(W=({\tilde{W}}_{qr})_{qr}\). In the LDA model, the number of observations in the penalization term of BIC is the number of documents [see Than and Ho (2012) for instance]. In our case, this leads to
Unfortunately, \(\log p(W|A, Y, \beta , K, Q)\) is not tractable and so we propose to replace it with its variational approximation \(\tilde{{\mathcal {L}}}\), after convergence of the C-VEM algorithm. By analogy with \(ICL_{SBM}\), we call the corresponding criterion \(BIC_{LDA|Y}\) such that
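The definition is not reproduced here. Under the Laplace approximation just described, with \(Q^2\) aggregated documents and \(K(V-1)\) free parameters in \(\beta \), a plausible form (the exact penalty in the original may differ) is:

```latex
BIC_{LDA|Y} \;=\; \max_{\beta} \tilde{\mathcal{L}}\left( R(\cdot); Y, \beta \right)
\;-\; \frac{K(V-1)}{2} \log\!\big( Q^2 \big),
\qquad
ICL_{STBM}(K, Q) \;=\; BIC_{LDA|Y} + ICL_{SBM} .
```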
Cite this article
Bouveyron, C., Latouche, P. & Zreik, R. The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput 28, 11–31 (2018). https://doi.org/10.1007/s11222-016-9713-7