Topic discovery and evolution in scientific literature based on content and citations

Frontiers of Information Technology & Electronic Engineering

Abstract

Researchers are increasingly interested in how important research topics evolve over time within the scientific literature. In a dataset of scientific articles, each document can be regarded as comprising both its own words and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model is a two-level topic model that uses citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
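
For readers unfamiliar with collapsed Gibbs sampling, the sketch below shows the technique for plain, single-level LDA over word tokens only. It is not the citation-content-LDA model described in the abstract: it has no citation level and no ‘father’ topics. The function name, the hyperparameters alpha and beta, and the toy corpus are all assumptions made for illustration.

```python
import random


def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative, not the paper's model).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts, assignments).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # n_{k,w}
    topic_total = [0] * n_topics                               # n_k
    assignments = []

    # Random initialisation of one topic assignment per token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current token from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional for each topic t:
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the token's topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, assignments


if __name__ == "__main__":
    # Hypothetical toy corpus: word ids 0-2 and 3-4 tend to co-occur.
    toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3]]
    doc_topic, topic_word, _ = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=5)
    print(doc_topic)
    print(topic_word)
```

A two-level model of the kind the abstract describes would presumably add a second set of counts and a second sampling step for citations; that extension is not shown here because the abstract does not specify it.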

Author information

Corresponding author

Correspondence to Hui-min Yu.

Additional information

Project supported by the National Basic Research Program (973) of China (No. 2012CB316400)

Electronic supplementary materials: The online version of this article (https://doi.org/10.1631/FITEE.1601125) contains supplementary materials, which are available to authorized users

Cite this article

Zhou, Hk., Yu, Hm. & Hu, R. Topic discovery and evolution in scientific literature based on content and citations. Frontiers Inf Technol Electronic Eng 18, 1511–1524 (2017). https://doi.org/10.1631/FITEE.1601125

  • DOI: https://doi.org/10.1631/FITEE.1601125
