Mining coherent topics in documents using word embeddings and large-scale text data

https://doi.org/10.1016/j.engappai.2017.06.024

Highlights

  • A novel knowledge mining method for topic modeling is proposed.

  • A new topic model that can handle the knowledge encoded by word embeddings.

  • Our method outperforms six state-of-the-art knowledge-based topic models.

Abstract

Probabilistic topic models have been extensively used to extract low-dimensional aspects from document collections. However, without any human knowledge, such models often generate topics that are not interpretable. Recently, a number of knowledge-based topic models have been proposed that enable users to input prior domain knowledge to produce more meaningful and coherent topics. Word embeddings, on the other hand, can automatically capture both semantic and syntactic information of words from large document collections and can be used to measure word similarities. In this paper, we incorporate word embeddings obtained from a large number of domains into topic modeling. By combining Latent Dirichlet Allocation, a widely used topic model, with Skip-Gram, a well-known framework for learning word vectors, we improve semantic coherence significantly. Evaluation results on product review documents from 100 domains demonstrate the effectiveness of our method.

Introduction

The explosive growth of online text content, such as Twitter messages, blogs, news and product reviews, has created the challenge of understanding this vast and dynamic sea of text. To meet this challenge, we need to discover concepts from massive amounts of text.

A number of text mining tasks, especially aspect extraction tasks, rely on probabilistic topic models such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) (Hofmann, 1999; Blei et al., 2003). However, these unsupervised models, lacking any human knowledge, often produce topics that are difficult to interpret; in other words, they fail to yield semantically coherent concepts (Chang et al., 2009; Mimno et al., 2011).

To overcome the poor interpretability of topic models, especially LDA, previous work has incorporated prior domain knowledge into topic modeling in different ways. However, these approaches either cannot learn knowledge automatically or fail to exploit data from multiple domains sufficiently.

Topic models such as LDA rely on the bag-of-words representation and document-level word co-occurrence to assign a topic to each word observation in the corpus. Word embeddings (Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Collobert et al., 2011; Huang et al., 2012; Mikolov et al., 2013) likewise perform dimensionality reduction based on co-occurrence information, but focus more on local context and word order to learn a low-dimensional dense vector for each word. Word embeddings aim to explicitly encode many semantic relationships, as well as linguistic regularities and patterns, in the embedding space. For example, the result of the vector calculation vec(“Madrid”) − vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector, and “Spain” lies close to “France” in the embedding space. Since similar words are close in the embedding space, we can exploit the word-correlation knowledge encoded by word embeddings.
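To make this concrete, the following is a minimal sketch (not from the paper) of the analogy and similarity queries described above, using the gensim implementation of Skip-Gram; the toy corpus, parameter values and variable names are illustrative assumptions.

```python
# Minimal sketch: Skip-Gram word vectors and the Madrid/Spain/France analogy.
# The two-sentence corpus below is a toy stand-in; in practice the vectors
# would be trained on a large, relevant corpus as the paper describes.
from gensim.models import Word2Vec

sentences = [
    ["madrid", "is", "the", "capital", "of", "spain"],
    ["paris", "is", "the", "capital", "of", "france"],
]
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

# vec("madrid") - vec("spain") + vec("france"): with enough training data,
# "paris" is expected to rank near the top.
print(model.wv.most_similar(positive=["madrid", "france"],
                            negative=["spain"], topn=3))

# Cosine similarity between two words, the signal later used to derive
# word-correlation (must-link) knowledge.
print(model.wv.similarity("spain", "france"))
```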

In this paper, we improve on previous knowledge-based topic models by proposing a new probabilistic method, called Word Embedding LDA (WE-LDA), which combines a topic model with word embeddings, specifically LDA and Skip-Gram (Mikolov et al., 2013). The proposed method explicitly models document-level word co-occurrence in the corpus together with word-correlation knowledge encoded by word vectors that are automatically learned from a large amount of relevant data, allowing it to extract more coherent topics from documents.

The contributions of this paper are threefold: (1) it proposes a novel knowledge mining method for topic modeling based on word embeddings; (2) it provides a novel knowledge-based topic model that can properly handle the knowledge encoded by word embeddings; (3) comprehensive experimental results on two large e-commerce datasets demonstrate that our method outperforms six state-of-the-art knowledge-based topic models.

We begin by reviewing related work, including studies devoted to improving the semantic coherence of topic models, mainly by incorporating domain knowledge; studies that measure the coherence of topic models; and studies that focus on learning word representations. In the remainder of the paper, we first describe our model and then empirically evaluate it on real-world datasets and analyze the experimental results. Experiments on two large product review datasets show the effectiveness of our method.

Section snippets

Knowledge-based topic models

To overcome the poor interpretability of topic models, especially LDA, some previous works incorporate prior domain knowledge into topic modeling. For instance, Andrzejewski and Zhu (2009) proposed topic-in-set knowledge, which restricts the topic assignment of words to a subset of topics. Andrzejewski et al. (2011) extended topic-in-set knowledge (Andrzejewski and Zhu, 2009) by incorporating general knowledge specified in first-order logic. Similarly, Chemudugunta et al. (2008) proposed

The WE-LDA model

The proposed WE-LDA model consists of three steps. First, we run LDA and select topical words as seed words of a corpus. Then we use word vectors to generate the must-link knowledge base. Finally, we apply the generalized Pólya urn (GPU) method (Mahmoud, 2008; Mimno et al., 2011), the key technique for incorporating must-links into Gibbs sampling, to find more semantically coherent topics. The first two steps aim to generate high-quality prior knowledge for topic modeling. The
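As a rough illustration of the first two steps, the sketch below (an assumption, not the authors' released code) selects seed words with plain LDA and then must-links seed pairs whose Skip-Gram cosine similarity exceeds a threshold; the function name, parameter values and the choice of gensim are all illustrative, and the GPU-based Gibbs sampler that consumes the must-links is omitted.

```python
# Sketch of WE-LDA steps 1-2: seed-word selection with plain LDA, then
# must-link construction from Skip-Gram word similarities. Thresholds,
# topic counts and the use of gensim are illustrative assumptions.
from gensim import corpora
from gensim.models import LdaModel, Word2Vec

def build_must_links(docs, num_topics=15, top_n=20, sim_threshold=0.5):
    """docs: list of token lists from one domain corpus."""
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    # Step 1: run standard LDA and take the top-n words of each topic as seeds.
    lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary, passes=10)
    seeds = set()
    for k in range(num_topics):
        seeds.update(word for word, _ in lda.show_topic(k, topn=top_n))

    # Step 2: train Skip-Gram vectors (the paper learns them from a large
    # multi-domain review collection; here the same docs are reused) and
    # must-link seed pairs whose cosine similarity exceeds the threshold.
    w2v = Word2Vec(docs, vector_size=100, window=5, sg=1, min_count=5)
    seed_list = [w for w in seeds if w in w2v.wv]
    must_links = []
    for i, wi in enumerate(seed_list):
        for wj in seed_list[i + 1:]:
            if w2v.wv.similarity(wi, wj) >= sim_threshold:
                must_links.append((wi, wj))

    # Step 3 (not shown): feed the must-links to a generalized Polya urn
    # Gibbs sampler.
    return must_links
```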

Experimental results

This section evaluates the proposed WE-LDA model and compares it with seven state-of-the-art baseline models:

  • LDA (Blei et al., 2003): A classic unsupervised topic model.

  • LDA-GPU (Mimno et al., 2011): LDA with GPU, an unsupervised topic model. Specifically, LDA-GPU applies the GPU in LDA using co-document frequency (a minimal sketch of the GPU count update follows this list).

  • GK-LDA (Chen et al., 2013a): A knowledge-based topic model. It uses the ratio of word probabilities under each topic to reduce the effect of wrong knowledge.
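Both LDA-GPU and WE-LDA rely on the generalized Pólya urn scheme mentioned above. The following is a minimal sketch of the count update as we understand it (an assumption about the mechanism, not code from the paper); the promotion factor, data structures and example words are hypothetical.

```python
# Generalized Polya urn (GPU) count update inside collapsed Gibbs sampling:
# when word w is assigned to topic k, every word m must-linked with w also
# has its topic-k count promoted by a factor mu < 1, pulling correlated
# words into the same topic. All names and values here are illustrative.
from collections import defaultdict

def gpu_increment(n_kw, n_k, w, k, must_link, mu=0.3):
    """n_kw: (topic, word) -> pseudo-count; n_k: topic -> total pseudo-count;
    must_link: word -> set of must-linked words."""
    n_kw[(k, w)] += 1.0
    n_k[k] += 1.0
    for m in must_link.get(w, ()):
        n_kw[(k, m)] += mu   # promote correlated words
        n_k[k] += mu

# Usage: counts are floats because promotions are fractional.
n_kw, n_k = defaultdict(float), defaultdict(float)
gpu_increment(n_kw, n_k, w="battery", k=2,
              must_link={"battery": {"charger", "power"}})
```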

Conclusion

This paper has presented WE-LDA, which combines a topic model with word embeddings, in particular LDA and Skip-Gram. The proposed method models document-level word co-occurrence together with knowledge encoded by word vectors automatically learned from a large amount of relevant text data, and can thus extract more coherent topics. Experimental results on real-world e-commerce datasets show the effectiveness of the proposed method. We can conclude that the semantic similarity of word vectors learned from

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61572434), the China Knowledge Centre for Engineering Sciences and Technology (No. CKCEST-2017-1-3), the Zhejiang Provincial Natural Science Foundation of China (No. LY14F020027) and the Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP) (20130101110136).

References (32)

  • Yao, L., et al., 2016. Concept over Time: the combination of probabilistic topic model with Wikipedia knowledge. Expert Syst. Appl.
  • Andrzejewski, D., et al. Latent Dirichlet allocation with topic-in-set knowledge.
  • Andrzejewski, D., et al. Incorporating domain knowledge into topic modeling via Dirichlet forest priors.
  • Andrzejewski, D., Zhu, X., Craven, M., Recht, B., 2011. A framework for incorporating general domain knowledge into...
  • Baroni, M., Dinu, G., Kruszewski, G., 2014. Don’t count, predict! A systematic comparison of context-counting vs....
  • Bengio, Y., et al., 2003. A neural probabilistic language model. J. Mach. Learn. Res.
  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
  • Chang, J., Gerrish, S., Wang, C., Boyd-graber, J.L., Blei, D.M., 2009. Reading tea leaves: How humans interpret topic...
  • Chemudugunta, C., et al., 2008. Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning.
  • Chen, Z., et al. Mining topics in documents: standing on the shoulders of big data.
  • Chen, Z., Liu, B., 2014b. Topic modeling using topics from many domains, lifelong learning and big data. In: ICML, pp....
  • Chen, Z., Mukherjee, A., Liu, B., 2014. Aspect extraction with automated prior knowledge learning. In: ACL, pp....
  • Chen, Z., et al. Discovering coherent topics using general knowledge.
  • Chen, Z., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M., Ghosh, R., 2013b. Exploiting domain knowledge in aspect...
  • Chen, Z., et al. Leveraging multi-domain prior knowledge in topic models.
  • Chuang, J., Gupta, S., Manning, C., Heer, J., 2013. Topic model diagnostics: Assessing domain relevance via topical...