Associative topic models with numerical time series

https://doi.org/10.1016/j.ipm.2015.06.007

Highlights

  • Introduce a probabilistic graphical model that extracts topics with numerical guidance.

  • Enhance regression performance with a unified PGM of text and numbers.

  • Tightly link the analyses of numeric and text data over time.

Abstract

A series of events generates multiple types of time-series data, such as numeric and text data over time, and each data type captures the events from a different angle. This paper aims to integrate the analyses of such numerical and textual time-series data, influenced by common events, in a single model to better understand the events. Specifically, we present a topic model, called the associative topic model (ATM), which finds soft clusters of time-series text data guided by a numerical time series. The identified clusters are represented as word distributions per cluster, and these word distributions indicate what the corresponding events were. We applied ATM to financial indexes and presidential approval ratings. First, ATM identifies topics associated with the characteristics of the time-series data from the multiple data types. Second, ATM predicts numerical time-series data with a higher level of accuracy than the iterative model, as supported by lower mean squared errors.

Introduction

Probabilistic topic models are probabilistic graphical models used to cluster a corpus by semantics (Blei et al., 2003, Griffiths and Steyvers, 2004, McCallum et al., 2005), and they have been successful in analyzing diverse sources of text as well as image data (Gupta and Manning, 2011, Hörster et al., 2007, Khurdiya et al., 2011, Ramage et al., 2010). The essence of topic models is the probabilistic modeling of how the words in documents are generated, given prior knowledge of word distributions per key idea, called a topic, in the corpus. Specifically, a topic is a probability distribution over a vocabulary, and topic models assume that a document is a mixture of multiple topics. These key ideas cannot be directly observed, so they are often described as latent topics (Blei, 2012). Latent Dirichlet allocation (LDA) (Blei et al., 2003), one of the most popular topic models, is a Bayesian approach to modeling such a generation process.
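As a concrete illustration, the generative process that LDA assumes can be sketched as follows. The topic count, vocabulary size, and Dirichlet hyperparameters below are arbitrary illustrative values, not values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N_WORDS = 3, 50, 20          # topics, vocabulary size, words per document

alpha = np.full(K, 0.1)            # document-topic Dirichlet prior (illustrative)
beta = np.full(V, 0.01)            # topic-word Dirichlet prior (illustrative)

phi = rng.dirichlet(beta, size=K)  # K topic distributions over the vocabulary

def generate_document():
    """Draw one document under LDA's generative assumptions."""
    theta = rng.dirichlet(alpha)            # topic mixture for this document
    words = []
    for _ in range(N_WORDS):
        z = rng.choice(K, p=theta)          # draw a topic assignment
        w = rng.choice(V, p=phi[z])         # draw a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document()
```

Inference in LDA runs this process in reverse: given only the observed words, it estimates the latent `phi` and `theta`.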

Given the success of LDA, many extensions of LDA have been introduced; they strengthen the probabilistic process of generating topics and words with additional information. One type of additional information is document meta-data. For instance, the author-topic model (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004) models authorship to produce estimates of authorship as well as better topic modeling results. Another type is prior information on corpus characteristics. For instance, the aspect-sentiment unification model (Jo & Oh, 2011) utilizes the sentiment information of words as priors to estimate the sentiment of topics. While such extensions are motivated by the additional information, they are often realized by either adding variables to the graphical model or calibrating prior settings in the inference process.

While most variants of LDA draw their additional data, such as sentiment lexicons and authorship, from either the text itself or the corpus meta-data, several models relate text data to other types of data, such as geospatial data (Sizov, 2012) and document-level labels (Blei and McAuliffe, 2007, Chen et al., 2010, Ramage et al., 2009). This paper considers integrating texts and numerical data over time. We assume that there are events that are the common cause of the dynamics in both the text and the numeric data. We are then interested in identifying representations of such events, which are not directly observable from the corpus or the numbers. For instance, latent events in the stock markets, such as interest rate changes or policy changes, cause the generation of both news articles and numerical data. Our goal is to represent such latent events with word distributions per event. The key to identifying such events is the correlation between the textual and numeric variables associated with them. Such correlations between numerical and textual data are common in product sales (Liu, Huang, An, & Yu, 2007), candidate approval (Livne, Simmons, Adar, & Adamic, 2011), and box-office revenues (Asur & Huberman, 2010). In these domains, understanding why we see a numerical fluctuation, with the textual context, can provide a key insight, e.g., finding the topic that resulted in a huge drop of a stock index.

This paper introduces a new topic model, called the associative topic model (ATM), which correlates time-series text data with time-series numerical values. We define an associative topic as a soft cluster of unique words whose likelihood of cluster association is influenced by their appearances in documents as well as by the fluctuation of the numeric values over time. The association likelihood of words to clusters, or associative topics, is inferred to maximize the joint probability, which includes factors for text generation and for numeric value generation through the topic proportions over time. The topic proportion is the likelihood of a cluster, or topic, appearing at a certain time, which means that our data consist of timed batches of texts and numbers over the period. In other words, the model assumes that the text and the numerical value in the same timed batch are generated from a common latent random variable of topic proportions, which indicates the relative ratio of topic appearances at that time (Blei and Lafferty, 2006, Putthividhy et al., 2010). For instance, a topic on tax could be strongly related to, and influence, the generation of economic articles as well as stock prices. We then interpret an increase in the proportion of the tax topic as the event influencing both the numerical-data and the text-data generation.
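A minimal sketch of the kind of joint generative assumption described above: a smoothly evolving latent Gaussian state whose softmax gives the topic proportions, which in turn generate both the words and the numeric value at each time step. All parameter values, the random-walk dynamics, and the linear form of the numeric response are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, N_WORDS = 4, 100, 30         # topics, vocabulary size, words per time slice

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

phi = rng.dirichlet(np.full(V, 0.01), size=K)   # topic-word distributions
w_coef = rng.normal(size=K)                     # regression weights: topics -> number
sigma_y = 0.1                                   # observation noise for the numeric series

def generate_time_slice(eta_prev):
    # Latent state evolves as a Gaussian random walk (dynamic-topic-model style).
    eta = eta_prev + rng.normal(scale=0.1, size=K)
    theta = softmax(eta)                        # topic proportions at this time step
    z = rng.choice(K, size=N_WORDS, p=theta)    # topic assignment per word
    words = [rng.choice(V, p=phi[k]) for k in z]
    y = theta @ w_coef + rng.normal(scale=sigma_y)  # numeric value from the same theta
    return eta, theta, words, y

eta = np.zeros(K)
series = []
for t in range(5):
    eta, theta, words, y = generate_time_slice(eta)
    series.append((theta, words, y))
```

The key point of the sketch is that `theta` is shared: the same topic proportions drive both the word counts and the numeric observation, so observing the numbers carries information about the topics and vice versa.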

This assumption enables the numerical value to adjust the topic extraction from texts through its movements. Fig. 1 describes an example of an associative topic. The associative topic provides summarized information about the text data, and its appearance over time is highly related to the numerical time-series data. Such associative topics are useful for interpretation and prediction. For interpretation, associative topics hold more information about the numerical context than do topics extracted from text alone. Associative topics also enable accurate prediction of the numerical value at the next time step. Fig. 2 explains the inputs and outputs of ATM. A numerical time series, y1:T, and a text time series, D1:T, are the inputs to ATM. ATM provides associative topics with correlation indexes, and topic proportions over time that indicate the dynamics of topic appearances in the text data. Additionally, ATM can predict the next numerical value, yT+1, given the next text data, DT+1.

Inherently, the proposed model infers associative topics that are adjusted to better explain the associated numerical values. We applied the model to three datasets: an economic news corpus paired with the stock return of the Dow Jones Industrial Average (DJIA), the same news corpus paired with the stock volatility of the DJIA, and a news corpus related to the president paired with the presidential approval index. In each case, the model provides adjusted topics that explain the movements of the numerical value. Associative topics extracted from the proposed model have a higher correlation with the time-series data when there is a relationship between the text and the numerical data. The model also serves as a forecasting model that predicts the next value by considering the past text and numerical data. In our experiments on predicting the numerical time-series variable, we found that the proposed model provides topics better adapted to the time-series values and predicts the next numerical value better than previous approaches. The contributions of the paper are summarized as follows:

  • We introduce a model to analyze a bi-modal dataset of text and numerical time-series data, which have different characteristics. This is the first unified topic model for such a numerical and text bi-modal dataset that does not require batch-processing two different models, one for each data type.

  • In the parameter inference of the presented model, we introduce a method for estimating the expectation of the softmax function with Gaussian inputs, which is required when relating the numerical and the text data. This inference technique is applicable to diverse cases of joining two datasets originating from continuous and discrete domains.
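The quantity in question, E[softmax(η)] for Gaussian η, has no closed form. The paper derives an analytic approximation for inference; as an illustrative reference for what is being approximated (not the paper's method), a plain Monte Carlo estimate looks like this:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, numerically stabilized by subtracting the max."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expected_softmax_mc(mu, cov, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[softmax(eta)] for eta ~ N(mu, cov)."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, cov, size=n_samples)
    return softmax(eta).mean(axis=0)

# Illustrative parameters: E[softmax(eta)] != softmax(E[eta]) in general,
# which is why a dedicated approximation is needed during inference.
mu = np.array([1.0, 0.0, -1.0])
cov = 0.25 * np.eye(3)
p = expected_softmax_mc(mu, cov)
```

Monte Carlo is far too slow to embed inside an iterative variational update, which motivates deterministic bounds such as the one by Bouchard (2007) cited in the references.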

Section snippets

Previous work

There have been several attempts to integrate side information with topic models. Three kinds of side information have mainly been considered: document-level features (e.g., categories, sentiment labels, hits), word-level features (e.g., term volume), and corpus-level features (e.g., product sales, stock indexes). For document-level features, several topic models (Blei and McAuliffe, 2007, Ramage et al., 2009, Zhu et al., 2012) have been suggested to incorporate features of a document with the

Associative topic models with numerical time series

This section provides a description of our proposed model, ATM, which infers the associative topics jointly influenced by the texts and the numeric values. Fundamentally, ATM is a Bayesian network model, so the first subsection presents its probabilistic graphical model and the independence/causality structure among the modeled variables. The first subsection also compares and contrasts ATM with its predecessor model, DTM, and provides a description of the generative process of

Empirical experiment

This section demonstrates the utility of ATM in both explaining and predicting the time-series values. We applied ATM to a financial news corpus and stock indexes, as well as to a news corpus related to the president and the presidential approval index. Section 4.1 gives a detailed description of the datasets. Sections 4.2 and 4.3 describe the overview of our experimental design and the baseline models, such as autoregressive (AR) models, LDA, DTM, and ITMTF. Through
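For context on the autoregressive baselines mentioned above, a one-step-ahead AR(1) forecast evaluated by mean squared error can be sketched as follows. This is an illustrative least-squares AR(1) fit on synthetic data, not the paper's exact experimental setup:

```python
import numpy as np

def ar1_forecast_mse(y, train_frac=0.8):
    """Fit an AR(1) model y_t = a + b*y_{t-1} by least squares on a training
    prefix, then report the MSE of one-step-ahead forecasts on the rest."""
    split = int(len(y) * train_frac)
    b, a = np.polyfit(y[:split - 1], y[1:split], 1)   # slope, intercept
    pred = a + b * y[split - 1:-1]                    # forecast y[split:] from y[split-1:-1]
    return float(np.mean((pred - y[split:]) ** 2))

# Demo on a synthetic AR(1) series with known dynamics.
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.1)
mse = ar1_forecast_mse(y)
```

A text-aware model such as ATM is compared against this kind of purely numerical baseline: if the topics carry predictive information, its test MSE should fall below the AR forecast's.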

Conclusion

This paper proposes the associative topic model (ATM) to better capture the relationship between numerical data and text data over time. We tested the proposed model with financial indexes and a presidential approval index. The pair of numeric and text data in the financial domain consists of economic news articles paired with the stock return as well as the volatility. The pair in the politics domain consists of news articles on the president and the presidential approval index. Our experiments show

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012R1A1A1044575).

References (38)

  • A. Asuncion et al. On smoothing and inference for topic models.
  • S. Asur et al. Predicting the future with social media.
  • C.A. Bejan et al. Using clustering methods for discovering event structures.
  • D.M. Blei. Probabilistic topic models. Communications of the ACM (2012).
  • D.M. Blei et al. Dynamic topic models.
  • D.M. Blei et al. Supervised topic models. Proceedings of NIPS'07 (2007).
  • D.M. Blei et al. Latent Dirichlet allocation. The Journal of Machine Learning Research (2003).
  • Bouchard, G. (2007). Efficient bounds for the softmax function and applications to approximate inference in hybrid...
  • Chang, J., Gerrish, S., Wang, C., Boyd-graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret...
  • Chen, X., Li, L., Xu, G., Yang, Z., & Kitsuregawa, M. (2012). Recommending related microblogs: A comparison between...
  • B. Chen et al. What is an opinion about? Exploring political standpoints using opinion scoring model.
  • Dufresne, D. (2008). Sums of lognormals. In Proceedings of the 43rd actuarial research...
  • X. Gao et al. Asymptotic behavior of tail density for sum of correlated lognormal variables. International Journal of Mathematics and Mathematical Sciences (2009).
  • T.L. Griffiths et al. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (2004).
  • Gupta, S., & Manning, C. (2011). Analyzing the dynamics of research by extracting key aspects of scientific papers. In...
  • L. Hong et al. Tracking trends: Incorporating term volume into temporal topic models.
  • E. Hörster et al. Image retrieval on large-scale image databases.
  • Jaakkola, T.S. (2001). Tutorial on variational approximation methods. In Advanced mean field methods: Theory and...
  • Y. Jo et al. Aspect and sentiment unification model for online review analysis.