Associative topic models with numerical time series
Introduction
Probabilistic topic models are probabilistic graphical models used to cluster a corpus by semantics (Blei et al., 2003, Griffiths and Steyvers, 2004, McCallum et al., 2005), and they have been successful in analyzing diverse sources of text as well as image data (Gupta and Manning, 2011, Hörster et al., 2007, Khurdiya et al., 2011, Ramage et al., 2010). The essence of topic models is the probabilistic modeling of how the words in documents are generated, given prior knowledge of the word distributions of the key ideas, called topics, in the corpus. Specifically, a topic is a probability distribution over a vocabulary, and topic models assume that a document is a mixture of multiple topics. These ideas are implicit because they cannot be directly observed, so such topics are often called latent topics (Blei, 2012). Latent Dirichlet allocation (LDA) (Blei et al., 2003), one of the most popular topic models, is a Bayesian approach to modeling such a generation process.
Given the success of LDA, many extensions of LDA have been introduced, and they strengthen the probabilistic process of generating topics and words with additional information. One type of such additional information is the meta-data of documents. For instance, the author-topic model (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004) includes an authorship model that produces estimates of authorship as well as better topic modeling results. Another type is prior information on the corpus characteristics. For instance, the aspect-sentiment unification model (Jo & Oh, 2011) utilizes the sentiment information of words as priors to estimate the sentiment of topics. While an extension is motivated by the additional information, it is often realized by either adding variables to the graphical model or calibrating the prior settings in the inference process.
While most variations of LDA augment the data, such as sentiment lexicons and authorships, from either the text itself or the meta-data of the corpus, several models relate text data to other types of data, such as geospatial data (Sizov, 2012) and document-level labels (Blei and McAuliffe, 2007, Chen et al., 2010, Ramage et al., 2009). This paper considers integrating texts and numerical data over time. We assume that there are events that become the common cause of dynamics in both the text and the numeric data. We are then interested in identifying representations of such events, which are not directly observable from the corpus or the numbers. For instance, latent events in the stock markets, such as interest rate changes or policy changes, cause the generation of both news articles and numerical data. Our goal is to represent such latent events with a word distribution per event. The key to identifying such events is the correlation between the texts and the numeric variables associated with the events. Such correlations between numerical and textual data are common in product sales (Liu, Huang, An, & Yu, 2007), candidate approval (Livne, Simmons, Adar, & Adamic, 2011), and box-office revenues (Asur & Huberman, 2010). In these domains, understanding why we see a numerical fluctuation given the textual context can provide a key insight, e.g., finding the topic that resulted in a huge drop of a stock index.
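To make the correlation signal concrete, the following sketch computes the Pearson correlation between a topic's appearance over time and a numeric series. The arrays `topic_proportion` and `stock_return` are hypothetical toy values for illustration, not data from the paper:

```python
import numpy as np

# Hypothetical weekly data: how often a "tax policy" topic appears
# in the news, and the stock return observed in the same week.
topic_proportion = np.array([0.05, 0.20, 0.35, 0.30, 0.10, 0.08])
stock_return = np.array([0.3, -0.5, -1.2, -0.9, 0.1, 0.2])

# Pearson correlation between the topic's time series and the returns;
# a strongly negative value suggests the topic tracks market drops.
corr = np.corrcoef(topic_proportion, stock_return)[0, 1]
```

In this toy example the topic's rise coincides with the drop in returns, so `corr` is close to -1; a topic with such a strong (anti-)correlation is exactly the kind of "associative topic" the paper targets.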
This paper introduces a new topic model called the associative topic model (ATM), which correlates time-series text data and numerical values. We define an associative topic as a soft cluster of unique words whose likelihood of cluster association is influenced by their appearances in documents as well as by the fluctuation of the numeric values over time. The association likelihood of words to clusters, or associative topics, is inferred to maximize a joint probability that includes factors for text generation and for numeric value generation through the topic proportions over time. The topic proportion is the likelihood of a cluster, or topic, appearing at a certain time, which means that our data consist of timed batches of texts and numbers over the period. In other words, the model assumes that the text and the numerical value in the same timed batch are generated from a common latent random variable of topic proportions, which indicates the relative ratio of topic appearances at that time (Blei and Lafferty, 2006, Putthividhy et al., 2010). For instance, a topic on tax could strongly influence the generation of economic articles as well as stock prices. We then interpret an increase in the proportion of the tax topic as the event influencing both the numerical-data and the text-data generation.
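A minimal sketch of this shared-cause assumption follows. All distributions and parameter values here are illustrative, not the paper's exact specification: topic proportions at each time step are drawn from a Gaussian and pushed through a softmax (a logistic-normal, as in dynamic topic models); words are then drawn from topic-specific word distributions, while the numeric value is a noisy linear function of the same proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 5  # number of topics, vocabulary size
topics = rng.dirichlet(np.ones(V), size=K)  # word distribution per topic
eta = np.array([1.5, -2.0, 0.5])  # illustrative regression weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(n_words=50):
    # Shared latent variable: topic proportions for this timed batch,
    # a Gaussian draw mapped through the softmax.
    theta = softmax(rng.normal(size=K))
    # Text side: each word picks a topic, then a word from that topic.
    z = rng.choice(K, size=n_words, p=theta)
    words = np.array([rng.choice(V, p=topics[k]) for k in z])
    # Numeric side: linear function of the same proportions plus noise.
    y = eta @ theta + rng.normal(scale=0.1)
    return theta, words, y

theta, words, y = generate()
```

Because `theta` drives both `words` and `y`, a shift in a topic's proportion moves the text content and the numeric series together, which is the correlation ATM exploits.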
This assumption enables the numerical value to adjust the topic extraction from texts through its movements. Fig. 1 describes an example of an associative topic. An associative topic provides summarized information about the text data, and its appearance over time is highly related to the numerical time series. Such associative topics are useful for interpretation and prediction. For interpretation, associative topics hold more information about the numerical context than do topics extracted from texts alone. Associative topics also enable the prediction of the numerical value at the next time step with high accuracy. Fig. 2 explains the inputs and the outputs of ATM. A numerical time series and a text time series are the input data for ATM. ATM provides the associated topics with correlation indexes, along with topic proportions over time indicating the dynamics of topic appearances in the text data. Additionally, ATM can predict the numerical value at the next time step given the text data at that time.
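The prediction use can be sketched as a regression from per-time topic proportions to the numeric value, then applying the fitted weights to the proportions of the next time step's text. The numbers below are toy values, and ATM itself infers the proportions and weights jointly rather than in this two-stage fashion:

```python
import numpy as np

# Toy per-time topic proportions (rows: time steps, cols: K=2 topics)
# and the numeric value observed at each time step.
Theta = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
y = np.array([1.1, 0.7, -0.2, -0.8])

# Fit linear weights mapping topic proportions to the numeric value.
w, *_ = np.linalg.lstsq(Theta, y, rcond=None)

# Predict the next-time value from the next-time text's proportions.
theta_next = np.array([0.5, 0.5])
y_next = theta_next @ w
```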
Inherently, the proposed model infers associative topics that are adjusted to better explain the associated numerical values. When we apply the model to three datasets, an economic news corpus paired with the stock return of the Dow Jones Industrial Average (DJIA), the same news corpus paired with the stock volatility of the DJIA, and a news corpus on the president paired with the presidential approval index, we observe that the model provides adjusted topics that explain each numerical value's movement. Associative topics extracted from the proposed model have a higher correlation with the time-series data when there is a relationship between the text and the numerical data. The model also becomes a forecasting model that predicts the next-time variable by considering the past text and numerical data. We experimented with the model to predict the numerical time-series variable, and we found that the proposed model provides topics better adapted to the time-series values and predicts the next numerical value better than previous approaches do. The contributions of the paper are summarized as follows:
- We introduce a model to analyze a bi-modal dataset consisting of text and numerical time-series data, which have different characteristics. This is the first unified topic model for such a numerical-and-text bimodal dataset that does not batch-process two separate models, one per data type.
- In the parameter inference of the presented model, we introduce a method for estimating the expectation of the softmax function with Gaussian inputs, which is unavoidable when relating the numerical and the text data. This inference technique is applicable to diverse cases of joining two datasets originating from continuous and discrete domains.
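The quantity in question, E[softmax(x)] for Gaussian x, has no closed form; the paper uses an approximation in its inference (cf. Bouchard, 2007 in the references), but a plain Monte Carlo estimate illustrates what is being approximated. The function name and diagonal-covariance assumption below are ours, for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expected_softmax(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[softmax(x)], x ~ N(mu, diag(sigma^2))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    return softmax(x).mean(axis=0)

mu = np.array([1.0, 0.0, -1.0])
sigma = np.array([0.5, 0.5, 0.5])
p = expected_softmax(mu, sigma)
# Note: E[softmax(x)] differs from softmax(E[x]) in general,
# and the gap grows with the variance of x.
```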
Previous work
There have been several attempts to integrate side information with topic models. Broadly, three kinds of side information have been considered: document-level features (e.g., categories, sentiment labels, hits), word-level features (e.g., term volume), and corpus-level features (e.g., product sales, stock indexes). For document-level features, several topic models (Blei and McAuliffe, 2007, Ramage et al., 2009, Zhu et al., 2012) have been suggested to incorporate features of a document with the
Associative topic models with numerical time series
This section describes our proposed model, ATM, which infers associative topics jointly influenced by the texts and the numeric values. Fundamentally, ATM is a Bayesian network model, so the first subsection provides its probabilistic graphical model and the independence/causality structure among the modeled variables. The first subsection also compares and contrasts ATM with its predecessor, DTM, and provides a description of the generative process of
Empirical experiment
This section demonstrates the utility of ATM in both explaining and predicting time-series values. We applied ATM to a financial news corpus and stock indexes, as well as to a news corpus related to the president and the presidential approval index. Section 4.1 gives a detailed description of the datasets. Sections 4.2 (Overview of experimental design) and 4.3 (Baseline) describe the overview of our experiments and the baseline models, such as autoregressive models (AR), LDA, DTM, and ITMTF. Through
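Among the baselines, the autoregressive model is the simplest to state; a first-order AR baseline fitted by least squares can be sketched as follows. The series is a toy example, and the lag order is chosen only for illustration:

```python
import numpy as np

# Toy numeric time series (e.g., daily stock returns).
y = np.array([0.5, 0.3, 0.4, 0.1, -0.2, -0.1, 0.0, 0.2])

# AR(1): regress y_t on [1, y_{t-1}], then forecast one step ahead.
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
y_next = coef[0] + coef[1] * y[-1]
```

Unlike ATM, this baseline sees only the numeric history, which is why text-aware models can outperform it when the corpus carries predictive context.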
Conclusion
This paper proposes the associative topic model (ATM) to better capture the relationship between numerical data and text data over time. We tested the proposed model with financial indexes and the presidential approval index. The numeric-text pair in the financial domain consists of economic news articles and the stock return as well as the volatility. The pair in the politics domain consists of news articles on the president and the presidential approval index. Our experiments show
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012R1A1A1044575).
References (38)
- et al. On smoothing and inference for topic models.
- et al. Predicting the future with social media.
- et al. Using clustering methods for discovering event structures.
- Probabilistic topic models. Communications of the ACM (2012).
- et al. Dynamic topic models.
- et al. Supervised topic models. Proceedings of NIPS'07 (2007).
- et al. Latent Dirichlet allocation. The Journal of Machine Learning Research (2003).
- Bouchard, G. (2007). Efficient bounds for the softmax function and applications to approximate inference in hybrid...
- Chang, J., Gerrish, S., Wang, C., Boyd-graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret...
- Chen, X., Li, L., Xu, G., Yang, Z., & Kitsuregawa, M. (2012). Recommending related microblogs: A comparison between...