1 Introduction

Latent tree models (LTMs) are tree-structured probabilistic graphical models where the leaf nodes represent observed variables, while the internal nodes represent latent variables. Special LTMs such as phylogenetic trees (Durbin et al. 1998) and latent class models (Bartholomew and Knott 1999) have been studied for decades. General LTMs were first investigated by Zhang (2004), where they are called hierarchical latent class models. LTMs can be used for latent structure discovery (Zhang et al. 2008a, 2008b; Chen et al. 2008), density estimation (Wang et al. 2008), and clustering (Chen et al. 2012; Poon et al. 2010). This paper is concerned with the use of LTMs for clustering. We consider only the case where all variables (observed and latent) are discrete.

Previous algorithms for learning LTMs can be divided into three groups. Algorithms in the first group aim at finding models that are optimal according to a scoring metric. They conduct search in the space of LTMs and are hence called search algorithms (Zhang 2004; Zhang and Kocka 2004; Chen et al. 2012). Algorithms in the second group introduce latent variables based on results of attribute clustering (Harmeling and Williams 2011; Mourad et al. 2011). We call them AC-based algorithms, where AC stands for attribute clustering. Algorithms in the third group are inspired by work on phylogenetic tree reconstruction (PTR) (Choi et al. 2011; Anandkumar et al. 2011). We will refer to them as PTR-motivated algorithms.

Empirical results reported by Mourad et al. (2013) indicate that search algorithms usually find the best models on small data sets with dozens of attributes, while AC-based algorithms can handle data sets with as many as 10,000 attributes. PTR-motivated algorithms have theoretical guarantees, but they require that all latent variables have the same number of states and that this number be known beforehand.

Liu et al. (2012) propose an algorithm, called the bridged-islands (BI) algorithm, for learning LTMs. The algorithm aims at finding flat LTMs where each latent variable is directly connected to at least one observed variable. All the observed variables that are directly connected to a given latent variable are said to be siblings and form a sibling cluster. Similar to AC-based algorithms, BI determines potential siblings by considering how closely correlated each pair of observed variables is. Unlike AC-based algorithms, BI uses a novel procedure, called the uni-dimensionality (UD) test, to detect highly correlated observed variables that should be siblings. Initial sibling clusters are thereby created. One latent variable is then introduced for each sibling cluster and the latent variables are connected to form a tree structure using Chow-Liu’s algorithm (Chow and Liu 1968). There is also a final global adjustment step that refines the model.

This paper builds upon and extends Liu et al. (2012). We compare BI with previous algorithms for learning LTMs in terms of computational complexity and model quality. We also study the impact of key parameters and sample size on the performance of BI. A variety of experiments on both synthetic and real-world data sets are presented. As it turns out, BI is significantly more efficient than search algorithms. It can handle data sets with hundreds of attributes. Moreover, it produces better models on such data sets than AC-based and PTR-motivated algorithms. As in Liu et al. (2012), we also use BI as a clustering method that produces multiple partitions of data, and compare BI with previous methods for the task that are not based on LTMs.

This paper is divided into two parts. In the first part, we briefly review LTMs (Sect. 2) and previous algorithms for learning LTMs (Sect. 3). Then we describe the BI algorithm (Sect. 4) and compare it with previous algorithms (Sect. 5). In the second part, we survey previous methods for multi-partition clustering (Sect. 6) and investigate the use of BI for the task (Sects. 7–9). The paper concludes in Sect. 10.

2 Latent tree models

A latent tree model (LTM) is a Markov random field over an undirected tree, where variables at leaf nodes are observed and variables at internal nodes are hidden. An example LTM is shown in Fig. 1. For technical convenience, we often root an LTM at one of its latent nodes and regard it as a directed graphical model, i.e., a Bayesian network (Pearl 1988). In the example, suppose that we root the model at the node AS. Then the numerical information of the model includes a marginal distribution P(AS) for the root and one conditional distribution for each edge. For the edge AS–MG, for instance, we have the distribution P(MG∣AS), which characterizes the dependence of MG on AS. The product of all these distributions defines a joint distribution over all the latent and observed variables. Note that we can choose to root the model at any of the latent nodes, because the choice of root node does not change the joint distribution (Zhang 2004). In other words, different choices of root node lead to an equivalence class of directed tree models.

Fig. 1

An example LTM for high school students. Diagram (a) shows the structure of the model. It consists of two discrete latent variables AS (Analytical Skill) and LS (Literacy Skill), and four discrete observed variables MG (Math Grade), SG (Science Grade), EG (English Grade) and HG (History Grade). We simply assume that all the variables have two possible values ‘low’ and ‘high’. Edges between variables indicate probabilistic dependence. The tables show some probability distributions for the model. Other distributions are not shown to save space. The conditional distributions characterize dependence between variables. The edge widths visually show the strength of correlation between variables. They are not part of the model definition and are computed from the probability distributions of the model

In general, suppose there are n observed variables X 1,…,X n and m latent variables Y 1,…,Y m in an LTM. Assume the model is rooted at one of the latent variables. Denote the parent of a variable Z by parent(Z), and let parent(Z) be the empty set when Z is the root. The LTM defines a joint distribution over X 1,…,X n ,Y 1,…,Y m as follows:

$$\begin{aligned} {P(X_1,\ldots,X_n, Y_1,\ldots,Y_m)} = \prod_{Z \in\{X_1,\ldots,X_n, Y_1,\ldots,Y_m\}}P\bigl(Z \mid \mathit {parent}(Z)\bigr). \end{aligned}$$
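To make the factorization concrete, the following Python sketch evaluates the joint probability of a complete assignment for a toy model with the structure of Fig. 1, rooted at AS. The parent map and the CPT values below are made up for illustration and are not the ones shown in the figure:

```python
import numpy as np

# parent[Z] is the parent of node Z; None marks the root.
parent = {"AS": None, "LS": "AS", "MG": "AS", "SG": "AS", "EG": "LS", "HG": "LS"}

# cpt[Z][z_parent] is a distribution over the states of Z (the root is keyed by None).
# All variables are binary: state 0 = 'low', state 1 = 'high'. Values are illustrative.
cpt = {
    "AS": {None: np.array([0.6, 0.4])},
    "LS": {0: np.array([0.7, 0.3]), 1: np.array([0.3, 0.7])},
    "MG": {0: np.array([0.8, 0.2]), 1: np.array([0.2, 0.8])},
    "SG": {0: np.array([0.7, 0.3]), 1: np.array([0.3, 0.7])},
    "EG": {0: np.array([0.8, 0.2]), 1: np.array([0.2, 0.8])},
    "HG": {0: np.array([0.7, 0.3]), 1: np.array([0.3, 0.7])},
}

def joint_prob(assignment):
    """P(X_1..X_n, Y_1..Y_m) = prod over Z of P(Z | parent(Z)) for a full assignment."""
    p = 1.0
    for z, state in assignment.items():
        pa = parent[z]
        pa_state = None if pa is None else assignment[pa]
        p *= cpt[z][pa_state][state]
    return p

print(joint_prob({"AS": 1, "LS": 0, "MG": 1, "SG": 1, "EG": 0, "HG": 0}))
```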

Throughout the paper, we use the term ‘node’ interchangeably with ‘variable’, and the term ‘leaf node’ interchangeably with ‘attribute’ and ‘observed variable’. The set of attributes that are connected to a given latent variable is called a sibling cluster. Attributes in the cluster are said to be siblings. In Fig. 1, MG and SG form one sibling cluster because they are both connected to the latent node AS. Attributes EG and HG form another sibling cluster.

To learn an LTM from a data set D, one needs to determine: (1) the number of latent variables, (2) the number of states of each latent variable, which is sometimes called the cardinality of the variable, (3) the connections among the latent variables and observed variables, and (4) the probability parameters. In the following, we will use m to denote the information for the first three items and θ to denote the collection of parameter values.

To see how LTMs can be used for clustering, imagine that the model in Fig. 1 is learned from student transcript data. Then two partitions of the data are obtained, each being represented by a latent variable. In this scenario the latent variables were introduced during data analysis and are not properly named. We refer to them using their relative positions. The latent variable on the left is mainly related to Math grade and Science grade. Hence, it represents a soft partition of the students based primarily on their analytical skill. The latent variable on the right is mainly related to English grade and History grade. Hence, it represents a soft partition of the students based primarily on their literacy skill. We refer the reader to Sect. 7 for a much more detailed discussion of the use of LTMs for multi-partition clustering.

3 Previous algorithms for learning LTMs

A variety of algorithms for learning LTMs have been proposed previously. In this section, we give a brief overview of those algorithms. The reader is referred to Mourad et al. (2013) for a detailed survey.

We start with search algorithms which aim at finding the model m that maximizes the BIC score (Schwarz 1978):

$$\mathit {BIC}(m\mid \mathit {D}) = \log P\bigl(\mathit {D}\mid m,\theta^*\bigr) - \frac{d(m)}{2} \log{N}, $$

where θ ∗ is the maximum likelihood estimate of the parameters, d(m) is the number of free probability parameters in m, and N is the sample size. The first search algorithm for learning LTMs was proposed by Zhang (2004). It is capable of handling only small toy data sets. Later, two other search-based algorithms, HSHC (Zhang and Kocka 2004) and EAST (Chen et al. 2012), were developed. These two algorithms gain efficiency by reducing the number of candidate models examined during search and by reducing the time spent on evaluating the candidate models. Between the two, EAST has a more principled method for evaluating candidate models and adopts a more intelligent search strategy. It is capable of handling data sets with dozens of attributes. However, it is unable to deal with data sets with hundreds or more attributes because it still needs to evaluate a quadratic number of candidate models at each step of search.
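As a side note on implementation, the BIC score itself is trivial to evaluate once EM has produced θ ∗; the minimal sketch below takes the already-computed quantities as inputs, with `loglik_star` standing for log P(D∣m,θ ∗) and `num_free_params` for d(m) (both placeholder names, not taken from any of the cited implementations):

```python
import math

def bic_score(loglik_star, num_free_params, sample_size):
    """BIC(m | D) = log P(D | m, theta*) - d(m)/2 * log N."""
    return loglik_star - (num_free_params / 2.0) * math.log(sample_size)

# Example: a model with 25 free parameters and log-likelihood -10400 on N = 5000 samples.
print(bic_score(-10400.0, 25, 5000))
```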

Although search algorithms usually find high quality models, they are relatively slow. Greedy algorithms were consequently developed (Harmeling and Williams 2011; Mourad et al. 2011). These algorithms determine the structure of an LTM by performing agglomerative hierarchical clustering of observed variables. A latent variable is introduced for each node in the resulting dendrogram. The number of states for the latent variable is determined by considering its neighbors and using model selection criteria or some heuristics. The final model can be a binary tree, a non-binary tree, or a forest. These AC-based algorithms are efficient, and can handle data sets with as many as 10,000 observed variables (Mourad et al. 2013). As will be demonstrated in Sect. 5, however, the efficiency comes at a cost in model quality.

Phylogenetic trees (PTs) (Durbin et al. 1998) are diagrams depicting the evolutionary relationships among organisms. They can be viewed as special LTMs where all the (latent and observed) variables take the same possible values (i.e., A, C, G, T) and the conditional probability distributions are parameterized by edge lengths, which represent evolution time. Phylogenetic tree reconstruction (PTR) refers to the task of inferring the evolution tree for a collection of current species. A large number of algorithms have been developed for the task, including neighbor-joining (Saitou and Nei 1987) and quartet-based methods (Ranwez and Gascuel 2001). A key property that those algorithms exploit is that, in PTs, a notion of distance is defined between any two nodes.

Recently, new algorithms for learning LTMs have been proposed by drawing ideas from research on PTR. Choi et al. (2011) and Song et al. (2011) identify classes of LTMs where a concept of distance between nodes can be defined, while Mossel et al. (2011) and Anandkumar et al. (2011) define classes of LTMs where quartet tests can be performed. Unlike the search and AC-based algorithms introduced above, which are designed for discrete data, the PTR-motivated methods can deal with both discrete data and continuous data. For continuous data, they usually assume the data are normally distributed. In the discrete case, they require that all the observed variables have the same number of states. Theoretical results on consistency and sample complexity for PTR-motivated methods have been proved.

4 The bridged-islands algorithm

We now present a greedy algorithm for learning LTMs. The algorithm aims at obtaining flat LTMs where each latent variable is directly connected to at least one observed variable. This restriction is motivated by the observation that search algorithms almost always yield flat LTMs. The algorithm proceeds in four steps:

  1. Partition the set of attributes into sibling clusters;
  2. Introduce a latent variable for each sibling cluster;
  3. Connect the latent variables to form a tree;
  4. Refine the model based on global considerations.

If we imagine the sibling clusters formed in Step 1, together with the latent variables added in Step 2, as islands in an ocean, then the islands are connected in Step 3. So we call the algorithm the bridged-islands (BI) algorithm.

The pseudo code for BI is given in Algorithm 1. In the following four subsections, we describe the steps of BI in detail.

Algorithm 1 BI(D, δ)

4.1 Sibling cluster determination

The first step of BI consists of lines 1–18 of the pseudo code. The objective is to determine sibling clusters. To identify potential siblings, BI considers how closely correlated each pair of attributes is in terms of mutual information. The mutual information (MI) I(X;Y) (Cover and Thomas 2006) between two variables X and Y is defined as follows:

$$\begin{aligned} I(X; Y) = \sum_{X, Y}P(X, Y) \log\frac{P(X, Y)}{P(X)P(Y)}, \end{aligned}$$
(1)

where the summation is taken over all possible states of X and Y. In BI, the joint distribution P(X,Y) is estimated from data. It is the joint empirical distribution of the two variables. The MI value calculated from the empirical distribution is called the empirical MI.

To determine the first sibling cluster, BI maintains a working set S of attributes that initially consists of the pair of attributes with the highest MI (line 4). Other attributes are added to the set one by one. At each step, BI chooses to add the attribute that has the highest MI with the current set (lines 6 and 7). The MI between a variable X and a set S of variables is estimated as follows:

$$\begin{aligned} I(X;S) = \max_{Z\in S} \sum_{X,Z} P(X,Z)\log{\frac{P(X,Z)}{P(X)P(Z)}}. \end{aligned}$$
(2)
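A minimal sketch of the two estimates in (1) and (2) is given below, assuming `data` is an N×n array of discrete attribute values with 0-based states; the helper names are ours, not those of the BI implementation:

```python
import numpy as np

def empirical_mi(data, i, j):
    """Empirical mutual information I(X_i; X_j) computed from observed counts, Eq. (1)."""
    xi, xj = data[:, i], data[:, j]
    joint = np.zeros((xi.max() + 1, xj.max() + 1))
    for a, b in zip(xi, xj):
        joint[a, b] += 1
    joint /= joint.sum()                     # empirical joint distribution
    px = joint.sum(axis=1, keepdims=True)    # empirical marginals
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def mi_with_set(data, x, S):
    """Eq. (2): I(X; S) = max over Z in S of I(X; Z)."""
    return max(empirical_mi(data, x, z) for z in S)
```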

A key question is when to stop expanding the working set S. BI answers the question by performing a Bayesian statistical test to determine whether correlations among the variables in S can be properly modeled using one single latent variable (lines 8–11). The test is hence called the uni-dimensionality test or simply the UD-test. The expansion stops when the UD-test fails.

An LTM that contains only one latent variable is called a latent class model (LCM). To perform the UD-test, BI first projects the original data set D onto the working set S to get a smaller data set D′ (line 8). Then it obtains from D′ the best LCM m 1 and the best model m 2 among LTMs with 1 or 2 latent variables (line 9). The subroutines for these two tasks are given in Sect. 4.4. BI concludes that the UD-test passes if and only if one of these two conditions is satisfied: (1) m 2 contains only one latent variable, or (2) m 2 contains two latent variables and

$$\begin{aligned} \mathit {BIC}\bigl(m_2\mid D'\bigr) - \mathit {BIC}\bigl(m_1\mid D'\bigr) \leq\delta, \end{aligned}$$
(3)

where δ is a threshold parameter.

The left hand side of inequality (3) is an approximation to the natural logarithm of the Bayes factor (Kass and Raftery 1995) for comparing m 2 with m 1. In Kass and Raftery (1995), guidelines are given for the use of the Bayes factor (page 777). One set of guidelines is given in terms of twice the natural logarithm of the Bayes factor. In terms of simply the natural logarithm of the Bayes factor, the guidelines are as follows: a value between 1 and 3 is positive evidence favoring m 2, a value between 3 and 5 is strong evidence favoring m 2, and a value larger than 5 is very strong evidence favoring m 2. In our empirical evaluation, we usually set δ=3. Sometimes, we also consider δ=1, 5 or 10.
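Once the two BIC scores are available, the decision rule itself is a one-line comparison; the sketch below shows only the test in (3), with the scores assumed to come from the learnLCM and learnLTM-2L subroutines of Sect. 4.4:

```python
def ud_test_passes(bic_m1, bic_m2, m2_num_latents, delta=3.0):
    """Pass if m2 has a single latent variable, or if the approximate log
    Bayes factor BIC(m2 | D') - BIC(m1 | D') does not exceed the threshold delta."""
    if m2_num_latents == 1:
        return True
    return (bic_m2 - bic_m1) <= delta

# Example: a gap of 1.2 in favor of m2 is below the default threshold of 3,
# so the UD-test still passes and the working set keeps growing.
print(ud_test_passes(-10500.2, -10499.0, 2))   # True
```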

To illustrate the process, suppose that the working set is S={X 1,X 2} initially. Two other attributes X 3 and X 4 are added to the set one by one and the UD-test passes in both cases. Then X 5 is added. Suppose the models m 1 and m 2 obtained at line 9 for S={X 1,X 2,X 3,X 4,X 5} are as shown in Fig. 2. Further suppose the BIC score of m 2 exceeds that of m 1 by more than the threshold δ. Then the UD-test fails and BI stops growing the set S.

Fig. 2

The two models m 1 and m 2 considered in the UD-test

When the UD-test fails, the model m 2 contains two latent variables. Each gives us a potential sibling cluster. Hence there are two potential sibling clusters. BI chooses one of them at line 11. If one of the two potential sibling clusters contains both of the initial attributes, it is picked. Otherwise, BI picks the one with more attributes and breaks ties arbitrarily.

In the aforementioned example, the two potential sibling clusters are {X 1,X 2,X 4} and {X 3,X 5}. BI picks {X 1,X 2,X 4} because it contains both of the initial attributes X 1 and X 2.

After the first sibling cluster is determined, BI removes the attributes in the cluster from the data set (line 11), and repeats the process to find other sibling clusters (line 3). This continues until all attributes are grouped into sibling clusters.

4.2 Tree formation

The second step of BI is to learn an LCM for each sibling cluster (lines 19–22). One latent variable is introduced for each sibling cluster and the cardinality of the latent variable is determined. This is done using a subroutine that will be described in Sect. 4.4.

After the second step, we have a collection of LCMs. In Step 3, BI links up the latent variables of the LCMs to form a tree (lines 23–24). Chow and Liu (1968) give a well-known algorithm for learning tree-structured models among observed variables. It first estimates the MI between each pair of variables from data, then constructs a complete undirected graph with the MI values as edge weights, and finally finds the maximum spanning tree of the graph. The resulting tree model has the maximum likelihood among all tree models.

Chow-Liu’s algorithm can be adapted to link up the latent variables of the LCMs. We call the adapted algorithm (i.e., subroutine LearnCL at line 24) to learn a Chow-Liu tree among the latent variables. All LCMs are thereby connected and form an LTM (line 24). We only need to specify how the MI between two latent variables is to be estimated. Let m and m′ be two LCMs with latent variables Y and Y′ respectively. We calculate the MI I(Y;Y′) between Y and Y′ using (1) from the following joint distribution:

$$\begin{aligned} P\bigl(Y, Y'\mid D, m, m'\bigr) = C\sum _{d \in D}P(Y\mid m, d)P\bigl(Y'\mid m', d\bigr) \end{aligned}$$
(4)

where P(Y∣m,d) is the posterior distribution of Y in m given data case d, P(Y′∣m′,d) is that of Y′ in m′, and C is the normalization constant.
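The following sketch illustrates Step 3 under the assumption that a helper `posterior(model, d)` returns the posterior distribution of an LCM's latent variable given data case d (a stand-in for inference in the LCM, which is not shown). The MI of (4) is estimated for each pair of latent variables and a maximum spanning tree is then built with Kruskal's algorithm:

```python
import itertools
import numpy as np

def latent_mi(model_a, model_b, data, posterior):
    """Eq. (4): accumulate outer products of posteriors over data cases, normalize,
    then compute the MI of the resulting joint as in Eq. (1)."""
    joint = sum(np.outer(posterior(model_a, d), posterior(model_b, d)) for d in data)
    joint /= joint.sum()                       # the normalization constant C
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def chow_liu_edges(models, data, posterior):
    """Maximum spanning tree over the latent variables (Kruskal with union-find)."""
    n = len(models)
    edges = sorted(((latent_mi(models[i], models[j], data, posterior), i, j)
                    for i, j in itertools.combinations(range(n), 2)), reverse=True)
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in edges:                      # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```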

4.3 Model refinement

The sibling clusters and the cardinalities of the latent variables were determined in Step 1 and Step 2. Each of those decisions was made in the context of a small number of attributes. In Step 4 (lines 25–27), BI tries to detect possible mistakes made in those steps based on global considerations and adjust the model accordingly. Specifically, BI checks each attribute to see whether it should be relocated and each latent variable to see if its cardinality should be changed (i.e., subroutine ModelRefinement). To facilitate global considerations, BI first optimizes the probability parameters of the model resulting from Step 3 using the EM algorithm (Dempster et al. 1977). The optimized model is denoted by m ∗ (line 25).

To detect beneficial node relocations, BI completes the data using the model m ∗, re-estimates the MI between each pair of variables using the completed data, and considers adjusting connections among variables accordingly. To be specific, let X be an observed variable and Y be a latent variable. BI calculates the mutual information I(X;Y) using (1) from the following distribution:

$$\begin{aligned} P\bigl(X, Y\mid D, m^*\bigr) = \frac{1}{N}\sum _{d \in D} P\bigl(X, Y\mid m^*, d\bigr), \end{aligned}$$
(5)

where P(X,Y∣m ∗,d) is the joint posterior distribution of X and Y in m ∗ given data case d. Note that X is an observed variable. When the data set D contains no missing values, the equation can be rewritten as

$$\begin{aligned} P\bigl(X, Y\mid D, m^*\bigr) = \frac{1}{N}\sum _{d \in D} P\bigl(X\mid m^*, d\bigr) P\bigl(Y\mid m^*, d\bigr). \end{aligned}$$
(6)

With this new formula, we need to compute only the posterior distribution for each latent variable, rather than for each latent variable—observed variable combination. It is computationally more efficient. In implementation, we use (6) even when there are missing values.

Let Y be the latent variable that is currently directly connected to X, and let Y ∗ be the latent variable that has the highest MI with X. If Y ∗≠Y, then BI deems it beneficial to relocate X from Y to Y ∗. This means removing the edge between X and Y, and adding an edge between X and Y ∗.
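A sketch of the relocation check is given below. It assumes that `post_x[X]` and `post_y[Y]` hold, for every data case, the posterior distributions P(X∣m ∗,d) and P(Y∣m ∗,d) respectively (placeholders for inference in m ∗), applies (6), and then takes the arg-max over latent variables:

```python
import numpy as np

def mi_from_posteriors(px_list, py_list):
    """Eq. (6): average the outer products of the per-case posteriors, then take the MI."""
    joint = sum(np.outer(px, py) for px, py in zip(px_list, py_list)) / len(px_list)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def best_parent(x, latents, post_x, post_y):
    """Return the latent variable with the highest MI with attribute x;
    if it differs from x's current neighbor, relocating x is deemed beneficial."""
    return max(latents, key=lambda y: mi_from_posteriors(post_x[x], post_y[y]))
```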

To determine whether a change in the cardinality of a latent variable is beneficial, BI freezes all the parameters that are not affected by the change, runs EM locally (Chen et al. 2012) to optimize the parameters affected by the change, and recalculates the BIC score. The change is deemed beneficial if the BIC is increased. BI starts from the current cardinality of each latent variable and considers increasing it by one. If it is beneficial to do so, further increases are considered.

All the potential adjustments (node relocation and cardinality change) are evaluated with respect to the model m ∗. The beneficial adjustments are executed in one batch after all the evaluations. Adjustment evaluations and adjustment executions are not interleaved because that would require parameter optimization after each adjustment and hence be computationally expensive.

After model refinement, BI runs the EM algorithm on the whole model one more time to optimize the parameters (line 27).

It is well known that the EM algorithm generally converges only to a local maximum of the likelihood. To alleviate the problem of local maxima, we adopt the scheme proposed by Chickering and Heckerman (1997). The scheme first randomly generates a number α of initial values for the new parameters, resulting in α initial models. One EM iteration is run on all the models and afterwards the bottom α/2 models are discarded. Then two EM iterations are run on the remaining models and afterwards the bottom α/4 models are discarded. Then four EM iterations are run on the remaining models, and so on. The process continues until there is only one model. After that, more EM iterations are run on the remaining model, until the total number of iterations reaches a predetermined number.
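The scheme can be sketched as follows, with `em_step` and `loglik` as placeholders for one EM iteration on a model and for evaluating its log-likelihood respectively:

```python
def staged_em(initial_models, data, em_step, loglik, max_total_iters=100):
    """Multiple-restart EM in the style of Chickering and Heckerman (1997)."""
    models = list(initial_models)             # alpha randomly initialized models
    steps, total = 1, 0
    while len(models) > 1:
        for _ in range(steps):                # run 1, then 2, then 4, ... EM iterations
            models = [em_step(m, data) for m in models]
            total += 1
        models.sort(key=lambda m: loglik(m, data), reverse=True)
        models = models[: max(1, len(models) // 2)]   # discard the bottom half
        steps *= 2
    best = models[0]
    while total < max_total_iters:            # finish the remaining iterations
        best = em_step(best, data)
        total += 1
    return best
```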

4.4 Two subroutines

The BI algorithm needs two subroutines, learnLCM and learnLTM-2L. The input to learnLCM is a data set D′ with attributes S. The subroutine aims at finding the LCM for S that has the highest BIC score. To do so, it first creates an initial LCM for S and sets the cardinality of the only latent variable to 2. The parameters of the initial model are optimized by running EM and its BIC score is calculated. The subroutine then repeatedly considers increasing the cardinality of the latent variable. After each increase, the model parameters are re-optimized. The process stops when the BIC score ceases to increase.
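A sketch of this search loop is given below; `fit_lcm(data, k)` is a placeholder that builds an LCM whose single latent variable has k states, optimizes its parameters with EM, and returns the model together with its BIC score:

```python
def learn_lcm(data, fit_lcm):
    """Increase the cardinality of the latent variable until BIC stops improving."""
    best_model, best_bic = fit_lcm(data, 2)    # start with a 2-state latent variable
    k = 3
    while True:
        model, bic = fit_lcm(data, k)          # increase the cardinality by one
        if bic <= best_bic:                    # stop when the BIC score ceases to increase
            return best_model
        best_model, best_bic = model, bic
        k += 1
```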

The subroutine learnLTM-2L is more complex than learnLCM. Its pseudo code is given in Algorithm 2 to aid understanding. The objective is to find, among LTMs for S that contain 1 or 2 latent variables, the model that has the highest BIC score. The subroutine achieves the goal by searching a restricted model space. It starts with the LCM where the latent variable has only two states (line 1). At each step of the search, it generates a collection of candidate models by modifying the current model m (lines 4 and 6). Each candidate model is evaluated. The one with the highest BIC score is picked as the next model. This last step is achieved using another subroutine, pickBestModel. The search continues until the model score ceases to increase (line 15).

Algorithm 2 learnLTM-2L(D′)

When the current model m contains two latent variables, we consider only candidate models produced by the state introduction (SI) operator (line 6). Applied to a latent variable, the operator creates a new model by increasing the cardinality of that variable by one. Applying it to each of the two latent variables hence produces two candidate models.

When the current model m contains only one latent variable Y, the node introduction (NI) operator is also considered in addition to SI (line 4). This operator considers each pair of neighbors of Y. It creates a new model by introducing a new latent node Y′ to mediate between Y and the two neighbors. The cardinality of Y′ is set to be the same as that of Y. In the model m 1 shown in Fig. 2, if we introduce a new latent variable Y 2 to mediate between Y 1 and its neighbors X 3 and X 5, we get the model m 2. In this example, there are \({5 \choose 2}\) possible ways to apply the NI operator. Hence NI produces \({5 \choose 2}\) candidate models. Because there is only one latent variable, SI produces only one candidate model.

Line 8 tests whether the best candidate model m′ is produced at line 4 by the NI operator. The condition can be true at most once. When it is true, suppose m′ was obtained by introducing a new latent variable Y′ that mediates between the existing latent variable Y and two of its neighbors. Then learnLTM-2L repeatedly tries to relocate other neighbors of Y to Y′ until it is no longer beneficial to do so (lines 9–13). In the pseudo code, NR(m′,Y,Y′) stands for the collection of models that can be obtained from m′ by relocating one neighbor of Y to Y′.

4.5 Complexity analysis

The running time of BI is dominated by calls to the subroutine learnLTM-2L at line 9 and the two calls to EM on the whole model at lines 25 and 27. We analyze the complexity in terms of the following quantities: N—the sample size; n—the number of observed variables; l—the number of latent variables in the final model; c—the maximum cardinality of a latent variable; k—the maximum number of observed variables in the working set S; e—the maximum number of iterations of EM.

Let us first consider the complexity of one call to learnLTM-2L. Lines 4 and 6 in Algorithm 2 are executed no more than 2(c−2) times in total. In each of those calls, there are no more than \({k \choose2} < k^{2}\) candidate models. Line 11 is executed no more than k−2 times. In each of those calls, there are no more than k candidate models. Therefore, one call to learnLTM-2L involves no more than 2(c−2)k 2+(k−2)k<2ck 2 candidate models.

To evaluate each candidate model, we need to run EM once to optimize its parameters. A candidate model contains no more than k+2 variables. There are N samples. Since inference in trees takes linear time, each EM iteration takes O((k+2)N) time. Consequently, the time it takes to evaluate one candidate model is O(ekN).

The while-loop of BI is executed l times. In each pass through the while-loop, learnLTM-2L is called no more than k−2 times. Putting everything together, we see that the total time of all calls to learnLTM-2L is O(l⋅(k−2)⋅2ck 2⋅ekN)=O(Nleck 4). It is linear in the sample size N and the number of latent variables l. In a more careful analysis, N can be replaced by the number of distinct data cases one gets by projecting the N samples onto the working set S, which can be much smaller than N. The maximum cardinality c is usually very small relative to N and l and can be regarded as a constant. The maximum number of EM iterations e can be controlled. The term k 4 looks bad. Fortunately, k is usually much smaller than the total number of observed variables n.

Now consider the two calls to EM at lines 25 and 27. They are run on the global model, which consists of n observed variables and l latent variables. Each iteration takes O((l+n)N) time. Hence, the total time is O(2N(l+n)e). It is linear in the sample size and the number of observed variables n.

In implementation, we let EM run from multiple random starting points to avoid local maxima. It is allowed to run a large number of random starts and a large number of iterations at line 27 since the parameters of the final model are optimized here. The number of random starts and the number of iterations take smaller values in other places, e.g., in learnLTM-2L. The reason is that the models encountered in learnLTM-2L contain no more than k+2 variables, which is much fewer than in the global model. Hence convergence can be reached in much fewer iterations.

5 Empirical comparison with previous LTM learning algorithms

In this section we empirically compare BI with previous algorithms for learning LTMs.Footnote 1 As discussed in Sect. 3, previous algorithms can be divided into three groups. Representative algorithms from each of the three groups are included in the comparisons, namely the search algorithm EAST (Chen et al. 2012), the AC-based algorithm BIN (Harmeling and Williams 2011), and the PTR-motivated algorithms CLNJ and CLRG (Choi et al. 2011).

5.1 Comparisons on synthetic data

We first compare the algorithms on synthetic data.

5.1.1 The setups

One objective of our experimental work is to demonstrate the scalability of BI. To this end, we created several generative models with different numbers of observed variables. The simplest one is shown in Fig. 3(a). It consists of 3 levels of latent variables and one level of observed variables. Each latent variable has exactly 4 neighbors. The model is hence called the 4-complete model. We denote it as M4C. It contains 36 observed variables. Two other models were created by adding latent variables to levels 2 and 3 and observed variables to level 4 so that each latent variable has exactly 5 and 7 neighbors respectively. Those two models are called the 5-complete and 7-complete models and will be denoted as M5C and M7C. They contain 80 and 252 observed variables respectively.

Fig. 3

Two of the generative models: (a) shows the 4-complete model (M4C). It consists of 3 levels of latent variables and one level of observed variables. Each latent variable has exactly 4 neighbors. The total number of observed variables is 36. (b) shows a flat model obtained from M4C by adding 3 observed variables to each of the latent variables on levels 1 and 2. It is called M4CF. It contains 51 observed variables

The models M4C, M5C and M7C are not flat because latent variables at level 1 and 2 are not directly connected to observed variables. Three flat models were created by adding more observed variables to the model so that each latent variable at level 1 and 2 has the same number of observed neighbors as a latent variable at level 3. The resulting flat models are denoted as M4CF, M5CF and M7CF respectively and they contain 51, 104, and 300 observed variables. M4CF is shown in Fig. 3(b). The numbers n of observed variables in the 6 generative models are summarized in the following table:

Models   M4C   M4CF   M5C   M5CF   M7C   M7CF
n        36    51     80    104    252   300

In the six models, cardinalities of the variables (observed and latent) were set to 2. The model parameters were randomly generated so that the normalized MI (Strehl et al. 2002) between each pair of neighboring nodes is between 0.05 and 0.75. From each generative model, a training set of 5,000 samples and a testing set of 5,000 samples were obtained. Each sample contains values for all the observed variables. It does not contain values for latent variables.

Each algorithm was run on the training set 10 times. All experiments were conducted on a desktop machine. The maximum time allowed was 60 hours. All the algorithms have parameters that the user needs to set. For the previous algorithms, we used the default settings given by the authors. For BI, we always set δ at 3 except when investigating its impact (Sect. 5.1.3). For the call to EM at line 27, we used 64 random starting points and the maximum number of iterations was set at 100. For the calls to EM in other places, we used 32 random starting points and the maximum number of iterations was set at 32.

The model m learned by an algorithm from a training set is evaluated using the following metrics:

  1.

    The Robinson-Foulds (RF) distance (Robinson and Foulds 1981), which measures how much the structure of m deviates from the structure of the corresponding generative model m 0, is computed as follows:

    $$\begin{aligned} d_{RF}(m, m_0) = \frac{|C(m)-C(m_0)| + |C(m_0)-C(m)|}{2}, \end{aligned}$$
    (7)

    where C(m) denotes the set of bipartitions defined by the edges of tree m. For example, in Fig. 2, removing the edge between Y 1 and Y 2 in model m 2 separates the observed variables into the two subsets {X 1,X 2,X 4} and {X 3,X 5}. So this edge defines the bipartition X 1 X 2 X 4|X 3 X 5. The term |C(m)−C(m 0)| is the number of bipartitions that appear in C(m) but not in C(m 0), and the term |C(m 0)−C(m)| is the number of bipartitions that appear in C(m 0) but not in C(m). The sum of these two terms is the number of bipartitions that differ between tree m and tree m 0.

  2.

    The empirical KL divergence of m from the generative model m 0 is computed from the testing set D test as follows:

    $$\begin{aligned} \mathit {KL}(m_0, m|D_{\mathtt{test}}) = \frac{1}{N_{\mathtt{test}}} \biggl(\sum_{d \in D_{\mathtt{test}}} \log P(d|m_0) - \sum _{d \in D_{\mathtt{test}}} \log P(d|m)\biggr), \end{aligned}$$
    (8)

    where N test is the size of the testing set. Note that the first term inside the parentheses is the loglikelihood of m 0 on the testing set and the second is that of m. The second term measures how well the model m predicts unseen data. Both metrics are illustrated in the sketch following this list.
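The two metrics can be sketched as follows, assuming each tree is represented by its set of bipartitions (e.g., frozensets of frozensets of leaf names) and that `loglik(model, d)` is a placeholder returning log P(d∣model):

```python
def rf_distance(biparts_m, biparts_m0):
    """Eq. (7): half the size of the symmetric difference of the bipartition sets."""
    return (len(biparts_m - biparts_m0) + len(biparts_m0 - biparts_m)) / 2.0

def empirical_kl(model_0, model, test_data, loglik):
    """Eq. (8): average log-likelihood gap between the generative and learned models."""
    n = len(test_data)
    return (sum(loglik(model_0, d) for d in test_data)
            - sum(loglik(model, d) for d in test_data)) / n
```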

5.1.2 The results

Running time statistics are shown in Fig. 4. We see that BI is much more efficient than EAST. On the M4C, M4CF and M5C data sets, BI took only 9, 14 and 25 minutes while EAST took 8, 9 and 51 hours respectively. BI was about 55, 39 and 120 times faster than EAST. EAST did not finish in 60 hours on the other three data sets, while BI took 30 minutes, 1.7 hours and 2.4 hours respectively. Those results indicate that BI scales up fairly well. However, BI is not as efficient as BIN, CLNJ and CLRG. In our experiments, it was several times slower than the alternative algorithms.

Fig. 4

Running time (seconds) of the algorithms on the 6 training sets. The statistics were collected on a desktop machine. All the training sets contain the same number (5,000) of data cases, but they involve different numbers of attributes

Table 1 shows the performances of the algorithms on data sampled from the three flat generative models M4CF, M5CF and M7CF. The RF values for BIN are missing because it produced forests rather than trees, and RF is not defined for forests. On the M4CF data, EAST found the best models. The models obtained by BI are also of high quality. They are better than those produced by BIN, CLNJ and CLRG both in terms of empirical KL and RF values. The differences between BI and the three alternative algorithms are more pronounced on the M5CF and M7CF data. Those results indicate that BI was significantly better in recovering the structures of the generative models, and the models it obtained can predict unseen data much better than those produced by the three alternative algorithms.

Table 1 Performances of LTM learning algorithms on data sets from flat models

BI is restricted to find flat LTMs. How does this restriction influence the performance of BI when the generative models are not flat? To answer the question, we show in Table 2 the results on the data sampled from the three non-flat models. EAST found the best models on M5C data. In terms of empirical KL, the models obtained by BI are of similar quality as those produced by the other three alternative methods on the M4C and M5C data, and are clearly better on the M7C data. In terms of RF values, BI performed much better than the other three alternative methods in all cases. This means that, despite the restriction imposed on it, BI recovered the generative structures much better than the three alternative algorithms BIN, CLNJ and CLRG. The results on real-world data to be presented in the next subsection also show that the restriction does not put BI at a disadvantage relative to other algorithms.

Table 2 Performances of LTM learning algorithms on data sets from non-flat models

The PTR-motivated algorithms CLNJ and CLRG require that all the variables (observed and latent) have the same cardinality. The other algorithms, including BIN, EAST and BI, do not have this restriction and allow the cardinalities of variables to vary. To investigate the impact of this restriction, we created three other models by modifying M4CF, M5CF and M7CF. Specifically, the cardinalities of the latent variables at levels 1 and 3 were increased from 2 to 3. The resulting models are denoted as M4CF1, M5CF1 and M7CF1. Data were sampled from those models as before and the algorithms were run on the data. For the PTR-motivated algorithms, the cardinalities of the latent variables were automatically set to 2, since the algorithms require them to match the cardinality of the observed variables. For the other algorithms, the cardinalities were determined automatically during learning.

The results are presented in Table 3. We see that CLNJ and CLRG performed substantially worse than before relative to BI. On the M7CF1 data, for instance, the empirical KL values for CLNJ and CLRG are several times larger than that for BI. The results indicate that the restriction imposed by the PTR-algorithms put them at a severe disadvantage as compared to BI when the latent variables in the ‘true model’ have different cardinalities.

Table 3 Performances of LTM learning algorithms on data sets from flat models where cardinalities of latent variables vary

5.1.3 Impact of δ and sample size on BI

BI has one parameter that the user has to set, namely the threshold δ for the UD-test. To get some intuition about the impact of the parameter, take another look at Algorithm 1. We see that the larger δ is, the harder it is to satisfy the condition at line 10, and the longer the set S keeps expanding, which often implies larger sibling clusters.

In all the experiments reported so far, δ was set at 3 as suggested by Kass and Raftery (1995). In the context of this paper, the use of the value 3 implies that we would conclude the correlations among attributes in the set S can be properly modeled using one single latent variable if there is no strong evidence pointing to the opposite. Two other possible values 1 and 5 for δ were also suggested by Kass and Raftery (1995). The use of those values would mean, respectively, to draw the same conclusion when there is no positive or very strong evidence pointing to the opposite.

To investigate the impact of δ, we tried both 1 and 5 in addition to 3. The value 10 was also included as a reference. The results are shown in Table 4. We see that the choice of δ did not influence performance of BI significantly in terms of RF values, empirical KL, and running time.

Table 4 Impact of δ on the performance of BI. For this set of experiments, only data sampled from the model M5CF1 were used

In all the experiments reported so far, the sample size for the training set was 5,000. To investigate the impact of sample size, we sampled from the model M5CF1 two other training sets that contain 1,000 and 10,000 data cases respectively. So we have three different training sets for the model. Experiments were carried out on all three data sets. The results are shown in Table 5. It is clear that model quality increases almost monotonically with sample size, and running time increases with it more or less linearly.

Table 5 Impact of sample size on the performances of BI. For this set of experiments, only the data sampled from the model M5CF1 were used

5.2 Comparisons on real-world data

We now compare the algorithms on four real-world data sets: Coil-42 (Zhang et al. 2008a), Alarm (Mourad et al. 2013), WebKB,Footnote 2 and News-100.Footnote 3 Information about the versions of the data sets used in our experiments is shown in Table 6. Note that all attributes in WebKB and News-100 have 2 possible values, while attributes in Coil-42 and Alarm have between 2 and 9 possible values.

Table 6 Information about real-world data sets used in our experiments

We tested the algorithms on the four data sets. Note that CLNJ and CLRG require that all the observed variables have the same cardinality. As such, they are not applicable to Coil-42 and Alarm. For the applicable cases, each algorithm was run on each data set 10 times. The average running time is reported in Table 7. The quality of a learned model is measured using the BIC score on the training set and the loglikelihood on the testing set. The statistics are shown in Table 8.

Table 7 Average running time (seconds) of LTM learning algorithms on the real-world data sets
Table 8 Performances of LTM learning methods on the real world data

We see that EAST found the best models on the Coil-42 data in terms of both BIC score and loglikelihood. For the Alarm data, the models found by EAST also have higher BIC scores than those obtained by BI, indicating that they fit the training data better. However, their loglikelihood scores on the test set of the Alarm data are slightly lower than those of BI. For the two data sets that EAST can handle, it took 11 hours to process the Coil-42 data and 1.7 hours to process the Alarm data, while BI took only about 9 minutes and 3 minutes respectively. BI processed the News-100 and WebKB data within 1 hour each, while EAST did not finish on them in 60 hours.

In comparison with BIN, CLNJ and CLRG, BI was several times slower. However, the models that it obtained are much better in terms of both BIC score on training data and loglikelihood on test data. Those results indicate that the restriction to flat models does not put BI at a disadvantage relative to the alternative algorithms when it comes to the analysis of real-world data. However, the restriction that all attributes must have the same cardinality makes the PTR-motivated algorithms inapplicable to real-world data sets such as Coil-42 and Alarm.

5.3 Summary

We have compared BI with previous algorithms for learning LTMs on a host of data sets. The number of attributes ranges from dozens to hundreds. In comparison with the search algorithm EAST, the advantage of BI is efficiency. While EAST could handle only small data sets (M4C, M4CF, M5C, Coil-42 and Alarm), BI was able to deal with all the data sets. On those small data sets, BI found models of similar quality as those obtained by EAST. In comparison with the AC-based algorithm BIN and the PTR-motivated algorithms CLNJ and CLRG, the advantage of BI is model quality. In fact, BI found better models in most cases. On the other hand, it was several times slower.

6 Multi-partition clustering

Clustering is a primary and traditional task in knowledge discovery and data mining. Conventional clustering methods usually assume that there is only one true clustering in a data set. This assumption does not hold for many real-world data which are multifaceted and can be meaningfully clustered in multiple ways. For example, a student population can be clustered in one way based on course grades and in another way based on extracurricular activities. Movie reviews can be clustered based on both sentiment (positive or negative) and genre (comedy, action, war, etc.). The respondents in a social survey can be clustered based on demographic information or views on social issues.

In this section, we briefly survey previous methods proposed to produce multiple clusterings. We refer to them as multi-partition clustering (MPC) methods. MPC is not to be confused with multi-view clustering (Bickel and Scheffer 2004), which makes use of multiple views of data to improve the quality of one single clustering solution. MPC methods, according to the way that partitions are found, can be divided into two categories: sequential MPC methods and simultaneous MPC methods.

Sequential MPC methods produce multiple partitions sequentially. One such method is known as alternative clustering (Cui et al. 2007; Gondek and Hofmann 2007; Qi and Davidson 2009; Bae and Bailey 2006). It aims at discovering a new clustering that is different from a previously known clustering. The key issue is how to ensure the novelty of the new clustering with respect to the previous clustering. Gondek and Hofmann (2007) adopt the information bottleneck framework and maximize the mutual information between the new clustering and the attributes, conditioned on the previous clustering. Bae and Bailey (2006) suggest that two objects placed into the same group in the previous clustering should be pushed into different groups in the new clustering. Cui et al. (2007) project data into a space that is orthogonal to the previous solution and find a new clustering in the projected space. Qi and Davidson (2009) first transform data with respect to the previous clustering and then cluster the projected data. During transformation, constraints are imposed to pull data points away from their original clusters.

Simultaneous MPC methods, on the other hand, produce multiple partitions simultaneously. Both distance-based and model-based methods have been proposed. The distance-based methods require as inputs the number of partitions and the number of clusters in each partition. They try to optimize the quality of each individual partition while keeping different partitions as dissimilar as possible. Jain et al. (2008) propose a method to find two clusterings simultaneously by iteratively minimizing the sum-squared errors of each clustering along with the correlation between the two clusterings. Niu et al. (2010) try to find multiple clusterings by optimizing an objective function that has one spectral-clustering term for each clustering, plus a term that penalizes the similarity among different clusterings. Dasgupta and Ng (2010) use the suboptimal solutions of spectral clustering as multiple clusterings.

Model-based MPC methods fit data with probabilistic models that contain multiple latent variables. Each latent variable represents a soft partition and can be viewed as a hidden dimension of data. Because of this, model-based simultaneous MPC is sometimes also called multidimensional clustering (Chen et al. 2012). Unlike distance-based methods, model-based methods can automatically determine the number of partitions and the number of clusters in each partition based on statistical principles. Galimberti and Soffritti (2007) determine the number of partitions and the composition of each partition through greedy search guided by a scoring function. Guan et al. (2010) propose a nonparametric Bayesian approach. Prior distributions are placed on both attribute partitions and object partitions using Dirichlet processes. The number of partitions and the partition compositions are determined from the posterior distributions. Both Galimberti and Soffritti (2007) and Guan et al. (2010) assume that each latent variable, which gives one facet of data, is associated with a distinct subset of attributes, and that the latent variables are mutually independent. In contrast, Chen et al. (2012) use LTMs, where the latent variables are not assumed independent and the dependence among them is explicitly modeled.

Among the methods mentioned above, the method proposed by Chen et al. (2012) is designed for discrete data. Bae and Bailey (2006) presented two versions of their algorithm which can handle categorical and continuous data separately. The methods presented in Gondek and Hofmann (2007) can handle binary data and continuous data. The other algorithms mainly work with continuous data.

MPC is related to another research area known as subspace clustering (Parsons et al. 2004). Here, we use the term subspace clustering in a general manner. It includes pattern-based clustering, bi-clustering, and correlation clustering. Kriegel et al. (2009) give a good survey on these topics. Subspace clustering attempts to find all clusters in all subspaces. A cluster is defined as a group of data points that are densely clustered in the subspace determined by a subset of attributes. To explain the relationship between subspace clustering and MPC, we note that the clusters resulting from MPC can be arranged into columns such that clusters in each column constitute a partition of all the data points and all can be characterized using the same subset of attributes. In contrast, subspace clustering imposes no such constraints on the collection of clusters it produces. It usually yields a large number of clusters.

7 LTMs for multi-partition clustering

In this section, we demonstrate with an example how LTMs can be used for MPC and discuss some practical issues.

The example we use is the WebKB data used in Sect. 5.2. It consists of web pages collected in 1997 from the computer science departments of 4 universities, namely Cornell, Texas, Washington and Wisconsin. The web pages were originally divided into seven categories. Categories that contain very few web pages or do not have clear class labels were removed in preprocessing. Only four categories of web pages remain, namely student, faculty, project and course. There are 1041 pages in total in the collection. Stop words and words with a low occurrence frequency were also removed in preprocessing. Afterwards, 336 distinct words remain in the whole collection. Each of them is regarded as an attribute. It takes value 1 when it appears in a web page and 0 otherwise.

The WebKB data has two class labels. One identifies the university from which a web page is collected, and the other denotes the category to which it belongs. The class labels were removed prior to analysis.

Performing latent tree analysis on a data set can result in a large number of latent variables. Figure 5 shows part of the model learned from the WebKB data using the BI algorithm. There are 75 latent variables in the model. Theoretically, each latent variable represents a soft partition of the data. From the application point of view, several questions arise: (1) What exactly constitutes a partition? (2) What is the meaning of the partition? (3) How can one quickly identify interesting partitions among all the partitions? A software tool called Lantern Footnote 4 has been created to help find answers to those questions.

Fig. 5

Part of the latent tree model learned from the WebKB data using BI

The meaning of a latent variable is determined based on how it is related to the observed variables. Lantern allows one to see which observed variables are closely correlated with a latent variable using the concept of information curves. We introduce the concept through an example. Consider the latent variable Y 28 from the WebKB model. It has 4 values and hence represents a soft partition of the data into four clusters. We will refer to the partition as the Y 28 partition. The information curves of Y 28 are shown in Fig. 6(a). There are two curves. Suppose there are n attributes in total. The lower curve shows the pairwise mutual information (PMI) I(Y 28;X i ) between Y 28 and each attribute X i (i=1,2,…,n). The attributes are sorted in decreasing order of PMI. We see that Y 28 is most highly correlated with the word variables programming, oriented, …, in that order. This means that the clusters in the Y 28 partition differ the most on those variables.

Fig. 6

(a) Information curves of latent variable Y 28. The left Y-axis indicates the values of mutual information, while the right Y-axis indicates the value of information coverage. (b) Occurrence probability of the top words in the four clusters

The upper curve shows the cumulative mutual information (CMI) I(Y 28;X 1X i ) (i=2,3,…,n) between Y 28 and the first i attributes. It increases monotonically with i. Intuitively, the CMI value is a measure of how much the differences among the clusters in the Y 28 partition are captured by the first i attributes. In particular, I(Y;X 1X n ) is a measure of how much the differences among the clusters are captured by all the attributes. The ratio

$$\begin{aligned} I(Y; X_1{-}X_i)/I(Y; X_1{-}X_n) \end{aligned}$$
(9)

is called the information coverage (IC) of the first i attributes.Footnote 5 It is a relative measure of how much of the characteristics of the Y 28 partition is captured by the first i attributes. The information coverage of the 8 word variables shown in Fig. 6(a) exceeds 95 %. Hence we can say that the partition is primarily based on those variables. It is then clear that the partition is about the occurrence/absence of words related to programming.
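The quantities behind the information curves can be sketched as follows, with `mi(vars)` as a placeholder that returns the mutual information between the latent variable and a set of attributes, computed from the learned model:

```python
def information_curves(attrs, mi):
    """Return attributes sorted by PMI, their PMI values, the CMI values,
    and the information coverage of Eq. (9)."""
    ordered = sorted(attrs, key=lambda x: mi([x]), reverse=True)  # sort by pairwise MI
    pmi = [mi([x]) for x in ordered]                              # lower curve
    total = mi(ordered)                                           # I(Y; X_1 - X_n)
    cmi = [mi(ordered[: i + 1]) for i in range(len(ordered))]     # upper curve
    coverage = [c / total for c in cmi]                           # Eq. (9)
    return ordered, pmi, cmi, coverage
```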

Lantern also allows one to see the statistical characteristics of each cluster by giving its class conditional probability distributions (CCPDs). The CCPDs for the four clusters in the Y 28 partition are given in Fig. 6(b). The value 0.77 at line 2 and column 3, for instance, is the probability that the word programming appears in the cluster Y 28=2. We see that the words programming, oriented and object occur with high probability in the cluster Y 28=4. Hence we interpret it as the collection of web pages on object-oriented programming (OOP). In the cluster Y 28=2, the word programming occurs with high probability, but oriented and object almost do not occur at all. Hence we interpret it as a collection of web pages on programming but not OOP. Those might be web pages of OOP courses and of introductory programming courses respectively. The cluster Y 28=1 seems to correspond to web pages of other courses that involve programming, while Y 28=3 seems to mean web pages not on programming.

Further, Lantern allows one to check the membership of each cluster. It facilitates both soft assignment and hard assignment. In soft assignment, it shows the posterior distributions of latent variables for each data case. In hard assignment, it assigns each data case to the value of a latent variable (which represents a cluster) that has the maximum posterior probability.

Using Lantern, we were able to quickly identify interesting partitions given by the WebKB model. Table 9 shows two examples in detail. We see that Y 55 represents a partition based on the occurrence/absence of words such as networks, communication, neural, protocols, artificial and intelligence. The cluster Y 55=1 seems to correspond to web pages on neural networks, while Y 55=2 seems to correspond to web pages on networking and high-performance computing. Similarly, Y 57=2 seems to correspond to web pages related to AI, while Y 57=1 seems to identify web pages that are not related to AI.

Table 9 Word variables that have highest MI with Y 55 and Y 57 and their distributions in the clusters given by the two latent variables

Table 10 shows more intuitively interesting latent variables. CCPDs are not given here to save space. The latent variables Y 73 and Y 49 represent partitions based on words that usually appear on faculty homepages; Y 69, Y 71 and Y 59 represent partitions based on words that distinguish different research areas; Y 10, Y 47, Y 46 and Y 35 represent partitions based on words that identify the four universities from which the data were collected. Finally, Y 6 represents a partition based on words that usually appear on course web pages.

Table 10 More latent variables that are intuitively meaningful. For each latent variable, a list of word variables are given. Those are the variables that have the highest MI with the latent variables. In each case, the information coverage of the variables listed exceeds 95 %

8 Comparisons of MPC methods on labeled data

In this and the next section, we compare BI, as an MPC algorithm, with previous MPC algorithms on labeled data and unlabeled data respectively. Three previous MPC algorithms are included in the comparisons: Orthogonal Projection (OP) (Cui et al. 2007), Singular Alternative Clustering (SAC) (Qi and Davidson 2009) and DK-means (DK) (Jain et al. 2008). OP and SAC are representative sequential MPC algorithms, while DK is a representative distance-based simultaneous MPC algorithm.

In this section, we compare the algorithms on the WebKB data. Each data case in this data set has two class labels. Hence there are two true partitions. The first partition divides the data into 4 classes according to where the web pages were collected. The four classes are Cornell, Texas, Washington and Wisconsin. The second partition divides the data into another 4 classes according to the nature of the web pages. The four classes are student, faculty, project and course.

In our experiments, the class labels were first removed, and the algorithms were then run on the resulting unlabeled data. DK can produce only two partitions; it was instructed to find two partitions, each with four clusters. For OP, following the authors, we first ran K-means (MacQueen 1967) to find a partition of four clusters, and then ran OP to find one more partition with four clusters. The same was done for SAC. So all three alternative methods were told to find two partitions, each with four clusters.
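
As an illustration of the first step of this setup, the following is a minimal sketch of running K-means with four clusters on a bag-of-words representation of the pages; the document list and the vectorization choices are assumptions, and OP and SAC are omitted because they have no standard library implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the WebKB pages; in the real experiment the full
# collection of web pages would be used.
documents = [
    "object oriented programming course syllabus",
    "neural networks research group publications",
    "faculty homepage department of computer science",
    "course homework assignments lecture notes",
]

# Binary bag-of-words representation (word occurrence/absence).
X = CountVectorizer(binary=True).fit_transform(documents)

# First partition: four clusters found by K-means; OP (or SAC) would then be
# run to find one additional partition with four clusters.
first_partition = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(first_partition)
```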

Table 11 shows some information about the partitions. We see that the partitions obtained by the three algorithms are similar in terms of the variables that they rely on. One of the partitions is primarily based on words that often appear on course web pages, while the other is primarily based on words that identify the four universities.

Table 11 Information about partitions obtained by DK, OP and SAC on the unlabeled WebKB data. Top 20 word variables that have the highest MI with the partitions are shown. The information coverage of these variables exceeds 80 % in all cases

We compare the performance of BI and the alternative algorithms quantitatively by considering how well they have recovered the ground truth. It is clear from Table 10 that BI has recovered the four university classes. However, they are given in the form of four latent variables instead of one. For comparability, we transform the true university partition into four logically equivalent binary partitions. Each binary partition divides the web pages according to whether they are from a particular university. The same transformation is applied to the other true partition. After the transformation, we calculate the normalized mutual information (NMI) (Strehl et al. 2002) between each true binary partition C and each latent variable Y as follows:

$$\mathit{NMI}(C;Y) = \frac{I(C;Y)}{\sqrt{H(C)H(Y)}}$$

where I(C;Y) denotes the mutual information between C and Y, and H(.) denotes the entropy of a variable. NMI ranges from 0 to 1, with a higher value meaning a closer match between C and Y. We then match each true binary partition with the latent variable with which it has the highest NMI, and use the NMI between the pair as a measure of how well BI has recovered the partition. The results are shown in the last column of Table 12. The average was taken over 10 runs.
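
The computation is straightforward to reproduce. Below is a minimal sketch, assuming the true partition and a latent variable are available as label sequences over the same data cases; the helper names and the toy labels are illustrative, not part of the original experiments.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Entropy H(X) of a discrete label sequence.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    # Mutual information I(X;Y) estimated from the empirical joint distribution.
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def nmi(c, y):
    # NMI(C;Y) = I(C;Y) / sqrt(H(C) H(Y)), ranging from 0 to 1.
    return mutual_information(c, y) / np.sqrt(entropy(c) * entropy(y))

def to_binary_partitions(labels):
    # Turn a multi-class partition into logically equivalent binary partitions,
    # one per class (member of that class vs. not).
    labels = np.asarray(labels)
    return {cls: tuple((labels == cls).astype(int)) for cls in np.unique(labels)}

# Toy example (hypothetical labels): a 4-class university partition and one
# latent variable over the same six data cases.
true_labels = ["cornell", "texas", "texas", "wisconsin", "cornell", "washington"]
latent_Y = [1, 2, 2, 3, 1, 4]

# With a single latent variable, we simply report its NMI with each true binary
# partition; matching would pick the latent variable with the highest NMI.
for cls, binary in to_binary_partitions(true_labels).items():
    print(cls, round(nmi(binary, latent_Y), 3))
```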

Table 12 NMI values between the true binary partitions and the closest partitions obtained by each algorithm. They show how well the algorithms have recovered the true partitions

Each of the partitions found by the alternative methods consists of four clusters. For comparability, it is also transformed into four logically equivalent binary partitions. Those binary partitions are matched up with the true binary partitions, and the NMI values between the matched pairs are also shown in Table 12.

We see that the NMI values for BI are the highest for 6 of the 8 true binary partitions. They are significantly higher than those for the alternative methods with regard to course, faculty, student, texas and wisconsin. This means that BI has recovered those true binary partitions much better than the other methods. The performance of BI is only slightly worse than that of OP with regard to the binary partitions cornell and washington.

To get a global picture, compare the results given in Table 11 with those presented in the previous section. It is clear that BI has found many more interesting partitions than the alternative algorithms. None of the alternative methods have found partitions similar to those represented by Y 28, Y 55, Y 57, Y 69, Y 71 and Y 59.

9 Comparisons of MPC methods on unlabeled data

In this section we compare BI and other MPC algorithms on an unlabeled data set known as the ICAC data. The data set originates from a survey conducted in 2004 of public opinion on corruption in Hong Kong and on the performance of ICAC, the city’s anti-corruption agency. After preprocessing, the data set consists of 31 attributes and 1200 records. The attributes correspond to questions in the survey.

9.1 Results on the ICAC data by BI

On the ICAC data, BI produced an LTM with 7 latent variables. It will be referred to as the ICAC model. The structure is shown in Fig. 7. Basic information about the latent variables is given in Table 13. For each latent variable, we list the observed variables that are highly correlated with it.

Fig. 7 The structure of the LTM learned by BI from the ICAC data. The width of the edges represents the strength of probabilistic dependence. Abbreviations: C—Corruption, I—ICAC, Y—Year, Gov—Government, Bus—Business Sector. Meanings of attributes: Tolerance-C-Gov means ‘tolerance towards corruption in the government’; C-City means ‘level of corruption in the city’; C-NextY means ‘change in the level of corruption next year’; I-Effectiveness means ‘effectiveness of ICAC’s work’; I-Powers means ‘ICAC powers’; etc.

Table 13 Information about the latent variables in the ICAC model. For each latent variable, a list of observed variables is given. Those are the variables that have the highest MI with the latent variables. In each case, the information coverage of the variables listed exceeds 95 %

It is clear from Table 13 that Y 1 represents a partition of the data based on people’s view on change of corruption level; Y 2 represents a partition based on demographic information; Y 3 represents a partition based on people’s tolerance toward corruption; Y 4 represents a partition based on people’s view on the accountability of ICAC; Y 5 represents a partition based on people’s view on the performance of ICAC; Y 6 represents a partition based on people’s view on corruption level; and Y 7 represents a partition based on people’s view on the economy.

Let us examine the partitions given by Y 2 and Y 3. Y 2 has four states and hence it represents a partition of 4 clusters. The CCPDs are given in Table 14. We see that 99 % of the people in cluster Y 2=4 are under the age of 24, and their average income is quite low. So Y 2=4 can be interpreted as youngsters. Cluster Y 2=2 consists only of women whose income is either none (s 0) or low. So it can be interpreted as women with no/low income. For the remaining two clusters, the people in Y 2=1 have higher education and higher income than the people in Y 2=3. Hence Y 2=1 can be interpreted as people with good education and good income, and Y 2=3 as people with poor education and average income.

Table 14 Information about the partition represented by Y 2. The information coverage of the four variables shown is 98 %. The states of variables are: Income: s 0 (none), s 1 (less than 4k), s 2 (4–7k), s 3 (7–10k), s 4 (10–20k), s 5 (20–40k), s 6 (more than 40k); Age: s 0 (15–24), s 1 (25–34), s 2 (35–44), s 3 (45–54), s 4 (above 55); Education: s 0 (none), s 1 (primary), s 2 (Form 1–3), s 3 (Form 4–5), s 4 (Form 6–7), s 5 (diploma), s 6 (degree); Sex: s 0 (male), s 1 (female)

Y 3 has three states and hence it represents a partition of 3 clusters. The CCPDs are given in Table 15. It is clear that the three latent classes Y 3=1, Y 3=2, and Y 3=3 can be interpreted as classes of people who find corruption tolerable, intolerable and totally intolerable respectively.

Table 15 Information about the partition represented by Y 3. States of the two observed variables: s 0 (totally intolerable), s 1 (intolerable), s 2 (tolerable), and s 3 (totally tolerable). The information coverage for these two attributes is around 99 %

In addition to identifying interesting partitions, BI also determines relationships between different partitions. As an example, consider the relationship between Y 2 and Y 3. The conditional probability P(Y 3 | Y 2) is shown in Table 16. We see that Y 2=1 (people with good education and good income) is the class with the least tolerance towards corruption. In fact, 93 % (31 % + 62 %) of the people in this class think corruption is intolerable or totally intolerable (Y 3=2 or 3). On the other hand, Y 2=3 (people with poor education and average income) is the class with the most tolerance towards corruption. In fact, 39 % of the people in that class find corruption tolerable (Y 3=1). The other two classes fall between these two extremes.
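
In equation form, the 93 % figure quoted for Y 2=1 is simply the sum of two entries in the conditional distribution of Table 16:

$$P(Y_3 \in \{2,3\} \mid Y_2=1) = P(Y_3=2 \mid Y_2=1) + P(Y_3=3 \mid Y_2=1) = 0.31 + 0.62 = 0.93$$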

Table 16 The conditional probability distribution P(Y 3 | Y 2)

9.2 Results on the ICAC data by other MPC algorithms

The ICAC data contain missing values, but the MPC algorithms DK, OP and SAC cannot handle them. Two options were tried to deal with this issue. In the first option, data cases with missing values were simply removed. In the second option, each missing value was replaced with the most prevalent value of the corresponding attribute. This resulted in two complete versions of the data set, consisting of 523 and 1,200 records respectively.
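
For concreteness, the two options correspond to the following operations in pandas; this is a minimal sketch, and the file name and the assumption that missing entries are read as NaN are hypothetical.

```python
import pandas as pd

# Hypothetical file name; missing survey responses are assumed to be read as NaN.
df = pd.read_csv("icac_survey.csv")

# Option 1: remove every data case that has at least one missing value.
complete_cases = df.dropna()

# Option 2: replace each missing value with the most prevalent value
# (the mode) of the corresponding attribute.
imputed = df.fillna(df.mode().iloc[0])

# In the paper these two options yield 523 and 1,200 records respectively.
print(len(complete_cases), len(imputed))
```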

The three alternative MPC algorithms were run on each of the two complete versions of the data. As in the previous section, the algorithms were instructed to find two partitions. Several options were tried with regard to the number of clusters in each partition. Table 17 shows information about the resulting partitions that we deem the most reasonable.

Table 17 Information about the partitions found by DK, OP and SAC on the ICAC data. For each partition, we show the variables that have the highest MI with the partition. In each case, the information coverage of the variables listed exceeds 95 %

Comparing Table 17 with Table 13, we see that the partitions found by the alternative methods are not as meaningful as those found by BI. Take DK-1 from the top table as an example. This partition is based on the demographic variables age and income, variables that reflect people’s confidence in ICAC (e.g., I-Confidence and I-PowerAbused), variables that reflect people’s view on the performance of ICAC (I-Effectiveness and I-Deterrence), and people’s view on the prevalence of corruption in the city (C-City). Because it mixes such heterogeneous variables, it is not clear what the partition is about. In contrast, all the partitions given by BI on the ICAC data seem meaningful. It appears to us that BI has discovered most, if not all, of the meaningful ways to cluster the data.

10 Conclusions

Many real-world data sets are multifaceted and can be meaningfully clustered in multiple ways. There is growing interest in methods that produce multiple partitions of data. Latent tree models (LTMs) are a promising tool for multi-partition clustering. When one analyzes data using an LTM, one typically obtains a model with multiple latent variables. Each latent variable represents a soft partition of the data, and hence multiple partitions are obtained. As a model-based approach, the LTM-based method can automatically determine, through model selection, how many partitions there should be, which attributes constitute each partition, and how many clusters each partition should contain. The relationships between different partitions can also be inferred from the model.

In this paper, we propose a novel greedy algorithm called BI for learning LTMs. It is significantly more efficient than the state-of-the-art search algorithm EAST. While EAST can only handle data sets with dozens of attributes, BI can deal with data sets with hundreds of attributes. On data sets with dozens of attributes, BI learns models of similar quality to those obtained by EAST.

BI is not as efficient as AC-based algorithms and PTR-motivated algorithms, which can handle thousands of observed variables. However, those algorithms place restrictions on cardinalities of latent and observed variables or connections between variables. On data sets with hundreds of attributes, BI finds better models than the AC-based algorithms and PTR-motivated algorithms.

When used as a tool for multi-partition clustering, BI performs much better than previous methods for the same task. It yields more meaningful and richer clustering results than the alternative methods.