Semi-supervised projected model-based clustering

Abstract

We present an adaptation of model-based clustering for partially labeled data that is capable of discovering hidden cluster labels. Both the originally known clusters and the discoverable ones are represented using localized feature subset selections (subspaces), yielding clusters that cannot be found by global feature subset selection. The semi-supervised projected model-based clustering algorithm (SeSProC) also includes a novel model selection approach that uses a greedy forward search to estimate the final number of clusters. The quality of SeSProC is assessed on synthetic data, demonstrating its effectiveness, under different data conditions, not only at classifying instances with known labels but also at discovering completely hidden clusters in different subspaces. Moreover, SeSProC outperforms three related baseline algorithms in most scenarios on both synthetic and real data sets.

Notes

  1. Note that “class”, “component”, and “cluster” are equivalent concepts at the end of the classification, but each concept is used here to refer, respectively, to a priori knowledge about the instances (classes), mixture components (components), or identified groups (clusters).

  2. Note that, for legibility, the notation related to iterations is used with \(\varTheta \), but not with \(\varvec{\theta }\) throughout the paper.

  3. Note that, in theory, the classification term only iterates until \(m = C\), but we can assume that this iteration finishes at \(m=C+1\) with \(z_{i,C+1} = 0\), \(\forall i = 1,\ldots ,L\).

Acknowledgments

This research is partially supported by the Spanish Ministry of Economy and Competitiveness projects TIN2010-20900-C04-04 and TIN2010-21289-C02-02, the Cajal Blue Brain project, and Consolider Ingenio 2010-CSD2007-00018. The authors gratefully acknowledge the computer resources, technical expertise and assistance provided by the Centro de Supercomputación y Visualización de Madrid (CeSViMa). The authors are also very grateful for the useful comments and suggestions made by the anonymous reviewers, which have definitely contributed to improving the manuscript.

Author information

Corresponding author

Correspondence to Luis Guerra.

Additional information

Communicated by Charu Aggarwal.

Appendices

1.1 Appendix 1: Basic EM theory

The density function of an instance \(\mathbf x _{i}\) is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta ) = \sum _{m=1}^K\pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m}). \end{aligned}$$

We can define a binary random vector \(\mathbf z _{i} = (z_{i1}, \ldots ,z_{iK})\), with \(z_{im} = 1\) if instance \(\mathbf x _{i}\) belongs to component \(m\) and all other elements \(z_{im^{\prime }} = 0\), \(\forall \) \(m^{\prime } \ne m\). Moreover, \(p(z_{im} = 1) = \pi _{m}\). Therefore, we can write

$$\begin{aligned} p(\mathbf z _{i}) =\prod _{m=1}^K \pi _{m}^{z_{im}}. \end{aligned}$$
(10)

Also, \(p(\mathbf x _{i}\mid z_{im}=1) = p(\mathbf x _{i}\mid \varvec{\theta }_{m})\), which, extended to the whole vector \(\mathbf z _{i}\), is

$$\begin{aligned} p(\mathbf x _{i}\mid \mathbf z _{i}, \varTheta ) =\prod _{m=1}^K p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}}. \end{aligned}$$
(11)

Using Eqs. (10) and (11), Eq. (1) is obtained by summing over all possible states of \(\mathbf z _{i}\)

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )&= \sum _\mathbf{z _{i}} p(\mathbf x _{i}, \mathbf z _{i} \mid \varTheta )= \sum _\mathbf{z _{i}} p(\mathbf z _{i})p(\mathbf x _{i}\mid \mathbf z _{i}, \varTheta ) \\&= \sum _\mathbf{z _{i}}\left( \prod _{m=1}^K \pi _{m}^{z_{im}}\prod _{m=1}^K p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}}\right) \\&= \sum _{m=1}^K\pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m}). \end{aligned}$$

This mixture of distributions has unknown parameters in \(\varTheta \) that must be estimated, which can be done by maximum likelihood. Therefore, assuming that the instances are independent and identically distributed (i.i.d.), and building the log-likelihood function (\(\log L\)) from Eq. (1) over all the instances, we obtain

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X })&= \log p(\mathcal{X }|\varTheta ) \\&= \log \prod _{i=1}^N p(\mathbf x _{i} \mid \varTheta ) \\&= \sum _{i=1}^N\log \left( \sum _{m=1}^K \pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m})\right) . \end{aligned}$$

This log-likelihood function is difficult to maximize because the summation over the components is inside the logarithm function. The log-likelihood function would change if both the latent variables (\(\mathcal{Z }\)) and the observable data (\(\mathcal{X }\)) were known. Then, based on Eqs. (10) and (11), we can define the complete-data log-likelihood function as

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X },\mathcal{Z })&= \log p(\mathcal{X },\mathcal{Z }|\varTheta ) \nonumber \\&= \log \prod _{i=1}^N \prod _{m=1}^K \pi _{m}^{z_{im}}p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}} \nonumber \\&= \sum _{i=1}^N \sum _{m=1}^K z_{im} \left( \log \pi _{m} + \log p (\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) . \end{aligned}$$
(12)

The maximization of this complete-data log-likelihood function is straightforward because the summation is outside the logarithm. Since the latent variables are unknown, we cannot use this function directly. However, we can take the expectation of this log-likelihood function with respect to the posterior distribution of the latent variables. This expectation is calculated at iteration \(t\), with the parameters fixed at their values from the previous iteration \(t-1\), in the E-step of the EM algorithm. After this, the parameters of the distributions are recalculated to maximize this expectation (M-step). These two steps are repeated until a convergence criterion is reached. Hence, the expectation of the complete-data log-likelihood function is given by

$$\begin{aligned} \mathcal{Q }(\varTheta ,\varTheta ^{t-1})&= \mathbb{E }_{\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z })] \nonumber \\&= \sum _\mathcal{Z } p(\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}) \log p(\mathcal{X },\mathcal{Z }\mid \varTheta ), \end{aligned}$$
(13)

where, using Eq. (12), the posterior distribution of the latent variables given the data and the parameters of the previous iteration \(t-1\) is

$$\begin{aligned} p(\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}) \propto \prod _{i=1}^N \prod _{m=1}^K\left( \pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) ^{z_{im}}. \end{aligned}$$
(14)

This factorizes over \(i\) so that the \(\{ \mathbf z _{i} \}\) in this distribution are independent. Using this posterior distribution and Bayes’ theorem, we can calculate the expected value of each \(z_{im}\) (responsibility) as

$$\begin{aligned} \mathbb{E }_{z_{im}\mid \mathbf x _{i},\varvec{\theta }_{m}}[z_{im}]&= \gamma (z_{im}) \\&= \frac{\sum _{z_{im}} z_{im}(\pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m}))^{z_{im}}}{\sum _{z_{im^{\prime }}}(\pi _{m^{\prime }}p(\mathbf x _{i}\mid \varvec{\theta }_{m^{\prime }}))^{z_{im^{\prime }}}} \\&= \frac{\pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m})}{\sum _{m^{\prime }=1}^K\pi _{m^{\prime }}p(\mathbf x _{i}\mid \varvec{\theta }_{m^{\prime }})}\\&= p(z_{im}=1 \mid \mathbf x _{i}, \varvec{\theta }_{m}), \end{aligned}$$

which we can use to calculate the expectation of the complete-data log-likelihood, as

$$\begin{aligned} \mathbb{E }_{\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z })] = \sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \left( \log \pi _{m} + \log p (\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) . \end{aligned}$$
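
As an illustration of the quantities above, the following minimal Python sketch computes the responsibilities \(\gamma (z_{im})\) and evaluates the expectation of the complete-data log-likelihood for a mixture with one univariate Gaussian per feature and component. It is only a sketch under that Gaussian assumption, not the paper's implementation; all names (log_component_densities, responsibilities, expected_loglik, X, pi, mu, sigma2) are illustrative.

```python
# Minimal sketch of the E-step quantities in Appendix 1 for a mixture with one
# univariate Gaussian per feature and component (diagonal model).
import numpy as np

def log_component_densities(X, mu, sigma2):
    """log p(x_i | theta_m) as a sum of univariate Gaussian log-densities over features."""
    # X: (N, F), mu/sigma2: (K, F)  ->  result: (N, K)
    quad = (X[:, None, :] - mu) ** 2 / sigma2
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + quad, axis=2)

def responsibilities(X, pi, mu, sigma2):
    """gamma(z_im) = pi_m p(x_i | theta_m) / sum_m' pi_m' p(x_i | theta_m')."""
    log_num = np.log(pi) + log_component_densities(X, mu, sigma2)
    return np.exp(log_num - np.logaddexp.reduce(log_num, axis=1, keepdims=True))

def expected_loglik(X, pi, mu, sigma2):
    """Expectation of the complete-data log-likelihood (last equation of Appendix 1)."""
    gamma = responsibilities(X, pi, mu, sigma2)
    return np.sum(gamma * (np.log(pi) + log_component_densities(X, mu, sigma2)))
```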

1.2 Appendix 2: Including subspaces

Defining, for each component and feature, \(\rho _{mj} = p(v_{mj}=1)\), the probability of feature \(j\) being relevant to component \(m\), the new density function, including the search for subspaces, is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )= \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})\Bigr ). \end{aligned}$$
(15)

To derive this new density function, we first write, for a component \(m\) and an instance \(i\),

$$\begin{aligned} p(\mathbf v _{m}\mid z_{im} = 1) = \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}. \end{aligned}$$

This can be extended for all components as

$$\begin{aligned} p(\mathcal{V }\mid \mathbf z _{i}) = \prod _{m=1}^K \left( \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}\right) ^{z_{im}}. \end{aligned}$$
(16)

Besides, we can extend Eq. (11) introducing \(\mathcal{V }\), as

$$\begin{aligned} p(\mathbf x _{i} \mid \mathcal{V }, \mathbf z _{i}, \varTheta ) = \prod _{m=1}^K \left( \prod _{j=1}^F p(x_{ij}\mid \theta _{mj})^{v_{mj}} p(x_{ij}\mid \lambda _{mj})^{1-v_{mj}}\right) ^{z_{im}}. \end{aligned}$$
(17)

The new density function, based on Eq. (1) and using Eqs. (10), (16), and (17), is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )&= \sum _\mathbf{z _{i}}\sum _\mathcal{V } p(\mathbf x _{i},\mathcal{V },\mathbf z _{i} \mid \varTheta ) \\&= \sum _\mathbf{z _{i}}\sum _\mathcal{V } p(\mathbf x _{i} \mid \mathcal{V }, \mathbf z _{i}, \varTheta ) p(\mathcal{V } \mid \mathbf z _{i}) p(\mathbf z _{i}). \end{aligned}$$

The summation over \(\mathbf z _{i}\) is solved as in Eq. (1), obtaining,

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta ) \!=\! \sum _\mathcal{V } \Biggl (\! \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F} \Bigl ( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \!\times \! [(1 \!-\! \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\Bigr ) \Biggr ). \end{aligned}$$

We can then resolve the summation over \(\mathcal{V }\) by summing over all the possible states of each \(v_{mj}\), as

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )\!&= \!\sum _{m=1}^K\pi _{m} \prod _{j=1}^{F} \sum _{v_{mj}=0}^1 \Bigl ([\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \!\times \! [(1 \!-\! \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}} \Bigr ) \\&= \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})\Bigr ). \end{aligned}$$

Taking into account that each component can be described in a different feature subspace, this is the new density function of an instance, as shown in Eq. (15).
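
As a concrete illustration, the short sketch below evaluates the density of Eq. (15) for a single instance, assuming univariate Gaussian relevant and irrelevant densities; the function and parameter names (instance_density, rho, theta_mu, theta_s2, lam_mu, lam_s2) are illustrative and not taken from the paper.

```python
# Hypothetical sketch of the per-instance density in Eq. (15): each feature j contributes
# through its component-specific density p(x_ij | theta_mj) with probability rho_mj, or
# through the "irrelevant" density p(x_ij | lambda_mj) with probability 1 - rho_mj.
import numpy as np
from scipy.stats import norm

def instance_density(x, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    """p(x_i | Theta) for one instance x of shape (F,); all parameters have shape (K, F)."""
    relevant = norm.pdf(x, loc=theta_mu, scale=np.sqrt(theta_s2))    # p(x_ij | theta_mj)
    irrelevant = norm.pdf(x, loc=lam_mu, scale=np.sqrt(lam_s2))      # p(x_ij | lambda_mj)
    per_feature = rho * relevant + (1.0 - rho) * irrelevant          # inner factor of Eq. (15)
    return np.sum(pi * np.prod(per_feature, axis=1))                 # sum over components m
```

Note that setting \(\rho _{mj} = 1\) for every feature recovers the standard mixture density of Eq. (1) with conditionally independent features.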

The new log-likelihood function that should be maximized, by extending Eq. (15) to all the instances, is

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X })&= \log p(\mathcal{X }|\varTheta ) = \log \prod _{i=1}^N p(\mathbf x _{i} \mid \varTheta ) \\&= \sum _{i=1}^N \Biggl ( \log \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj}) +(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj}) \Bigr ) \Biggr ). \end{aligned}$$

This is again difficult to compute since the summation over the components is inside the logarithm function. This equation would change if we knew the sets of latent variables, \(\mathcal{Z }\) and \(\mathcal V \). Again by extending Eqs. (10), (16), and (17) to all the data, we can write

$$\begin{aligned} p(\mathcal{X },\mathcal{Z },\mathcal V \mid \varTheta )&= \prod _{i=1}^N \prod _{m=1}^K \left( \prod _{j=1}^F p(x_{ij}\mid \theta _{mj})^{v_{mj}} p(x_{ij}\mid \lambda _{mj})^{1-v_{mj}}\right) ^{z_{im}}\\&\times \prod _{i=1}^N \prod _{m=1}^K \left( \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}\right) ^{z_{im}} \\&\times \prod _{i=1}^N \prod _{m=1}^K \pi _{m}^{z_{im}}, \end{aligned}$$

which can be simplified to

$$\begin{aligned} p(\mathcal{X },\mathcal{Z },\mathcal{V } \mid \varTheta )&= \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \nonumber \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ). \end{aligned}$$
(18)

We can obtain the complete-data log-likelihood function by taking the logarithm of the previous function as,

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })&= \log p(\mathcal{X },\mathcal{Z },\mathcal{V }\mid \varTheta ) \\&= \log \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ), \end{aligned}$$

and rearranging terms,

$$\begin{aligned}&\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V }) \nonumber \\&\quad = \sum _{i=1}^N \sum _{m=1}^K \Bigl ( z_{im}\log \pi _{m}+ \sum _{j=1}^F\left( z_{im} \left[ v_{mj} (\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \right. \right. \nonumber \\&\qquad + \left. \left. (1-v_{mj})(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj}))\right] \right) \Bigr ). \end{aligned}$$
(19)
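
For reference, a small sketch evaluating the complete-data log-likelihood of Eq. (19) for given binary assignments \(z_{im}\) and relevance indicators \(v_{mj}\) follows, again assuming univariate Gaussian densities; all names are illustrative rather than the paper's.

```python
# Illustrative evaluation of Eq. (19) for binary z (N, K) and v (K, F),
# with univariate Gaussian relevant/irrelevant densities.
import numpy as np
from scipy.stats import norm

def complete_data_loglik(X, z, v, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    log_rel = norm.logpdf(X[:, None, :], loc=theta_mu, scale=np.sqrt(theta_s2))  # (N, K, F)
    log_irr = norm.logpdf(X[:, None, :], loc=lam_mu, scale=np.sqrt(lam_s2))      # (N, K, F)
    per_feature = v * (np.log(rho) + log_rel) + (1 - v) * (np.log(1 - rho) + log_irr)
    return np.sum(z * (np.log(pi) + per_feature.sum(axis=2)))
```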

1.3 Appendix 3: Expectation of the complete-data log-likelihood function

Similarly to Eq. (13), the expectation of the complete-data log-likelihood function can be written as

$$\begin{aligned} \mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })] \!=\! \sum _\mathcal{Z }\sum _\mathcal{V } p(\mathcal{Z },\mathcal{V } \mid \mathcal{X },\varTheta ^{t-1}) \log p(\mathcal{X },\mathcal{Z },\mathcal{V }\mid \varTheta ). \end{aligned}$$

As in Eq. (14), the posterior distribution of the latent variables given the data, having fixed the parameters of the previous iteration \(t-1\), and using Eq. (18), can be written as

$$\begin{aligned} p(\mathcal{Z },\mathcal{V } \mid \mathcal{X },\varTheta ^{t-1})&\propto \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ). \end{aligned}$$

Before computing the expected values of each \(v_{mj}\) and each \(z_{im}\), we need to define some other necessary probabilities:

$$\begin{aligned} p(x_{ij},v_{mj}=1 \mid \theta _{mj}) = \rho _{mj}p(x_{ij}\mid \theta _{mj}), \end{aligned}$$

and, similarly

$$\begin{aligned} p(x_{ij}, v_{mj}=0 \mid \theta _{mj}) = (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj}). \end{aligned}$$

Taking both expressions into account, we have

$$\begin{aligned} p(x_{ij} \mid \theta _{mj})&= p(x_{ij},v_{mj}=1 \mid \theta _{mj}) + p(x_{ij}, v_{mj}=0 \mid \theta _{mj}) \\&= \rho _{mj}p(x_{ij}\mid \theta _{mj}) + (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj}). \end{aligned}$$

Now, as detailed after Eq. (14), we can calculate the expected value of each \(v_{mj}\), as

$$\begin{aligned}&\mathbb{E }_{v_{mj} \mid x_{ij},\theta _{mj}}[v_{mj}] = \gamma (v_{mj}) \\&\quad =\frac{ \rho _{mj}p(x_{ij}\mid \theta _{mj})}{\rho _{mj}p(x_{ij}\mid \theta _{mj}) + (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj})} \\&\quad =p(v_{mj}=1 \mid x_{ij}, \theta _{mj}). \end{aligned}$$

Using this, we calculate the expected value of each \(z_{im}\)

$$\begin{aligned}&\mathbb{E }_{z_{im} \mid \mathbf v _{m},\mathbf x _{i},\varvec{\theta }_{m}}[z_{im}] = \gamma (z_{im}) \\&\quad =\frac{\pi _{m} \prod _{j=1}^F[\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]}{\sum _{m^{\prime }=1}^K \pi _{m^{\prime }} \prod _{j=1}^F[\rho _{m^{\prime }j} p(x_{ij}\mid \theta _{m^{\prime }j})+(1 - \rho _{m^{\prime }j}) p(x_{ij}\mid \lambda _{m^{\prime }j})]} \\&\quad =p(z_{im} = 1 \mid \mathbf v _m, \mathbf x _{i}, \varvec{\theta }_{m}). \end{aligned}$$

Thus the expectation of the complete-data log-likelihood, as in Eq. (3) and using Eq. (19), is

$$\begin{aligned}&\mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })]\\&\quad = \sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \\&\qquad \times \Biggl ( \log \pi _{m} + \sum _{j=1}^F \Biggl ( \gamma (v_{mj}) (\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \\&\qquad + (1- \gamma (v_{mj}))(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj})) \Biggr ) \Biggr ). \end{aligned}$$

Then, for simplicity’s sake, we define

$$\begin{aligned} \gamma (u_{imj})&= \gamma (z_{im}) \gamma (v_{mj}),\\ \gamma (w_{imj})&= \gamma (z_{im}) (1-\gamma (v_{mj})). \end{aligned}$$

Now we can obtain the expectation of the complete-data log-likelihood as

$$\begin{aligned}&\mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })]\\&\quad =\sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \log \pi _{m}\\&\qquad + \sum _{i=1}^N \sum _{m=1}^K\sum _{j=1}^F \gamma (u_{imj})(\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \\&\qquad +\sum _{i=1}^N \sum _{m=1}^K\sum _{j=1}^F \gamma (w_{imj})(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj})). \end{aligned}$$
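
These expectations translate directly into an E-step for the projected model. The sketch below assumes univariate Gaussian relevant and irrelevant densities and keeps the instance index explicit in \(\gamma (v_{mj})\), which the notation above leaves implicit; all names (projected_e_step, rho, theta_mu, theta_s2, lam_mu, lam_s2) are illustrative.

```python
# A minimal sketch of the expectations in Appendix 3 (E-step of the projected model).
import numpy as np
from scipy.stats import norm

def projected_e_step(X, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    """Return gamma_z (N, K), gamma_u (N, K, F) and gamma_w (N, K, F)."""
    rel = norm.pdf(X[:, None, :], loc=theta_mu, scale=np.sqrt(theta_s2))   # p(x_ij | theta_mj)
    irr = norm.pdf(X[:, None, :], loc=lam_mu, scale=np.sqrt(lam_s2))       # p(x_ij | lambda_mj)
    mix = rho * rel + (1.0 - rho) * irr                                    # (N, K, F)
    gamma_v = rho * rel / mix                                              # E[v_mj], per instance
    weighted = pi * np.prod(mix, axis=2)                                   # pi_m prod_j [...]
    gamma_z = weighted / weighted.sum(axis=1, keepdims=True)               # E[z_im]
    gamma_u = gamma_z[:, :, None] * gamma_v                                # gamma(u_imj)
    gamma_w = gamma_z[:, :, None] * (1.0 - gamma_v)                        # gamma(w_imj)
    return gamma_z, gamma_u, gamma_w
```

In practice the product over features is better computed in log space to avoid numerical underflow; it is written directly here only for clarity.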

1.4 Appendix 4: M-step

The parameters are recalculated in the M-step to maximize the expectation of the complete-data log-likelihood function. As already mentioned, these updates are obtained by computing the partial derivatives of this expectation and setting them equal to zero. A univariate Gaussian distribution for each feature and component is used in this explanation, so \(\theta _{mj} = (\mu _{\theta _{mj}}, \sigma _{\theta _{mj}}^{2})\), and

$$\begin{aligned} \log p(x_{ij}\mid \theta _{mj}) = \log (\sigma _{\theta _{mj}}^{-1} (2\pi )^{-\frac{1}{2}}) - \frac{1}{2}(x_{ij}-\mu _{\theta _{mj}})^2\sigma _{\theta _{mj}}^{-2}. \end{aligned}$$

The detailed steps for updating each parameter follow; a compact code sketch collecting the resulting updates is given at the end of this appendix.

  • \(\pi _{m}\) is updated (see Footnote 3) using a Lagrange multiplier to enforce the constraint \(\sum _{m=1}^{C+1}\pi _{m} = 1\):

    $$\begin{aligned}&\frac{{\partial }}{{\partial \pi _{m}}}\left( \sum _{i=1}^L\sum _{m=1}^C z_{im}\log \pi _{m}\right. \\&\quad \qquad \;\;+ \sum _{i=L+1}^N\sum _{m=1}^{C+1}\gamma (z_{im})\log \pi _{m} \\&\quad \qquad \left. \;\;+\, \lambda \left( \sum _{m=1}^{C+1}\pi _{m} -1 \right) \right) = 0, \quad \forall m = 1, \ldots , C+1, \end{aligned}$$

    whose derivative is

    $$\begin{aligned} \sum _{i=1}^L z_{im} \frac{1}{\pi _{m}}+ \sum _{i=L+1}^N \gamma (z_{im}) \frac{1}{\pi _{m}} + \lambda = 0. \end{aligned}$$

    Multiplying both sides by \(\pi _{m}\) and summing over \(m\), with \(m = 1, \ldots , C+1\), we have \(\lambda = -N\), as

    $$\begin{aligned} - \lambda = \sum _{i=1}^L\sum _{m=1}^{C+1} z_{im} + \sum _{i=L+1}^N\sum _{m=1}^{C+1} \gamma (z_{im}) = N, \end{aligned}$$

    and then we update each \(\pi _{m}\) by using

    $$\begin{aligned} \pi _{m} = \frac{\sum _{i=1}^L z_{im}+\sum _{i=L+1}^N \gamma (z_{im})}{N}. \end{aligned}$$
  • \(\mu _{\theta _{mj}}\) is updated by solving the following partial derivative equation:

    $$\begin{aligned}&\frac{{\partial }}{{\partial \mu _{\theta _{mj}}}}\Biggl ( \sum _{i=1}^L\sum _{m=1}^C \sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \\&\qquad \qquad + \sum _{i=L+1}^N \sum _{m=1}^{C+1} \sum _{j=1}^F\gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \Biggr ) =0. \end{aligned}$$

    Then the result is

    $$\begin{aligned}&\sum _{i=1}^L \Bigl ( \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}x_{ij} - \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\mu _{\theta _{mj}} \Bigr ) \\&\quad + \sum _{i=L+1}^N \Bigl (\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}x_{ij} - \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\mu _{\theta _{mj}}\Bigr ) = 0, \end{aligned}$$

    and the value of the parameter can be found as

    $$\begin{aligned} \mu _{\theta _{mj}}&= \frac{\sum _{i=1}^L \gamma (u_{imj})x_{ij} + \sum _{i=L+1}^N \gamma (u_{imj})x_{ij}}{\sum _{i=1}^L \gamma (u_{imj})+ \sum _{i=L+1}^N \gamma (u_{imj})} \\&= \frac{\sum _{i=1}^N \gamma (u_{imj})x_{ij}}{\sum _{i=1}^N \gamma (u_{imj})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • And for \(\sigma _{\theta _{mj}}^{2}\),

    $$\begin{aligned}&\frac{{\partial }}{{\partial \sigma _{\theta _{mj}}^{2}}}\Biggl ( \sum _{i=1}^L\sum _{m=1}^C\sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \\&\qquad \qquad + \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj})\Biggr ) = 0. \end{aligned}$$

    The derivative is

    $$\begin{aligned}&\sum _{i=1}^L \Bigl (\gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2 \sigma _{\theta _{mj}}^{-4} -\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\Bigr ) \\&\quad + \sum _{i=L+1}^N \Bigl (\gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2 \sigma _{\theta _{mj}}^{-4} -\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\Bigr ) = 0, \end{aligned}$$

    and the parameter update is

    $$\begin{aligned} \sigma _{\theta _{mj}}^{2}&= \frac{\sum _{i=1}^L \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2+ \sum _{i=L+1}^N \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2}{\sum _{i=1}^L \gamma (u_{imj})+ \sum _{i=L+1}^N\gamma (u_{imj})}\\&= \frac{\sum _{i=1}^N \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2}{\sum _{i=1}^N\gamma (u_{imj})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$

    The same derivation applies to \(\lambda _{mj} = (\mu _{\lambda _{mj}}, \sigma _{\lambda _{mj}}^{2})\), but using \(\gamma (w_{imj})\) instead of \(\gamma (u_{imj})\) to indicate that feature \(j\) is irrelevant for component \(m\).

  • Then, we update \(\mu _{\lambda _{mj}}\) as

    $$\begin{aligned} \mu _{\lambda _{mj}} = \frac{\sum _{i=1}^N\gamma (w_{imj})x_{ij}}{\sum _{i=1}^N \gamma (w_{imj})}, \quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • And for \(\sigma _{\lambda _{mj}}^{2}\)

    $$\begin{aligned} \sigma _{\lambda _{mj}}^{2} = \frac{\sum _{i=1}^N \gamma (w_{imj})(x_{ij}-\mu _{\lambda _{mj}})^2}{\sum _{i=1}^N \gamma (w_{imj})}, \quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • Finally, in \(\mathcal{M }^1\), \(\rho _{mj}\) is updated by solving

    $$\begin{aligned}&\frac{{\partial }}{{\partial \rho _{mj}}} \Biggl ( \sum _{i=1}^L\sum _{m=1}^C \sum _{j=1}^F \gamma (u_{imj})\log \rho _{mj}\\&\qquad \quad \;\;+ \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (u_{imj})\log \rho _{mj} \\&\qquad \quad \;\;+\sum _{i=1}^L \sum _{m=1}^C \sum _{j=1}^F \gamma (w_{imj})\log (1- \rho _{mj}) \\&\qquad \quad \;\;+ \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (w_{imj})\log (1- \rho _{mj}) \Biggr )= 0, \end{aligned}$$

    which, after computing the partial derivative, gives

    $$\begin{aligned} \sum _{i=1}^N \gamma (u_{imj})\frac{1}{\rho _{mj}} - \sum _{i=1}^N \gamma (w_{imj}) \frac{1}{1-\rho _{mj}}=0. \end{aligned}$$

    This parameter is updated by

    $$\begin{aligned} \rho _{mj} = \frac{\sum _{i=1}^N \gamma (u_{imj})}{\sum _{i=1}^L z_{im} + \sum _{i=L+1}^N \gamma (z_{im})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$

    Note that \(z_{i,C+1} = 0\) for \(i = 1,\ldots ,L\) for the three sets of parameters, \(\theta _{mj}\), \(\lambda _{mj}\) and \(\rho _{mj}\).
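
The updates above can be collected into a single routine once the responsibilities are available. The sketch below assumes that gamma_z already holds the known labels \(z_{im}\) (as 0/1 values) for the \(L\) labeled instances and the responsibilities \(\gamma (z_{im})\) for the unlabeled ones, so that the split sums over \(i\) collapse into single sums from 1 to \(N\); all names are illustrative, and the small eps only guards against empty components.

```python
# A compact sketch of the M-step updates of Appendix 4 (not the paper's code).
import numpy as np

def projected_m_step(X, gamma_z, gamma_u, gamma_w, eps=1e-12):
    N, F = X.shape
    pi = gamma_z.sum(axis=0) / N                                      # update of pi_m
    su = gamma_u.sum(axis=0) + eps                                    # sum_i gamma(u_imj)
    sw = gamma_w.sum(axis=0) + eps                                    # sum_i gamma(w_imj)
    theta_mu = np.einsum('ikj,ij->kj', gamma_u, X) / su               # mu_{theta_mj}
    theta_s2 = np.einsum('ikj,ikj->kj', gamma_u,
                         (X[:, None, :] - theta_mu) ** 2) / su        # sigma^2_{theta_mj}
    lam_mu = np.einsum('ikj,ij->kj', gamma_w, X) / sw                 # mu_{lambda_mj}
    lam_s2 = np.einsum('ikj,ikj->kj', gamma_w,
                       (X[:, None, :] - lam_mu) ** 2) / sw            # sigma^2_{lambda_mj}
    rho = gamma_u.sum(axis=0) / (gamma_z.sum(axis=0)[:, None] + eps)  # rho_mj
    return pi, rho, theta_mu, theta_s2, lam_mu, lam_s2
```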

About this article

Cite this article

Guerra, L., Bielza, C., Robles, V. et al. Semi-supervised projected model-based clustering. Data Min Knowl Disc 28, 882–917 (2014). https://doi.org/10.1007/s10618-013-0323-0
