Semi-supervised projected model-based clustering

Abstract

We present an adaptation of model-based clustering for partially labeled data that is capable of discovering hidden cluster labels. Both the originally known clusters and the discoverable ones are represented using localized feature subset selections (subspaces), yielding clusters that cannot be found by global feature subset selection. The semi-supervised projected model-based clustering algorithm (SeSProC) also includes a novel model selection approach that uses a greedy forward search to estimate the final number of clusters. The quality of SeSProC is assessed on synthetic data, demonstrating its effectiveness, under different data conditions, not only at classifying instances with known labels but also at discovering completely hidden clusters in different subspaces. Moreover, SeSProC outperforms three related baseline algorithms in most scenarios on both synthetic and real data sets.

Notes

  1. Note that “class”, “component”, and “cluster” are equivalent concepts at the end of the classification, but each concept is used here to refer, respectively, to a priori knowledge about the instances (classes), mixture components (components), or identified groups (clusters).

  2. Note that, for legibility, the notation related to iterations is used with \(\varTheta \), but not with \(\varvec{\theta }\) throughout the paper.

  3. Note that, in theory, the classification term only iterates until \(m = C\), but we can assume that this iteration finishes at \(m=C+1\) with \(z_{i,C+1} = 0\), \(\forall i = 1,\ldots ,L\).

Acknowledgments

This research is partially supported by the Spanish Ministry of Economy and Competitiveness projects TIN2010-20900-C04-04 and TIN2010-21289-C02-02, the Cajal Blue Brain project, and Consolider Ingenio 2010-CSD2007-00018. The authors gratefully acknowledge the computer resources, technical expertise and assistance provided by the Centro de Supercomputación y Visualización de Madrid (CeSViMa). The authors are also very grateful for the useful comments and suggestions made by the anonymous reviewers, which have definitely contributed to improving the manuscript.

Author information

Corresponding author

Correspondence to Luis Guerra.

Additional information

Communicated by Charu Aggarwal.

Appendices

1.1 Appendix 1: Basic EM theory

The density function of an instance \(\mathbf x _{i}\) is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta ) = \sum _{m=1}^K\pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m}). \end{aligned}$$

We can define a binary random vector \(\mathbf z _{i} = (z_{i1}, \ldots ,z_{iK})\), with \(z_{im} = 1\) if instance \(\mathbf x _{i}\) belongs to component \(m\) and all other elements \(z_{im^{\prime }} = 0\), \(\forall \) \(m^{\prime } \ne m\). Moreover, \(p(z_{im} = 1) = \pi _{m}\). Therefore, we can write

$$\begin{aligned} p(\mathbf z _{i}) =\prod _{m=1}^K \pi _{m}^{z_{im}}. \end{aligned}$$
(10)

Also, \(p(\mathbf x _{i}\mid z_{im}=1) = p(\mathbf x _{i}\mid \varvec{\theta }_{m})\), which, extended to the whole vector \(\mathbf z _{i}\), is

$$\begin{aligned} p(\mathbf x _{i}\mid \mathbf z _{i}, \varTheta ) =\prod _{m=1}^K p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}}. \end{aligned}$$
(11)

Using Eqs. (10) and (11), Eq. (1) is obtained by summing over all possible states of \(\mathbf z _{i}\)

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )&= \sum _\mathbf{z _{i}} p(\mathbf x _{i}, \mathbf z _{i} \mid \varTheta )= \sum _\mathbf{z _{i}} p(\mathbf z _{i})p(\mathbf x _{i}\mid \mathbf z _{i}, \varTheta ) \\&= \sum _\mathbf{z _{i}}\left( \prod _{m=1}^K \pi _{m}^{z_{im}}\prod _{m=1}^K p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}}\right) \\&= \sum _{m=1}^K\pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m}). \end{aligned}$$

This mixture of distributions has unknown parameters in \(\varTheta \) that must be estimated, which can be done by maximum likelihood. Therefore, assuming that the instances are independent and identically distributed (i.i.d.), and building the log-likelihood function (\(\log L\)) from Eq. (1) over all the instances, we obtain

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X })&= \log p(\mathcal{X }|\varTheta ) \\&= \log \prod _{i=1}^N p(\mathbf x _{i} \mid \varTheta ) \\&= \sum _{i=1}^N\log \left( \sum _{m=1}^K \pi _{m} p(\mathbf x _{i}\mid \varvec{\theta }_{m})\right) . \end{aligned}$$

This log-likelihood function is difficult to maximize because the summation over the components is inside the logarithm function. The log-likelihood function would change if both the latent variables (\(\mathcal{Z }\)) and the observable data (\(\mathcal{X }\)) were known. Then, based on Eqs. (10) and (11), we can define the complete-data log-likelihood function as

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X },\mathcal{Z })&= \log p(\mathcal{X },\mathcal{Z }|\varTheta ) \nonumber \\&= \log \prod _{i=1}^N \prod _{m=1}^K \pi _{m}^{z_{im}}p(\mathbf x _{i}\mid \varvec{\theta }_{m})^{z_{im}} \nonumber \\&= \sum _{i=1}^N \sum _{m=1}^K z_{im} \left( \log \pi _{m} + \log p (\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) . \end{aligned}$$
(12)

The maximization of this complete-data log-likelihood function is straightforward because the summation is outside the logarithm. Since the latent variables are unknown, we cannot use this function directly. However, we can take the expectation of this log-likelihood function with respect to the posterior distribution of the latent variables. This expectation is calculated at iteration \(t\), with the parameters fixed at their values from the previous iteration \(t-1\), in the E-step of the EM algorithm. After this, the parameters of the distributions are recalculated to maximize this expectation (M-step). These two steps are repeated until a convergence criterion is reached. Hence, the expectation of the complete-data log-likelihood function is given by

$$\begin{aligned} \mathcal{Q }(\varTheta ,\varTheta ^{t-1})&= \mathbb{E }_{\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z })] \nonumber \\&= \sum _\mathcal{Z } p(\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}) \log p(\mathcal{X },\mathcal{Z }\mid \varTheta ), \end{aligned}$$
(13)

where, using Eq. (12), the posterior distribution of the latent variables given the data and the parameters of the previous iteration \(t-1\) is

$$\begin{aligned} p(\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}) \propto \prod _{i=1}^N \prod _{m=1}^K\left( \pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) ^{z_{im}}. \end{aligned}$$
(14)

This factorizes over \(i\) so that the \(\{ \mathbf z _{i} \}\) in this distribution are independent. Using this posterior distribution and Bayes’ theorem, we can calculate the expected value of each \(z_{im}\) (responsibility) as

$$\begin{aligned} \mathbb{E }_{z_{im}\mid \mathbf x _{i},\varvec{\theta }_{m}}[z_{im}]&= \gamma (z_{im}) \\&= \frac{\sum _{z_{im}} z_{im}(\pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m}))^{z_{im}}}{\sum _{z_{im^{\prime }}}(\pi _{m^{\prime }}p(\mathbf x _{i}\mid \varvec{\theta }_{m^{\prime }}))^{z_{im^{\prime }}}} \\&= \frac{\pi _{m}p(\mathbf x _{i}\mid \varvec{\theta }_{m})}{\sum _{m^{\prime }=1}^K\pi _{m^{\prime }}p(\mathbf x _{i}\mid \varvec{\theta }_{m^{\prime }})}\\&= p(z_{im}=1 \mid \mathbf x _{i}, \varvec{\theta }_{m}), \end{aligned}$$

which we can use to calculate the expectation of the complete-data log-likelihood, as

$$\begin{aligned} \mathbb{E }_{\mathcal{Z }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z })] = \sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \left( \log \pi _{m} + \log p (\mathbf x _{i}\mid \varvec{\theta }_{m}) \right) . \end{aligned}$$
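
As an illustration of the quantities above, the following minimal Python sketch computes the responsibilities \(\gamma (z_{im})\) and evaluates the expectation of the complete-data log-likelihood for a mixture with one univariate Gaussian per feature and component. It is only a sketch under that Gaussian assumption, not the paper's implementation; all names (log_component_densities, responsibilities, expected_loglik, X, pi, mu, sigma2) are illustrative.

```python
# Minimal sketch of the E-step quantities in Appendix 1 for a mixture with one
# univariate Gaussian per feature and component (diagonal model).
import numpy as np

def log_component_densities(X, mu, sigma2):
    """log p(x_i | theta_m) as a sum of univariate Gaussian log-densities over features."""
    # X: (N, F), mu/sigma2: (K, F)  ->  result: (N, K)
    quad = (X[:, None, :] - mu) ** 2 / sigma2
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + quad, axis=2)

def responsibilities(X, pi, mu, sigma2):
    """gamma(z_im) = pi_m p(x_i | theta_m) / sum_m' pi_m' p(x_i | theta_m')."""
    log_num = np.log(pi) + log_component_densities(X, mu, sigma2)
    return np.exp(log_num - np.logaddexp.reduce(log_num, axis=1, keepdims=True))

def expected_loglik(X, pi, mu, sigma2):
    """Expectation of the complete-data log-likelihood (last equation of Appendix 1)."""
    gamma = responsibilities(X, pi, mu, sigma2)
    return np.sum(gamma * (np.log(pi) + log_component_densities(X, mu, sigma2)))
```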

1.2 Appendix 2: Including subspaces

Defining, for each component and feature, \(\rho _{mj} = p(v_{mj}=1)\), the probability of feature \(j\) being relevant to component \(m\), the new density function, including the search for subspaces, is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )= \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})\Bigr ). \end{aligned}$$
(15)

To derive this new density function, we first write, for a component \(m\) and an instance \(i\),

$$\begin{aligned} p(\mathbf v _{m}\mid z_{im} = 1) = \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}. \end{aligned}$$

This can be extended for all components as

$$\begin{aligned} p(\mathcal{V }\mid \mathbf z _{i}) = \prod _{m=1}^K \left( \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}\right) ^{z_{im}}. \end{aligned}$$
(16)

Besides, we can extend Eq. (11) introducing \(\mathcal{V }\), as

$$\begin{aligned} p(\mathbf x _{i} \mid \mathcal{V }, \mathbf z _{i}, \varTheta ) = \prod _{m=1}^K \left( \prod _{j=1}^F p(x_{ij}\mid \theta _{mj})^{v_{mj}} p(x_{ij}\mid \lambda _{mj})^{1-v_{mj}}\right) ^{z_{im}}. \end{aligned}$$
(17)

The new density function, based on Eq. (1) and using Eqs. (10), (16), and (17), is

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )&= \sum _\mathbf{z _{i}}\sum _\mathcal{V } p(\mathbf x _{i},\mathcal{V },\mathbf z _{i} \mid \varTheta ) \\&= \sum _\mathbf{z _{i}}\sum _\mathcal{V } p(\mathbf x _{i} \mid \mathcal{V }, \mathbf z _{i}, \varTheta ) p(\mathcal{V } \mid \mathbf z _{i}) p(\mathbf z _{i}). \end{aligned}$$

The summation over \(\mathbf z _{i}\) is solved as in Eq. (1), obtaining,

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta ) \!=\! \sum _\mathcal{V } \Biggl (\! \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F} \Bigl ( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \!\times \! [(1 \!-\! \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\Bigr ) \Biggr ). \end{aligned}$$

We can then resolve the summation over \(\mathcal{V }\) by summing over all the possible states of each \(v_{mj}\), as

$$\begin{aligned} p(\mathbf x _{i} \mid \varTheta )\!&= \!\sum _{m=1}^K\pi _{m} \prod _{j=1}^{F} \sum _{v_{mj}=0}^1 \Bigl ([\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \!\times \! [(1 \!-\! \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}} \Bigr ) \\&= \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})\Bigr ). \end{aligned}$$

Taking into account that each component can be described in a different feature subspace, this is the new density function of an instance, as shown in Eq. (15).
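
As a concrete illustration, the short sketch below evaluates the density of Eq. (15) for a single instance, assuming univariate Gaussian relevant and irrelevant densities; the function and parameter names (instance_density, rho, theta_mu, theta_s2, lam_mu, lam_s2) are illustrative and not taken from the paper.

```python
# Hypothetical sketch of the per-instance density in Eq. (15): each feature j contributes
# through its component-specific density p(x_ij | theta_mj) with probability rho_mj, or
# through the "irrelevant" density p(x_ij | lambda_mj) with probability 1 - rho_mj.
import numpy as np
from scipy.stats import norm

def instance_density(x, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    """p(x_i | Theta) for one instance x of shape (F,); all parameters have shape (K, F)."""
    relevant = norm.pdf(x, loc=theta_mu, scale=np.sqrt(theta_s2))    # p(x_ij | theta_mj)
    irrelevant = norm.pdf(x, loc=lam_mu, scale=np.sqrt(lam_s2))      # p(x_ij | lambda_mj)
    per_feature = rho * relevant + (1.0 - rho) * irrelevant          # inner factor of Eq. (15)
    return np.sum(pi * np.prod(per_feature, axis=1))                 # sum over components m
```

Note that setting \(\rho _{mj} = 1\) for every feature recovers the standard mixture density of Eq. (1) with conditionally independent features.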

The new log-likelihood function that should be maximized, by extending Eq. (15) to all the instances, is

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X })&= \log p(\mathcal{X }|\varTheta ) = \log \prod _{i=1}^N p(\mathbf x _{i} \mid \varTheta ) \\&= \sum _{i=1}^N \Biggl ( \log \sum _{m=1}^K \pi _{m}\prod _{j=1}^{F}\Bigl (\rho _{mj} p(x_{ij}\mid \theta _{mj}) +(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj}) \Bigr ) \Biggr ). \end{aligned}$$

This is again difficult to compute since the summation over the components is inside the logarithm function. This equation would change if we knew the sets of latent variables, \(\mathcal{Z }\) and \(\mathcal V \). Again by extending Eqs. (10), (16), and (17) to all the data, we can write

$$\begin{aligned} p(\mathcal{X },\mathcal{Z },\mathcal V \mid \varTheta )&= \prod _{i=1}^N \prod _{m=1}^K \left( \prod _{j=1}^F p(x_{ij}\mid \theta _{mj})^{v_{mj}} p(x_{ij}\mid \lambda _{mj})^{1-v_{mj}}\right) ^{z_{im}}\\&\times \prod _{i=1}^N \prod _{m=1}^K \left( \prod _{j=1}^{F}(\rho _{mj})^{v_{mj}}(1 - \rho _{mj})^{1-v_{mj}}\right) ^{z_{im}} \\&\times \prod _{i=1}^N \prod _{m=1}^K \pi _{m}^{z_{im}}, \end{aligned}$$

which can be simplified to

$$\begin{aligned} p(\mathcal{X },\mathcal{Z },\mathcal{V } \mid \varTheta )&= \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \nonumber \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ). \end{aligned}$$
(18)

We can obtain the complete-data log-likelihood function by taking the logarithm of the previous function as,

$$\begin{aligned} \log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })&= \log p(\mathcal{X },\mathcal{Z },\mathcal{V }\mid \varTheta ) \\&= \log \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ), \end{aligned}$$

and rearranging terms,

$$\begin{aligned}&\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V }) \nonumber \\&\quad = \sum _{i=1}^N \sum _{m=1}^K \Bigl ( z_{im}\log \pi _{m}+ \sum _{j=1}^F\left( z_{im} \left[ v_{mj} (\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \right. \right. \nonumber \\&\qquad + \left. \left. (1-v_{mj})(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj}))\right] \right) \Bigr ). \end{aligned}$$
(19)
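
For reference, a small sketch evaluating the complete-data log-likelihood of Eq. (19) for given binary assignments \(z_{im}\) and relevance indicators \(v_{mj}\) follows, again assuming univariate Gaussian densities; all names are illustrative rather than the paper's.

```python
# Illustrative evaluation of Eq. (19) for binary z (N, K) and v (K, F),
# with univariate Gaussian relevant/irrelevant densities.
import numpy as np
from scipy.stats import norm

def complete_data_loglik(X, z, v, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    log_rel = norm.logpdf(X[:, None, :], loc=theta_mu, scale=np.sqrt(theta_s2))  # (N, K, F)
    log_irr = norm.logpdf(X[:, None, :], loc=lam_mu, scale=np.sqrt(lam_s2))      # (N, K, F)
    per_feature = v * (np.log(rho) + log_rel) + (1 - v) * (np.log(1 - rho) + log_irr)
    return np.sum(z * (np.log(pi) + per_feature.sum(axis=2)))
```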

1.3 Appendix 3: Expectation of the complete-data log-likelihood function

Similarly to Eq. (13), the expectation of the complete-data log-likelihood function can be written as

$$\begin{aligned} \mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })] \!=\! \sum _\mathcal{Z }\sum _\mathcal{V } p(\mathcal{Z },\mathcal{V } \mid \mathcal{X },\varTheta ^{t-1}) \log p(\mathcal{X },\mathcal{Z },\mathcal{V }\mid \varTheta ). \end{aligned}$$

As in Eq. (14), the posterior distribution of the latent variables given the data, having fixed the parameters of the previous iteration \(t-1\), and using Eq. (18), can be written as

$$\begin{aligned} p(\mathcal{Z },\mathcal{V } \mid \mathcal{X },\varTheta ^{t-1})&\propto \prod _{i=1}^N \prod _{m=1}^K \Biggl ( \pi _{m}^{z_{im}} \prod _{j=1}^F \left( [\rho _{mj} p(x_{ij}\mid \theta _{mj})]^{v_{mj}} \right. \\&\times \left. [(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]^{1-v_{mj}}\right) ^{z_{im}} \Biggr ). \end{aligned}$$

Before computing the expected values of each \(v_{mj}\) and each \(z_{im}\), we need to define some other necessary probabilities:

$$\begin{aligned} p(x_{ij},v_{mj}=1 \mid \theta _{mj}) = \rho _{mj}p(x_{ij}\mid \theta _{mj}), \end{aligned}$$

and, similarly

$$\begin{aligned} p(x_{ij}, v_{mj}=0 \mid \theta _{mj}) = (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj}). \end{aligned}$$

Taking both expressions into account, we have

$$\begin{aligned} p(x_{ij} \mid \theta _{mj})&= p(x_{ij},v_{mj}=1 \mid \theta _{mj}) + p(x_{ij}, v_{mj}=0 \mid \theta _{mj}) \\&= \rho _{mj}p(x_{ij}\mid \theta _{mj}) + (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj}). \end{aligned}$$

Now, as detailed after Eq. (14), we can calculate the expected value of each \(v_{mj}\), as

$$\begin{aligned}&\mathbb{E }_{v_{mj} \mid x_{ij},\theta _{mj}}[v_{mj}] = \gamma (v_{mj}) \\&\quad =\frac{ \rho _{mj}p(x_{ij}\mid \theta _{mj})}{\rho _{mj}p(x_{ij}\mid \theta _{mj}) + (1-\rho _{mj})p(x_{ij}\mid \lambda _{mj})} \\&\quad =p(v_{mj}=1 \mid x_{ij}, \theta _{mj}). \end{aligned}$$

Using this, we calculate the expected value of each \(z_{im}\)

$$\begin{aligned}&\mathbb{E }_{z_{im} \mid \mathbf v _{m},\mathbf x _{i},\varvec{\theta }_{m}}[z_{im}] = \gamma (z_{im}) \\&\quad =\frac{\pi _{m} \prod _{j=1}^F[\rho _{mj} p(x_{ij}\mid \theta _{mj})+(1 - \rho _{mj}) p(x_{ij}\mid \lambda _{mj})]}{\sum _{m^{\prime }=1}^K \pi _{m^{\prime }} \prod _{j=1}^F[\rho _{m^{\prime }j} p(x_{ij}\mid \theta _{m^{\prime }j})+(1 - \rho _{m^{\prime }j}) p(x_{ij}\mid \lambda _{m^{\prime }j})]} \\&\quad =p(z_{im} = 1 \mid \mathbf v _m, \mathbf x _{i}, \varvec{\theta }_{m}). \end{aligned}$$

Thus the expectation of the complete-data log-likelihood, as in Eq. (3) and using Eq. (19), is

$$\begin{aligned}&\mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })]\\&\quad = \sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \\&\qquad \times \Biggl ( \log \pi _{m} + \sum _{j=1}^F \Biggl ( \gamma (v_{mj}) (\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \\&\qquad + (1- \gamma (v_{mj}))(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj})) \Biggr ) \Biggr ). \end{aligned}$$

Then, for simplicity’s sake, we define

$$\begin{aligned} \gamma (u_{imj})&= \gamma (z_{im}) \gamma (v_{mj}),\\ \gamma (w_{imj})&= \gamma (z_{im}) (1-\gamma (v_{mj})). \end{aligned}$$

Now we can obtain the expectation of the complete-data log-likelihood as

$$\begin{aligned}&\mathbb{E }_{\mathcal{Z },\mathcal{V }\mid \mathcal{X },\varTheta ^{t-1}}[\log L(\varTheta \mid \mathcal{X },\mathcal{Z },\mathcal{V })]\\&\quad =\sum _{i=1}^N \sum _{m=1}^K \gamma (z_{im}) \log \pi _{m}\\&\qquad + \sum _{i=1}^N \sum _{m=1}^K\sum _{j=1}^F \gamma (u_{imj})(\log \rho _{mj} + \log p(x_{ij}\mid \theta _{mj})) \\&\qquad +\sum _{i=1}^N \sum _{m=1}^K\sum _{j=1}^F \gamma (w_{imj})(\log (1 - \rho _{mj}) + \log p(x_{ij}\mid \lambda _{mj})). \end{aligned}$$
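
These expectations translate directly into an E-step for the projected model. The sketch below assumes univariate Gaussian relevant and irrelevant densities and keeps the instance index explicit in \(\gamma (v_{mj})\), which the notation above leaves implicit; all names (projected_e_step, rho, theta_mu, theta_s2, lam_mu, lam_s2) are illustrative.

```python
# A minimal sketch of the expectations in Appendix 3 (E-step of the projected model).
import numpy as np
from scipy.stats import norm

def projected_e_step(X, pi, rho, theta_mu, theta_s2, lam_mu, lam_s2):
    """Return gamma_z (N, K), gamma_u (N, K, F) and gamma_w (N, K, F)."""
    rel = norm.pdf(X[:, None, :], loc=theta_mu, scale=np.sqrt(theta_s2))   # p(x_ij | theta_mj)
    irr = norm.pdf(X[:, None, :], loc=lam_mu, scale=np.sqrt(lam_s2))       # p(x_ij | lambda_mj)
    mix = rho * rel + (1.0 - rho) * irr                                    # (N, K, F)
    gamma_v = rho * rel / mix                                              # E[v_mj], per instance
    weighted = pi * np.prod(mix, axis=2)                                   # pi_m prod_j [...]
    gamma_z = weighted / weighted.sum(axis=1, keepdims=True)               # E[z_im]
    gamma_u = gamma_z[:, :, None] * gamma_v                                # gamma(u_imj)
    gamma_w = gamma_z[:, :, None] * (1.0 - gamma_v)                        # gamma(w_imj)
    return gamma_z, gamma_u, gamma_w
```

In practice the product over features is better computed in log space to avoid numerical underflow; it is written directly here only for clarity.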

1.4 Appendix 4: M-step

The parameters are recalculated in the M-step to maximize the expectation of the complete-data log-likelihood function. As already mentioned, these updates are obtained by computing the partial derivatives of this expectation and setting them equal to zero. A univariate Gaussian distribution for each feature and component is used in this explanation, so \(\theta _{mj} = (\mu _{\theta _{mj}}, \sigma _{\theta _{mj}}^{2})\), and

$$\begin{aligned} \log p(x_{ij}\mid \theta _{mj}) = \log (\sigma _{\theta _{mj}}^{-1} (2\pi )^{-\frac{1}{2}}) - \frac{1}{2}(x_{ij}-\mu _{\theta _{mj}})^2\sigma _{\theta _{mj}}^{-2}. \end{aligned}$$

The detailed steps for updating each parameter follow; a compact code sketch collecting the resulting updates is given at the end of this appendix.

  • \(\pi _{m}\) is updated (see Footnote 3) using a Lagrange multiplier to enforce the constraint \(\sum _{m=1}^{C+1}\pi _{m} = 1\):

    $$\begin{aligned}&\frac{{\partial }}{{\partial \pi _{m}}}\left( \sum _{i=1}^L\sum _{m=1}^C z_{im}\log \pi _{m}\right. \\&\quad \qquad \;\;+ \sum _{i=L+1}^N\sum _{m=1}^{C+1}\gamma (z_{im})\log \pi _{m} \\&\quad \qquad \left. \;\;+\, \lambda \left( \sum _{m=1}^{C+1}\pi _{m} -1 \right) \right) = 0, \quad \forall m = 1, \ldots , C+1, \end{aligned}$$

    whose derivative is

    $$\begin{aligned} \sum _{i=1}^L z_{im} \frac{1}{\pi _{m}}+ \sum _{i=L+1}^N \gamma (z_{im}) \frac{1}{\pi _{m}} + \lambda = 0. \end{aligned}$$

    Multiplying both sides by \(\pi _{m}\) and summing over \(m\), with \(m = 1, \ldots , C+1\), we have \(\lambda = -N\), as

    $$\begin{aligned} - \lambda = \sum _{i=1}^L\sum _{m=1}^{C+1} z_{im} + \sum _{i=L+1}^N\sum _{m=1}^{C+1} \gamma (z_{im}) = N, \end{aligned}$$

    and then we update each \(\pi _{m}\) by using

    $$\begin{aligned} \pi _{m} = \frac{\sum _{i=1}^L z_{im}+\sum _{i=L+1}^N \gamma (z_{im})}{N}. \end{aligned}$$
  • \(\mu _{\theta _{mj}}\) is updated by solving the following partial derivative equation:

    $$\begin{aligned}&\frac{{\partial }}{{\partial \mu _{\theta _{mj}}}}\Biggl ( \sum _{i=1}^L\sum _{m=1}^C \sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \\&\qquad \qquad + \sum _{i=L+1}^N \sum _{m=1}^{C+1} \sum _{j=1}^F\gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \Biggr ) =0. \end{aligned}$$

    Then the result is

    $$\begin{aligned}&\sum _{i=1}^L \Bigl ( \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}x_{ij} - \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\mu _{\theta _{mj}} \Bigr ) \\&\quad + \sum _{i=L+1}^N \Bigl (\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}x_{ij} - \gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\mu _{\theta _{mj}}\Bigr ) = 0, \end{aligned}$$

    and the value of the parameter can be found as

    $$\begin{aligned} \mu _{\theta _{mj}}&= \frac{\sum _{i=1}^L \gamma (u_{imj})x_{ij} + \sum _{i=L+1}^N \gamma (u_{imj})x_{ij}}{\sum _{i=1}^L \gamma (u_{imj})+ \sum _{i=L+1}^N \gamma (u_{imj})} \\&= \frac{\sum _{i=1}^N \gamma (u_{imj})x_{ij}}{\sum _{i=1}^N \gamma (u_{imj})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • And for \(\sigma _{\theta _{mj}}^{2}\),

    $$\begin{aligned}&\frac{{\partial }}{{\partial \sigma _{\theta _{mj}}^{2}}}\Biggl ( \sum _{i=1}^L\sum _{m=1}^C\sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj}) \\&\qquad \qquad + \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (u_{imj})\log p(x_{ij}\mid \theta _{mj})\Biggr ) = 0. \end{aligned}$$

    The derivative is

    $$\begin{aligned}&\sum _{i=1}^L \Bigl (\gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2 \sigma _{\theta _{mj}}^{-4} -\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\Bigr ) \\&\quad + \sum _{i=L+1}^N \Bigl (\gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2 \sigma _{\theta _{mj}}^{-4} -\gamma (u_{imj})\sigma _{\theta _{mj}}^{-2}\Bigr ) = 0, \end{aligned}$$

    and the parameter update is

    $$\begin{aligned} \sigma _{\theta _{mj}}^{2}&= \frac{\sum _{i=1}^L \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2+ \sum _{i=L+1}^N \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2}{\sum _{i=1}^L \gamma (u_{imj})+ \sum _{i=L+1}^N\gamma (u_{imj})}\\&= \frac{\sum _{i=1}^N \gamma (u_{imj})(x_{ij}-\mu _{\theta _{mj}})^2}{\sum _{i=1}^N\gamma (u_{imj})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$

    The same derivation applies to \(\lambda _{mj} = (\mu _{\lambda _{mj}}, \sigma _{\lambda _{mj}}^{2})\), but using \(\gamma (w_{imj})\) instead of \(\gamma (u_{imj})\) to indicate that feature \(j\) is irrelevant for component \(m\).

  • Then, we update \(\mu _{\lambda _{mj}}\) as

    $$\begin{aligned} \mu _{\lambda _{mj}} = \frac{\sum _{i=1}^N\gamma (w_{imj})x_{ij}}{\sum _{i=1}^N \gamma (w_{imj})}, \quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • And for \(\sigma _{\lambda _{mj}}^{2}\)

    $$\begin{aligned} \sigma _{\lambda _{mj}}^{2} = \frac{\sum _{i=1}^N \gamma (w_{imj})(x_{ij}-\mu _{\lambda _{mj}})^2}{\sum _{i=1}^N \gamma (w_{imj})}, \quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$
  • Finally, in \(\mathcal{M }^1\), \(\rho _{mj}\) is updated by solving

    $$\begin{aligned}&\frac{{\partial }}{{\partial \rho _{mj}}} \Biggl ( \sum _{i=1}^L\sum _{m=1}^C \sum _{j=1}^F \gamma (u_{imj})\log \rho _{mj}\\&\qquad \quad \;\;+ \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (u_{imj})\log \rho _{mj} \\&\qquad \quad \;\;+\sum _{i=1}^L \sum _{m=1}^C \sum _{j=1}^F \gamma (w_{imj})\log (1- \rho _{mj}) \\&\qquad \quad \;\;+ \sum _{i=L+1}^N \sum _{m=1}^{C+1}\sum _{j=1}^F \gamma (w_{imj})\log (1- \rho _{mj}) \Biggr )= 0, \end{aligned}$$

    which, after computing the partial derivative, gives

    $$\begin{aligned} \sum _{i=1}^N \gamma (u_{imj})\frac{1}{\rho _{mj}} - \sum _{i=1}^N \gamma (w_{imj}) \frac{1}{1-\rho _{mj}}=0. \end{aligned}$$

    This parameter is updated by

    $$\begin{aligned} \rho _{mj} = \frac{\sum _{i=1}^N \gamma (u_{imj})}{\sum _{i=1}^L z_{im} + \sum _{i=L+1}^N \gamma (z_{im})},\quad \forall m = 1, \ldots , C+1; j = 1,\ldots ,F. \end{aligned}$$

    Note that \(z_{i,C+1} = 0\) for \(i = 1,\ldots ,L\) for the three sets of parameters, \(\theta _{mj}\), \(\lambda _{mj}\) and \(\rho _{mj}\).
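
The updates above can be collected into a single routine once the responsibilities are available. The sketch below assumes that gamma_z already holds the known labels \(z_{im}\) (as 0/1 values) for the \(L\) labeled instances and the responsibilities \(\gamma (z_{im})\) for the unlabeled ones, so that the split sums over \(i\) collapse into single sums from 1 to \(N\); all names are illustrative, and the small eps only guards against empty components.

```python
# A compact sketch of the M-step updates of Appendix 4 (not the paper's code).
import numpy as np

def projected_m_step(X, gamma_z, gamma_u, gamma_w, eps=1e-12):
    N, F = X.shape
    pi = gamma_z.sum(axis=0) / N                                      # update of pi_m
    su = gamma_u.sum(axis=0) + eps                                    # sum_i gamma(u_imj)
    sw = gamma_w.sum(axis=0) + eps                                    # sum_i gamma(w_imj)
    theta_mu = np.einsum('ikj,ij->kj', gamma_u, X) / su               # mu_{theta_mj}
    theta_s2 = np.einsum('ikj,ikj->kj', gamma_u,
                         (X[:, None, :] - theta_mu) ** 2) / su        # sigma^2_{theta_mj}
    lam_mu = np.einsum('ikj,ij->kj', gamma_w, X) / sw                 # mu_{lambda_mj}
    lam_s2 = np.einsum('ikj,ikj->kj', gamma_w,
                       (X[:, None, :] - lam_mu) ** 2) / sw            # sigma^2_{lambda_mj}
    rho = gamma_u.sum(axis=0) / (gamma_z.sum(axis=0)[:, None] + eps)  # rho_mj
    return pi, rho, theta_mu, theta_s2, lam_mu, lam_s2
```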

About this article

Cite this article

Guerra, L., Bielza, C., Robles, V. et al. Semi-supervised projected model-based clustering. Data Min Knowl Disc 28, 882–917 (2014). https://doi.org/10.1007/s10618-013-0323-0
