
Modeling Differences in the Dimensionality of Multiblock Data by Means of Clusterwise Simultaneous Component Analysis


Abstract

Given multivariate multiblock data (e.g., subjects nested in groups that are measured on multiple variables), one may be interested in the nature and number of dimensions that underlie the variables, and in differences in dimensional structure across data blocks. To this end, clusterwise simultaneous component analysis (SCA) was proposed, which simultaneously clusters blocks with a similar structure and performs an SCA per cluster. However, the number of components was restricted to be the same across clusters, which is often unrealistic. In this paper, this restriction is removed, and the resulting challenges with respect to model estimation and selection are resolved.

Notes

  1. This algorithm is implemented in an easy-to-use software program that can be downloaded at http://ppw.kuleuven.be/okp/software/MBCA/ (De Roover et al. 2012a).

  2. It was confirmed for the simulation study reported below that multiplying the second term of the loss function (and partition criterion) by two, as in the AIC, yields an optimal cluster recovery for 99.6% of the simulated data sets, as opposed to using another factor. In particular, multiplying fp by log(N), as in the Bayesian information criterion (BIC; Schwarz 1978), appeared to impose too heavy a penalty, in that too few data blocks were assigned to the higher-dimensional clusters.
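As a concrete illustration of this comparison, the following sketch (Python; the function name and arguments are illustrative, not the authors' implementation) computes the block-wise partition criterion in the form derived in the Appendix, with an adjustable penalty weight so that the factor 2 (AIC-like) and the factor log(N) (BIC-like) variants can be contrasted.

```python
import numpy as np

def block_criterion(sse_ik, n_i, n_var, q_k, weight=2.0):
    """Block-wise partition criterion as sketched in the Appendix: a fit term
    N_i*J*log(SSE_i^(k)) plus `weight` times the number of free parameters
    fp_i^(k) = N_i*Q^(k).  weight=2.0 is the AIC-like variant found optimal in
    this note; weight=np.log(N) is the BIC-like variant that assigned too few
    blocks to the higher-dimensional clusters."""
    fp = n_i * q_k                              # free parameters: component scores of block i
    return n_i * n_var * np.log(sse_ik) + weight * fp
```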

  3. The adapted procedure will be added to the above-mentioned software program in the near future, and the updated program will be made available at http://ppw.kuleuven.be/okp/software/MBCA/.

  4. In Step 2 of the ALS\(_{\mathrm{AIC}}\) procedure, the estimation of the SCA-ECP model per cluster is also based on the least squares estimates for the \(\mathbf{F}_{i}^{(k)}\) and \(\mathbf{B}^{(k)}\) matrices described by Timmerman and Kiers (2003), which implies that this step minimizes the SSE objective function. This is equivalent to minimizing the AIC objective function, because the number of free parameters is fixed within Step 2 and the minimal SSE corresponds to the minimal log(SSE).
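For readers unfamiliar with these least squares estimates, the sketch below (Python; a rough illustration under the equal cross-products constraint, not the code of Timmerman and Kiers, 2003, or of the authors) alternates a Procrustes-type update of each block's component scores with a regression update of the common loadings, so that the SSE cannot increase from one iteration to the next.

```python
import numpy as np

def sca_ecp(blocks, n_comp, n_iter=200, tol=1e-8, seed=0):
    """Alternating least squares sketch of an SCA-ECP model for the data
    blocks of one cluster: a common loading matrix B (J x Q) and block-specific
    component scores F_i constrained to have equal cross-products
    (F_i'F_i = N_i * I).  Both updates are least squares steps, so the total
    SSE decreases monotonically."""
    rng = np.random.default_rng(seed)
    n_var = blocks[0].shape[1]
    B = rng.standard_normal((n_var, n_comp))  # random start; in practice multiple starts are used
    prev_sse = np.inf
    for _ in range(n_iter):
        # (a) Procrustes-type update of each block's scores, given B
        scores = []
        for X in blocks:
            U, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
            scores.append(np.sqrt(X.shape[0]) * U @ Vt)
        # (b) regression update of the common loadings, given all scores
        X_all, F_all = np.vstack(blocks), np.vstack(scores)
        B = X_all.T @ F_all / sum(X.shape[0] for X in blocks)  # since F'F = (sum N_i) * I
        sse = sum(np.sum((X - F @ B.T) ** 2) for X, F in zip(blocks, scores))
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return scores, B, sse
```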

  5. We also assessed the sensitivity to local minima and the recovery of the within-cluster component structures. A sufficiently low sensitivity to local minima was established for both procedures (i.e., 5.17% and 0.29% local minima over all conditions for ALS\(_{\mathrm{SSE}}\) and ALS\(_{\mathrm{AIC}}\), respectively), and the recovery of the cluster loading matrices by the ALS\(_{\mathrm{AIC}}\) procedure was found to be very good (i.e., a mean congruence coefficient of 0.9968 (SD=0.02) between estimated and simulated loadings across all conditions). Note that previous studies on Clusterwise SCA (De Roover et al. 2012c; De Roover, Ceulemans, Timmerman, & Onghena, 2012b) have already indicated that the within-cluster component loadings are recovered very well when the data blocks are clustered correctly.
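The congruence coefficient referred to here is Tucker's congruence between estimated and simulated loadings. A minimal sketch of how such a value can be computed (Python; brute-force matching of columns, feasible only for small numbers of components, and purely illustrative) is:

```python
import numpy as np
from itertools import permutations

def mean_congruence(b_est, b_true):
    """Mean Tucker congruence between two J x Q loading matrices, maximised
    over column permutations and sign reflections (components are identified
    only up to those transformations).  Brute force, so only for small Q."""
    def phi(x, y):
        return x @ y / np.sqrt((x @ x) * (y @ y))
    q = b_true.shape[1]
    best = -np.inf
    for perm in permutations(range(q)):
        vals = [abs(phi(b_est[:, j], b_true[:, p])) for j, p in enumerate(perm)]
        best = max(best, float(np.mean(vals)))
    return best
```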

  6. The mean values of the modified RV-coefficient (Smilde, Kiers, Bijlsma, Rubingh, & van Erk, 2009) are 0.02 (SD=0.09) and 0.59 (SD=0.08), respectively.
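For reference, the modified RV-coefficient of Smilde et al. (2009) is the classical RV computed after discarding the diagonals of the cross-product matrices, which removes the upward bias of the RV for high-dimensional matrices. A minimal sketch (Python; illustrative only, not tied to the application data) is:

```python
import numpy as np

def modified_rv(X, Y):
    """Modified RV-coefficient (Smilde et al., 2009): the classical RV computed
    after setting the diagonals of the cross-product matrices XX' and YY' to
    zero.  X and Y must have the same number of rows."""
    Sx, Sy = X @ X.T, Y @ Y.T
    np.fill_diagonal(Sx, 0.0)
    np.fill_diagonal(Sy, 0.0)
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
```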

References

  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

  • Barrett, L.F. (1998). Discrete emotions or dimensions? The role of valence focus and arousal focus. Cognition and Emotion, 12, 579–599.

  • Brusco, M.J., & Cradit, J.D. (2001). A variable selection heuristic for K-means clustering. Psychometrika, 66, 249–270.

  • Brusco, M.J., & Cradit, J.D. (2005). ConPar: a method for identifying groups of concordant subject proximity matrices for subsequent multidimensional scaling analyses. Journal of Mathematical Psychology, 49, 142–154.

  • Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276.

  • Ceulemans, E., & Kiers, H.A.L. (2006). Selecting among three-mode principal component models of different types and complexities: a numerical convex hull based method. British Journal of Mathematical & Statistical Psychology, 59, 133–150.

  • Ceulemans, E., & Kiers, H.A.L. (2009). Discriminating between strong and weak structures in three-mode principal component analysis. British Journal of Mathematical & Statistical Psychology, 62, 601–620.

  • Ceulemans, E., Timmerman, M.E., & Kiers, H.A.L. (2011). The CHULL procedure for selecting among multilevel component solutions. Chemometrics and Intelligent Laboratory Systems, 106, 12–20.

  • Ceulemans, E., & Van Mechelen, I. (2005). Hierarchical classes models for three-way three-mode binary data: interrelations and model selection. Psychometrika, 70, 461–480.

  • Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107–112.

  • De Roover, K., Ceulemans, E., & Timmerman, M.E. (2012a). How to perform multiblock component analysis in practice. Behavior Research Methods, 44, 41–56.

  • De Roover, K., Ceulemans, E., Timmerman, M.E., & Onghena, P. (2012b). A clusterwise simultaneous component method for capturing within-cluster differences in component variances and correlations. British Journal of Mathematical & Statistical Psychology. doi:10.1111/j.2044-8317.2012.02040.x. Advance online publication.

  • De Roover, K., Ceulemans, E., Timmerman, M.E., Vansteelandt, K., Stouten, J., & Onghena, P. (2012c). Clusterwise simultaneous component analysis for the analysis of structural differences in multivariate multiblock data. Psychological Methods, 17, 100–119.

  • Diaz-Loving, R. (1998). Contributions of Mexican ethnopsychology to the resolution of the etic-emic dilemma in personality. Journal of Cross-Cultural Psychology, 29, 104–118.

  • Fenigstein, A., Scheier, M.F., & Buss, A. (1975). Public and private self-consciousness. Journal of Consulting and Clinical Psychology, 43, 522–527.

  • Goldberg, L.R. (1990). An alternative “description of personality”: the Big-Five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229.

  • Hands, S., & Everitt, B. (1987). A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques. Multivariate Behavioral Research, 22, 235–243.

  • Hoerl, A.E. (1962). Application of ridge analysis to regression problems. Chemical Engineering Progress, 58, 54–59.

  • Hofmans, J., Ceulemans, E., Steinley, D., & Van Mechelen, I. (2012). On the added value of bootstrap analysis for K-means clustering. Manuscript conditionally accepted.

  • Jolliffe, I.T. (1986). Principal component analysis. New York: Springer.

  • Kaiser, H.F. (1958). The Varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200.

  • Kiers, H.A.L. (1990). SCA. A program for simultaneous components analysis of variables measured in two or more populations. Groningen: iec ProGAMMA.

  • Kiers, H.A.L., & ten Berge, J.M.F. (1994). Hierarchical relations between methods for Simultaneous Components Analysis and a technique for rotation to a simple simultaneous structure. British Journal of Mathematical & Statistical Psychology, 47, 109–126.

  • McLachlan, G.J., & Peel, D. (2000). Finite mixture models. New York: Wiley.

  • Meredith, W., & Millsap, R.E. (1985). On component analyses. Psychometrika, 50, 495–507.

  • Milligan, G.W., Soon, S.C., & Sokol, L.M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40–47.

  • Nezlek, J.B. (2005). Distinguishing affective and non-affective reactions to daily events. Journal of Personality, 73, 1539–1568.

  • Nezlek, J.B. (2012). Diary methods for social and personality psychology. In J.B. Nezlek (Ed.), The SAGE library in social and personality psychology methods. London: Sage Publications.

  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.

  • Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25, 257–265.

  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

  • Selim, S.Z., & Ismail, M.A. (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81–87.

  • Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., & van Erk, M.J. (2009). Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics, 25, 401–405.

  • Steinley, D. (2003). Local optima in K-means clustering: what you don’t know may hurt you. Psychological Methods, 8, 294–304.

  • ten Berge, J.M.F. (1993). Least squares optimization in multivariate analysis. Leiden: DSWO Press.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58, 267–288.

  • Timmerman, M.E., Ceulemans, E., Kiers, H.A.L., & Vichi, M. (2010). Factorial and reduced K-means reconsidered. Computational Statistics & Data Analysis, 54, 1858–1871.

  • Timmerman, M.E., & Kiers, H.A.L. (2000). Three-mode principal component analysis: choosing the numbers of components and sensitivity to local optima. British Journal of Mathematical & Statistical Psychology, 53, 1–16.

  • Timmerman, M.E., & Kiers, H.A.L. (2003). Four simultaneous component models of multivariate time series from more than one subject to model intraindividual and interindividual differences. Psychometrika, 68, 105–122.

  • Timmerman, M.E., Kiers, H.A.L., Smilde, A.K., Ceulemans, E., & Stouten, J. (2009). Bootstrap confidence intervals in multi-level simultaneous component analysis. British Journal of Mathematical & Statistical Psychology, 62, 299–318.

  • Trapnell, P.D., & Campbell, J.D. (1999). Private self-consciousness and the five factor model of personality: distinguishing rumination from reflection. Journal of Personality and Social Psychology, 76, 284–304.

  • Tugade, M.M., Fredrickson, B.L., & Barrett, L.F. (2004). Psychological resilience and positive emotional granularity: examining the benefits of positive emotions on coping and health. Journal of Personality, 72, 1161–1190.

  • Van Deun, K., Wilderjans, T.F., van den Berg, R.A., Antoniadis, A., & Van Mechelen, I. (2011). A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics, 12, 448.

  • Van Mechelen, I., & Smilde, A.K. (2010). A generic linked-mode decomposition model for data fusion. Chemometrics and Intelligent Laboratory Systems, 104, 83–94. doi:10.1016/j.chemolab.2010.04.012.

  • Wilderjans, T.F., Ceulemans, E., Van Mechelen, I., & van den Berg, R.A. (2011). Simultaneous analysis of coupled data matrices subject to different amounts of noise. British Journal of Mathematical & Statistical Psychology, 64, 277–290.

  • Yung, Y.F. (1997). Finite mixtures in confirmatory factor-analysis models. Psychometrika, 62, 297–330.


Acknowledgements

The research reported in this paper was partially supported by the Fund for Scientific Research-Flanders (Belgium), Project No. G.0477.09, awarded to Eva Ceulemans, Marieke Timmerman, and Patrick Onghena, and by the Research Council of KU Leuven (GOA/2010/02).

Author information

Corresponding author

Correspondence to Kim De Roover.

Appendix: Derivation of an AIC-Based Partition Criterion

Conditional upon a specific Clusterwise SCA-ECP model M, the log-likelihood of data block \(\mathbf{X}_{i}\) when assigned to cluster k (and thus modeled by \(\mathbf{M}_{i}^{(k)}\)) amounts to

$$ \operatorname{loglik}\bigl(\mathbf{X}_{i}| \mathbf{M}_{i}^{(k)}\bigr) = - \frac{N_{i} J}{2}\log\bigl( 2\pi\sigma^{2} \bigr) - \frac{\mathit{SSE}_{i}^{(k)}}{2\sigma^{2}}, $$
(A.1)

which is the block-specific counterpart of Equation (7), given \(\mathit{SSE}_{i}^{(k)}\) as defined in Equation (5). When inserting \(\hat{\sigma}^{2} = \frac{\mathit{SSE}_{i}^{(k)}}{N_{i} J}\) as a post-hoc estimator of the error variance \(\sigma^{2}\) (Wilderjans et al. 2011), the log-likelihood can be rewritten as

$$ \operatorname{loglik}\bigl(\mathbf{X}_{i}| \mathbf{M}_{i}^{(k)}\bigr) = - \frac{N_{i} J}{2}\bigl[ 1 + \log( 2\pi) - \log( N_{i }J ) + \log\bigl( \mathit{SSE}_{i}^{(k)} \bigr) \bigr], $$
(A.2)

where the first three terms are not influenced by the cluster assignment and can thus be discarded. The number of free parameters for data block i, when it is tentatively assigned to cluster k, is denoted by \(\mathit{fp}_{i}^{(k)}\) and can be computed as follows:

$$ \mathit{fp}_{i}^{(k)} = N_{i}Q^{(k)}. $$
(A.3)

It corresponds to the size of the component score matrix \(\mathbf{F}_{i}^{(k)}\) that is computed to evaluate the fit of data block i in cluster k. When combining Equations (A.2) (omitting the invariant terms) and (A.3) as in the AIC (Akaike 1974), we obtain the AIC-based partition criterion in Equation (11).
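To make the resulting assignment rule concrete, the sketch below (Python; the names and the score update are illustrative assumptions, not the authors' software) evaluates the criterion derived above, \(N_{i} J\log(\mathit{SSE}_{i}^{(k)}) + 2\,\mathit{fp}_{i}^{(k)}\), for one data block against each candidate cluster and returns the cluster with the lowest value.

```python
import numpy as np

def assign_block(X_i, cluster_loadings):
    """Assign data block X_i (N_i x J) to the cluster k that minimises the
    AIC-based partition criterion N_i*J*log(SSE_i^(k)) + 2*N_i*Q^(k), i.e.
    Equation (A.2) without its invariant terms combined with (A.3).  The
    component scores per candidate cluster are least squares estimates under
    the equal cross-products constraint (an assumption of this sketch)."""
    n_i, n_var = X_i.shape
    criteria = []
    for B in cluster_loadings:              # one J x Q^(k) loading matrix per cluster
        U, _, Vt = np.linalg.svd(X_i @ B, full_matrices=False)
        F = np.sqrt(n_i) * U @ Vt           # constrained scores for block i in cluster k
        sse = np.sum((X_i - F @ B.T) ** 2)
        criteria.append(n_i * n_var * np.log(sse) + 2.0 * n_i * B.shape[1])
    return int(np.argmin(criteria)), criteria
```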


Cite this article

De Roover, K., Ceulemans, E., Timmerman, M.E. et al. Modeling Differences in the Dimensionality of Multiblock Data by Means of Clusterwise Simultaneous Component Analysis. Psychometrika 78, 648–668 (2013). https://doi.org/10.1007/s11336-013-9318-4

