Abstract
Given multivariate multiblock data (e.g., when subjects nested in groups are measured on multiple variables), one may be interested in the nature and number of dimensions that underlie the variables, and in differences in dimensional structure across data blocks. To this end, clusterwise simultaneous component analysis (SCA) was proposed, which simultaneously clusters blocks with a similar structure and performs an SCA per cluster. However, the number of components was restricted to be the same across clusters, which is often unrealistic. In this paper, this restriction is removed, and the resulting challenges with respect to model estimation and model selection are resolved.
Notes
This algorithm is implemented in an easy-to-use software program that can be downloaded at http://ppw.kuleuven.be/okp/software/MBCA/ (De Roover et al. 2012a).
For the simulation study reported below, it was confirmed that multiplying the second term of the loss function (and of the partition criterion) by two, as in the AIC (Akaike 1974), gives an optimal cluster recovery for 99.6% of the simulated data sets, whereas other factors perform worse. In particular, multiplying fp by log(N), as in the Bayesian information criterion (BIC; Schwarz 1978), appeared to impose too high a penalty, in that too few data blocks were assigned to the higher-dimensional clusters.
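To make the role of the penalty factor concrete, the sketch below evaluates a block-level criterion of the form N_i J log(SSE) + c · fp for two candidate clusters, once with c = 2 (AIC-like) and once with c = log(N) (BIC-like). The function names, the SSE values, the block dimensions, and the use of the block size as N are all illustrative assumptions rather than values or definitions from the simulation study; fp is taken as N_i times the number of components, in line with the description in the Appendix.

```python
import math

def partition_criterion(sse, n_rows, n_vars, n_components, penalty):
    """Penalized fit of one data block in one candidate cluster.

    Hypothetical helper: n_rows * n_vars * log(SSE) + penalty * fp,
    with fp = n_rows * n_components (size of the component score matrix).
    """
    fp = n_rows * n_components
    return n_rows * n_vars * math.log(sse) + penalty * fp

def best_cluster(candidates, n_rows, n_vars, penalty):
    """Index of the (sse, n_components) candidate with the lowest criterion."""
    values = [partition_criterion(sse, n_rows, n_vars, q, penalty)
              for sse, q in candidates]
    return values.index(min(values))

# Made-up block: 50 observations on 12 variables.
# Candidate 0: two components, SSE = 300; candidate 1: four components, SSE = 200.
candidates = [(300.0, 2), (200.0, 4)]

aic_choice = best_cluster(candidates, 50, 12, penalty=2.0)
bic_choice = best_cluster(candidates, 50, 12, penalty=math.log(50))
print(aic_choice, bic_choice)
```

With these made-up numbers the AIC-like factor selects the four-component cluster, while the heavier log(N) factor pushes the same block into the two-component cluster, which is the kind of behavior the note describes.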
The adapted procedure will be added to the above-mentioned software program in the near future, and the updated program will be made available at http://ppw.kuleuven.be/okp/software/MBCA/.
In Step 2 of the \(\mathrm{ALS}_{\mathrm{AIC}}\) procedure, the estimation of the SCA-ECP model per cluster is also based on the least squares estimates for the \(\mathbf{F}_{i}^{(k)}\) and \(\mathbf{B}^{(k)}\) matrices described by Timmerman and Kiers (2003), which implies that this step minimizes the SSE objective function. This is equivalent to minimizing the AIC objective function, because the number of free parameters is fixed within Step 2 and, the logarithm being strictly increasing, the minimal SSE corresponds to the minimal log(SSE).
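The equivalence stated in this note can be checked numerically. The sketch below assumes an AIC-type objective of the generic form n · log(SSE) + 2 · fp; the function name, the SSE values, n, and fp are hypothetical and serve only to show that, with fp held fixed, the solution with the lowest SSE is also the one with the lowest AIC-type value.

```python
import math

def aic_objective(sse, n_obs, n_free_params):
    """Hypothetical AIC-type objective: n_obs * log(SSE) + 2 * fp."""
    return n_obs * math.log(sse) + 2 * n_free_params

# Made-up SSE values for candidate solutions that all share the same fp.
sse_values = [412.5, 388.1, 395.7, 401.2]
n_obs, fp = 600, 148

aic_values = [aic_objective(s, n_obs, fp) for s in sse_values]

# Because log is strictly increasing and the penalty term is constant,
# both objectives are minimized by the same candidate.
argmin_sse = sse_values.index(min(sse_values))
argmin_aic = aic_values.index(min(aic_values))
print(argmin_sse, argmin_aic)
```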
We also assessed the sensitivity to local minima and the recovery of the within-cluster component structures. The sensitivity to local minima was sufficiently low for both procedures (i.e., 5.17% and 0.29% local minima over all conditions for \(\mathrm{ALS}_{\mathrm{SSE}}\) and \(\mathrm{ALS}_{\mathrm{AIC}}\), respectively), and the recovery of the cluster loading matrices was very good for the \(\mathrm{ALS}_{\mathrm{AIC}}\) procedure (i.e., a mean congruence coefficient of 0.9968, SD = 0.02, between estimated and simulated loadings across all conditions). Note that previous studies on Clusterwise SCA (De Roover et al. 2012b, 2012c) have already indicated that the within-cluster component loadings are recovered very well when the data blocks are clustered correctly.
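For readers unfamiliar with the congruence coefficient, the following sketch computes Tucker's congruence between corresponding columns of two loading matrices and averages it over components. It assumes that the estimated components have already been rotated, permuted, and reflected to match the simulated ones; that matching step is not shown, and the function names and example loadings are hypothetical.

```python
import math

def tucker_congruence(x, y):
    """Tucker's congruence: x'y / sqrt((x'x) * (y'y))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def mean_congruence(estimated, simulated):
    """Mean congruence over components.

    Both arguments are loading matrices given as lists of rows
    (variables x components), with columns assumed to be matched already.
    """
    n_components = len(simulated[0])
    phis = []
    for q in range(n_components):
        est_col = [row[q] for row in estimated]
        sim_col = [row[q] for row in simulated]
        phis.append(tucker_congruence(est_col, sim_col))
    return sum(phis) / n_components

# Made-up loadings: four variables, two components.
simulated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
estimated = [[0.9, 0.1], [1.1, -0.1], [0.0, 1.0], [0.1, 0.9]]
print(round(mean_congruence(estimated, simulated), 4))
```

A value of 1 indicates loadings that are identical up to a positive scaling factor, so values close to 1 indicate near-perfect recovery.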
The mean values for the modified RV-coefficient (Smilde et al. 2009) are 0.02 (SD = 0.09) and 0.59 (SD = 0.08), respectively.
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Barrett, L.F. (1998). Discrete emotions or dimensions? The role of valence focus and arousal focus. Cognition and Emotion, 12, 579–599.
Brusco, M.J., & Cradit, J.D. (2001). A variable selection heuristic for K-means clustering. Psychometrika, 66, 249–270.
Brusco, M.J., & Cradit, J.D. (2005). ConPar: a method for identifying groups of concordant subject proximity matrices for subsequent multidimensional scaling analyses. Journal of Mathematical Psychology, 49, 142–154.
Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276.
Ceulemans, E., & Kiers, H.A.L. (2006). Selecting among three-mode principal component models of different types and complexities: a numerical convex hull based method. British Journal of Mathematical & Statistical Psychology, 59, 133–150.
Ceulemans, E., & Kiers, H.A.L. (2009). Discriminating between strong and weak structures in three-mode principal component analysis. British Journal of Mathematical & Statistical Psychology, 62, 601–620.
Ceulemans, E., Timmerman, M.E., & Kiers, H.A.L. (2011). The CHULL procedure for selecting among multilevel component solutions. Chemometrics and Intelligent Laboratory Systems, 106, 12–20.
Ceulemans, E., & Van Mechelen, I. (2005). Hierarchical classes models for three-way three-mode binary data: interrelations and model selection. Psychometrika, 70, 461–480.
Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107–112.
De Roover, K., Ceulemans, E., & Timmerman, M.E. (2012a). How to perform multiblock component analysis in practice. Behavior Research Methods, 44, 41–56.
De Roover, K., Ceulemans, E., Timmerman, M.E., & Onghena, P. (2012b). A clusterwise simultaneous component method for capturing within-cluster differences in component variances and correlations. British Journal of Mathematical & Statistical Psychology. doi:10.1111/j.2044-8317.2012.02040.x. Advance online publication.
De Roover, K., Ceulemans, E., Timmerman, M.E., Vansteelandt, K., Stouten, J., & Onghena, P. (2012c). Clusterwise simultaneous component analysis for the analysis of structural differences in multivariate multiblock data. Psychological Methods, 17, 100–119.
Diaz-Loving, R. (1998). Contributions of Mexican ethnopsychology to the resolution of the etic-emic dilemma in personality. Journal of Cross-Cultural Psychology, 29, 104–118.
Fenigstein, A., Scheier, M.F., & Buss, A. (1975). Public and private self-consciousness. Journal of Consulting and Clinical Psychology, 43, 522–527.
Goldberg, L.R. (1990). An alternative “description of personality”: the Big-Five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229.
Hands, S., & Everitt, B. (1987). A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques. Multivariate Behavioral Research, 22, 235–243.
Hoerl, A.E. (1962). Application of ridge analysis to regression problems. Chemical Engineering Progress, 58, 54–59.
Hofmans, J., Ceulemans, E., Steinley, D., & Van Mechelen, I. (2012). On the added value of bootstrap analysis for K-means clustering. Manuscript conditionally accepted.
Jolliffe, I.T. (1986). Principal component analysis. New York: Springer.
Kaiser, H.F. (1958). The Varimax criterion for analytic rotation in factor analysis. Psychometrika, 23, 187–200.
Kiers, H.A.L. (1990). SCA. A program for simultaneous components analysis of variables measured in two or more populations. Groningen: iec ProGAMMA.
Kiers, H.A.L., & ten Berge, J.M.F. (1994). Hierarchical relations between methods for Simultaneous Components Analysis and a technique for rotation to a simple simultaneous structure. British Journal of Mathematical & Statistical Psychology, 47, 109–126.
McLachlan, G.J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Meredith, W., & Millsap, R.E. (1985). On component analyses. Psychometrika, 50, 495–507.
Milligan, G.W., Soon, S.C., & Sokol, L.M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40–47.
Nezlek, J.B. (2005). Distinguishing affective and non-affective reactions to daily events. Journal of Personality, 73, 1539–1568.
Nezlek, J.B. (2012). Diary methods for social and personality psychology. In J.B. Nezlek (Ed.), The SAGE library in social and personality psychology methods. London: Sage Publications.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559–572.
Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25, 257–265.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Selim, S.Z., & Ismail, M.A. (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81–87.
Smilde, A.K., Kiers, H.A.L., Bijlsma, S., Rubingh, C.M., & van Erk, M.J. (2009). Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics, 25, 401–405.
Steinley, D. (2003). Local optima in K-means clustering: what you don’t know may hurt you. Psychological Methods, 8, 294–304.
ten Berge, J.M.F. (1993). Least squares optimization in multivariate analysis. Leiden: DSWO Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58, 267–288.
Timmerman, M.E., Ceulemans, E., Kiers, H.A.L., & Vichi, M. (2010). Factorial and reduced K-means reconsidered. Computational Statistics & Data Analysis, 54, 1858–1871.
Timmerman, M.E., & Kiers, H.A.L. (2000). Three-mode principal component analysis: choosing the numbers of components and sensitivity to local optima. British Journal of Mathematical & Statistical Psychology, 53, 1–16.
Timmerman, M.E., & Kiers, H.A.L. (2003). Four simultaneous component models of multivariate time series from more than one subject to model intraindividual and interindividual differences. Psychometrika, 68, 105–122.
Timmerman, M.E., Kiers, H.A.L., Smilde, A.K., Ceulemans, E., & Stouten, J. (2009). Bootstrap confidence intervals in multi-level simultaneous component analysis. British Journal of Mathematical & Statistical Psychology, 62, 299–318.
Trapnell, P.D., & Campbell, J.D. (1999). Private self-consciousness and the five factor model of personality: distinguishing rumination from reflection. Journal of Personality and Social Psychology, 76, 284–304.
Tugade, M.M., Fredrickson, B.L., & Barrett, L.F. (2004). Psychological resilience and positive emotional granularity: examining the benefits of positive emotions on coping and health. Journal of Personality, 72, 1161–1190.
Van Deun, K., Wilderjans, T.F., van den Berg, R.A., Antoniadis, A., & Van Mechelen, I. (2011). A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics, 12, 448.
Van Mechelen, I., & Smilde, A.K. (2010). A generic linked-mode decomposition model for data fusion. Chemometrics and Intelligent Laboratory Systems, 104, 83–94. doi:10.1016/j.chemolab.2010.04.012.
Wilderjans, T.F., Ceulemans, E., Van Mechelen, I., & van den Berg, R.A. (2011). Simultaneous analysis of coupled data matrices subject to different amounts of noise. British Journal of Mathematical & Statistical Psychology, 64, 277–290.
Yung, Y.F. (1997). Finite mixtures in confirmatory factor-analysis models. Psychometrika, 62, 297–330.
Acknowledgements
The research reported in this paper was partially supported by the Fund for Scientific Research-Flanders (Belgium), Project No. G.0477.09, awarded to Eva Ceulemans, Marieke Timmerman, and Patrick Onghena, and by the Research Council of KU Leuven (GOA/2010/02).
Appendix: Derivation of an AIC-Based Partition Criterion
Conditional upon a specific Clusterwise SCA-ECP model M, the log-likelihood of data block \(\mathbf{X}_{i}\) when assigned to cluster k (and thus modeled by \(\mathbf{M}_{i}^{(k)}\)) amounts to
which is the block-specific counterpart of Equation (7), given \(\mathit{SSE}_{i}^{(k)}\) as defined in Equation (5). When inserting \(\hat{\sigma}^{2} = \frac{\mathit{SSE}_{i}^{(k)}}{N_{i} J}\) as a post-hoc estimator of the error variance \(\sigma^{2}\) (Wilderjans et al. 2011), the log-likelihood can be rewritten as
where the first three terms are not influenced by the cluster assignment and can thus be discarded. The number of free parameters for data block i, when it is tentatively assigned to cluster k, is denoted by \(\mathit{fp}_{i}^{(k)}\) and can be computed as follows:
It corresponds to the size of the component score matrix \(\mathbf{F}_{i}^{(k)}\) that is computed to evaluate the fit of data block i in cluster k. When combining Equations (A.2) (omitting the invariant terms) and (A.3) as in the AIC (Akaike 1974), we obtain the AIC-based partition criterion in Equation (11).
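For reference, the steps described in this appendix can be written out as follows. This is a sketch rather than a verbatim reproduction: it assumes independent, normally distributed residuals with common variance \(\sigma^{2}\) (the usual assumption behind an SSE-based log-likelihood), it uses \(Q^{(k)}\) as an assumed symbol for the number of components in cluster k, and the labels (A.1) to (A.3) are chosen to match the equation references in the text.

\[
\log L\bigl(\mathbf{X}_{i} \mid \mathbf{M}_{i}^{(k)}\bigr) = -\frac{N_{i} J}{2}\log(2\pi) - \frac{N_{i} J}{2}\log\bigl(\sigma^{2}\bigr) - \frac{\mathit{SSE}_{i}^{(k)}}{2\sigma^{2}} \tag{A.1}
\]

Substituting \(\hat{\sigma}^{2} = \mathit{SSE}_{i}^{(k)} / (N_{i} J)\) turns the last term into \(-N_{i} J / 2\) and splits the logarithm, which gives

\[
\log L\bigl(\mathbf{X}_{i} \mid \mathbf{M}_{i}^{(k)}\bigr) = -\frac{N_{i} J}{2}\log(2\pi) - \frac{N_{i} J}{2} + \frac{N_{i} J}{2}\log(N_{i} J) - \frac{N_{i} J}{2}\log\bigl(\mathit{SSE}_{i}^{(k)}\bigr) \tag{A.2}
\]

Only the fourth term depends on k. With the number of free parameters equal to the number of entries of the \(N_{i} \times Q^{(k)}\) component score matrix,

\[
\mathit{fp}_{i}^{(k)} = N_{i}\, Q^{(k)} \tag{A.3}
\]

the AIC combination \(-2 \log L + 2\,\mathit{fp}\), with the invariant terms dropped, would yield a partition criterion of the form

\[
N_{i} J \log\bigl(\mathit{SSE}_{i}^{(k)}\bigr) + 2\, N_{i}\, Q^{(k)}
\]

so that a block is assigned to the cluster for which this value is smallest.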
Cite this article
De Roover, K., Ceulemans, E., Timmerman, M.E. et al. Modeling Differences in the Dimensionality of Multiblock Data by Means of Clusterwise Simultaneous Component Analysis. Psychometrika 78, 648–668 (2013). https://doi.org/10.1007/s11336-013-9318-4
DOI: https://doi.org/10.1007/s11336-013-9318-4