Introduction

In the behavioral sciences, researchers often gather multivariate multiblock data—that is, multiple data blocks, each of which contains the scores of a different set of observations on the same set of variables. For example, one can think of multivariate data from different groups of subjects (e.g., inhabitants of different countries). In that case, the groups (e.g., countries) constitute the separate data blocks. Another example is data from multiple subjects who have scored the same variables on multiple measurement occasions (also called multioccasion–multisubject data; see Kroonenberg, 2008). In such data, the data blocks correspond to the different subjects.

Both the observations in the data blocks and the data blocks themselves can be either fixed or random. For instance, in the case of multioccasion–multisubject data, the data blocks are considered fixed when the researcher is interested in the specific subjects in the study and random when one aims at generalizing the conclusions to a larger population of subjects. In the latter case, a representative sample of subjects is needed to justify the generalization. When the observations are random and the data blocks fixed, multiblock data are referred to as multigroup data (Jöreskog, 1971) in the literature. When both observations and data blocks are random, the data are called multilevel (e.g., Maas & Hox, 2005; Muthén, 1994; Snijders & Bosker, 1999).

Researchers are often interested in the underlying structure of such multiblock data. For instance, given multioccasion–multisubject scores on a number of emotions, one may wonder whether certain emotions covary across the measurement occasions of separate subjects or fluctuate independently of one another, and whether and how this structure is similar across subjects (i.e., across the data blocks). For capturing the structure in multiblock data, a number of component analysis and factor analysis methods are available. Component analysis and factor analysis differ strongly with respect to their theoretical underpinnings, but both of them model the variation in the variables by a smaller number of constructed variables—called components and factors, respectively—which are based on the covariance structure of the observed variables. Which component or factor analysis method is the most appropriate depends on the research question at hand. For well-defined, confirmatory questions, factor analysis is usually most appropriate. For exploratory analysis of data that may have an intricate structure, as is often the case for multiblock data, component analysis is generally most appropriate.

To test specific hypotheses about the underlying structure, structural equation modeling (SEM; Haavelmo, 1943; Kline, 2004) is commonly used. SEM is applied, for example, to test whether the items of a questionnaire measure the theoretical constructs under study (Floyd & Widaman, 1995; Keller et al., 1998; Novy et al., 1994). Moreover, multigroup SEM (Jöreskog, 1971; Kline, 2004; Sörbom, 1974) allows testing different levels of factorial invariance among the data blocks (e.g., Lee & Lam, 1988), going from weak invariance (i.e., same factor loadings for all data blocks) to strict invariance (i.e., intercepts, factor loadings, and unique variances equal across data blocks).

When there are no a priori hypotheses about the underlying structure, one may resort to component analysis or exploratory factor analysis (EFA). We will first discuss a family of component methods that explicitly focus on capturing structural differences between the data blocks and then, briefly, the family of factor analysis methods. Note that many other multiblock component methods exist that focus, for example, on redundancy (Escofier & Pagès, 1998) or on modeling block structured covariance matrices (Flury & Neuenschwander, 1995; Klingenberg, Neuenschwander, & Flury, 1996).

If one expects the structure of each of the data blocks to be different, standard principal component analysis (PCA; Jolliffe, 2002; Pearson, 1901) can be performed on each data block. If one expects the structure to be identical across the data blocks, simultaneous component analysis (SCA; Kiers, 1990; Kiers & ten Berge, 1994b; Timmerman & Kiers, 2003; Van Deun, Smilde, van der Werf, Kiers, & Van Mechelen, 2009) can be applied, which reduces the data of all blocks at once to find one common component structure. Finally, if one presumes that subgroups of the data blocks exist that share the same structure, one may conduct clusterwise simultaneous component analysis (clusterwise SCA-ECP, where ECP stands for equal cross-product constraints on the component scores of the data blocks; De Roover, Ceulemans, Timmerman, Vansteelandt, et al., in press; Timmerman & Kiers, 2003). This method simultaneously searches for the best clustering of the data blocks and for the best fitting SCA-ECP model within each cluster. This flexible and generic approach encompasses separate PCA and SCA-ECP as special cases.

For the separate PCA and SCA-ECP approaches, similar factor-analytic approaches exist, which are specific instances of exploratory structural equation modeling (Asparouhov & Muthén, 2009; Dolan, Oort, Stoel, & Wicherts, 2009; Lawley & Maxwell, 1962). While component and factor analyses differ strongly with respect to their theoretical backgrounds, they often give comparable solutions in practice (Velicer & Jackson, 1990a, b). However, no factor-analytic counterpart exists for the clusterwise SCA-ECP method.

While plenty of software is available for the factor-analytic approaches (e.g., LISREL, Dolan, Bechger, & Molenaar, 1999; Jöreskog & Sörbom, 1999; Mplus, Muthén, & Muthén, 2007; and Mx, Neale, Boker, Xie, & Maes, 2003), no easy-to-use software program exists for applying the multiblock component analysis methods described above. Thus, although the component methods are potentially very useful for substantive researchers (e.g., De Leersnyder & Mesquita, 2010; McCrae & Costa, 1997; Pastorelli, Barbaranelli, Cermak, Rozsa, & Caprara, 1997), it might be difficult for researchers to apply them. In this article, we describe software for fitting separate PCAs, SCA-ECP, and clusterwise SCA-ECP models. This MultiBlock Component Analysis (MBCA) software (Fig. 1) can be downloaded from http://ppw.kuleuven.be/okp/software/MBCA/. The program is based on MATLAB code, but it can also be used by researchers who do not have MATLAB at their disposal. Specifically, two versions of the software can be downloaded: one for use within the MATLAB environment and a stand-alone application that can be run on any Windows computer. The program includes a model selection procedure and can handle missing data.

The remainder of the article is organized in three sections. In Section Multiblock component analysis, we first discuss multiblock data, how to preprocess them, and how to deal with missing data. Subsequently, we discuss clusterwise SCA-ECP as a generic modeling approach that comprises separate PCAs and SCA-ECP as special cases. Finally, we describe the different data analysis steps: checking data requirements, running the analysis, and model selection. The clusterwise SCA-ECP approach is illustrated by means of an empirical example. Section Multiblock component analysis program describes how to use the MBCA software. Section Conclusion offers a general conclusion.

Multiblock component analysis

Data structure, preprocessing, and missing values

In this section, we first describe the data structure that is required by the multiblock component methods under study. Second, the preprocessing of the data is discussed. Third, the problem of missing values is briefly reviewed, since an important feature of the MBCA software is that it can handle missing data.

Data structure

Clusterwise SCA-ECP, as well as SCA-ECP and separate PCAs, is applicable to all kinds of multivariate multiblock data—that is, data that consist of I data blocks \( \mathbf{X}_i \) (N_i × J) containing the scores of N_i observations on J variables, where the number of observations N_i (i = 1, . . . , I) may differ between data blocks. These I data blocks can be concatenated into an N (observations) × J (variables) data matrix X, where \( N = \sum\limits_{i=1}^{I} N_i \). More specific requirements (e.g., the minimal number of observations in each data block) will be discussed in the Checking data requirements section.

As an example, consider the following empirical data set from emotion research, which will be used throughout the article. Emotional granularity refers to the degree to which a subject differentiates between negative and positive emotions (Barrett, 1998); that is, subjects who score high on emotional granularity describe their emotions in a more fine-grained way than subjects scoring low. To study emotional granularity, 42 subjects were asked to rate on a 7-point scale the extent to which 22 target persons (e.g., mother, father, partner, . . . ) elicited 16 negative emotions, where the selected target persons obviously differ across subjects. Thus, one may conceive these data as consisting of 42 data blocks \( \mathbf{X}_i \), one for each subject, where each data block holds the ratings of the 16 negative emotions for the 22 target persons selected by subject i. Note that, in this case, the number of observations N_i is the same for all data blocks, but this is not necessary for the application of any of the three component methods considered. The data blocks X_1, . . . , X_42 can be concatenated below each other, resulting in a 924 × 16 data matrix X.
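To fix ideas, this layout can be mimicked in a few lines of MATLAB. This is merely an illustration with random numbers; the names Xblocks and X are our conventions, not part of the MBCA program.

```matlab
% Illustration of the multiblock data layout, mirroring the emotion example:
% 42 data blocks of 22 observations on 16 variables, held in a cell array and
% concatenated below each other into a single 924 x 16 data matrix.
I = 42;
Xblocks = cell(I, 1);
for i = 1:I
    Xblocks{i} = randn(22, 16);   % placeholder for the ratings of subject i
end
X = vertcat(Xblocks{:});          % N x J, with N the sum of all N_i (here 924)
```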

Preprocessing

Before applying any of the multiblock component methods, one may consider whether or not the data should be preprocessed. Since we focus on differences and similarities in within-block correlational structures, we disregard between-block differences in variable means and in variances. Note that variants of the PCA and SCA methods exist in which the differences in means (Timmerman, 2006) and variances (De Roover, Ceulemans, Timmerman, & Onghena, 2011; Timmerman & Kiers, 2003) are explicitly modeled. To eliminate the differences in variable means and variances, the data are centered and standardized per data block. This type of preprocessing, which is implemented in the MBCA software, is commonly denoted as autoscaling (Bro & Smilde, 2003). The standardization also removes arbitrary differences in measurement scale between the variables.
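As an illustration, autoscaling may be sketched in MATLAB as follows, assuming complete data and the cell array layout introduced above; the function name is ours.

```matlab
% Autoscale each data block: center and standardize every variable per block,
% so each variable has a mean of zero and a variance of one within each block.
function Xblocks = autoscale(Xblocks)
for i = 1:numel(Xblocks)
    Xi = Xblocks{i};
    Ni = size(Xi, 1);
    m  = mean(Xi, 1);                  % per-variable means within block i
    s  = std(Xi, 1, 1);                % per-variable SDs, normalized by Ni
    Xblocks{i} = (Xi - repmat(m, Ni, 1)) ./ repmat(s, Ni, 1);
end
end
```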

Missing values

In practice, data points may be missing. For instance, in our emotion data, 4% of the data points are missing, because some subjects neglected to rate certain emotions for some of their target persons. To judge the generalizability of the results obtained, one has to consider the method for dealing with the missing data in the analysis and the mechanism(s) that plausibly caused the missing data. To start with the latter, Rubin distinguished between "missing completely at random" (MCAR), "missing at random" (MAR), and "not missing at random" (NMAR) (Little & Rubin, 2002; Rubin, 1976). MCAR means that the missingness is related neither to observed nor to unobserved data. When data are MAR, the missingness may depend on observed variables in the data set but is unrelated to the unobserved values themselves. NMAR refers to missingness that depends on the values of unobserved variables.

To deal with missing data in the analysis, we advocate the use of imputation. Imputation is much more favorable than the simplest alternative—namely, to discard all observations that have at least one missing value. The latter may result in large losses of information (Kim & Curry, 1977; Stumpf, 1978) and requires the missing data to be MCAR. In contrast, imputation requires the missing data to be MAR, implying that it is more widely applicable. The procedure for missing data imputation in multiblock component analysis is described in the Missing data imputation section and is implemented in the MBCA software.

The clusterwise SCA-ECP model

A clusterwise SCA-ECP model for a multiblock data matrix X consists of three ingredients: an I × K binary partition matrix P, which represents how the I data blocks are grouped into K mutually exclusive clusters; K cluster loading matrices B^k (J × Q), which indicate how the J variables are reduced to Q components for all the data blocks that belong to cluster k; and an N_i × Q component score matrix F_i for each data block. Figure 2 presents the partition matrix P and the cluster loading matrices B^k, and Table 1 presents the component score matrix F_2 (of subject 2) of a clusterwise SCA-ECP model with three clusters and two components for our emotion data. The partition matrix P shows that 15 subjects are assigned to the first cluster (i.e., these 15 subjects have a one in the first column and zeros in the other columns), while the second and third clusters contain 14 and 13 subjects, respectively.

Fig. 1 Interface of the MultiBlock Component Analysis software

Fig. 2 Output file for the clusterwise SCA-ECP analysis of the emotion data, showing the partition matrix and the orthogonally rotated cluster loading matrices for the model with three clusters and two components. The components in cluster 1 can be labeled "negative affect" and "jealousy," the components in cluster 2 can be interpreted as "cold dislike" and "sadness," and the ones in cluster 3 as "hot dislike" and "low self-esteem," respectively

Table 1 Component scores for subject 2 from the emotion data, given a clusterwise SCA-ECP model with three clusters and two components. Note that subject 2 is assigned to the first cluster

The cluster loading matrices B^k in Fig. 2 display the component structure for each of the subject clusters. Because we analyzed autoscaled data and have orthogonal components, the loadings can be interpreted as correlations between the variables and components. Each component of a cluster loading matrix can be interpreted by considering the common content of the variables that load highly positively or negatively on that component (e.g., loadings with an absolute value greater than .50). Specifically, for cluster 1, the first component can be labeled negative affect, since virtually all negative emotions have a high positive loading on this component. The second component of this cluster is named jealousy, due to the high loading of jealous. For cluster 2, the first component is termed cold dislike, since it consists of the negative emotions that are experienced when feeling a dislike for someone, without feeling sad about this. The second component is made up of negative feelings that arise when feeling sad; therefore, it is named sadness. The first component of cluster 3 has some similarities to the first component of cluster 2, but with the important difference that additional emotions load highly on this component—that is, uneasy, miserable, and sad. Thus, it seems that for this cluster, dislike is more emotionally charged; therefore, it is labeled hot dislike. The second component of cluster 3 can be interpreted as a low self-esteem component, because it consists of negative emotions that stem from a low feeling of self-worth. We can conclude that the clusters differ strongly from one another in the nature of the dimensions that underlie the negative emotions. On the basis of these results, we can hypothesize that the subjects in cluster 1 are the ones with the least granular emotional experience toward the target persons, since most negative emotions strongly co-occur for these subjects. The subjects in clusters 2 and 3 seem to display a higher granularity in their emotional ratings, since they differentiate between feelings of dislike and feelings of sadness (cluster 2) or low self-esteem (cluster 3) toward the target persons.

To evaluate whether the structural differences between the three subject clusters may indeed be interpreted as differences in emotional granularity, we related the cluster membership to the average intraclass correlation coefficients (ICCs; Shrout & Fleiss, 1979; Tugade, Fredrickson, & Barrett, 2004) measuring absolute agreement, which were calculated across the negative emotions for each subject. The three clusters differ significantly, F(2) = 5.12, p = .01, with mean ICCs of .91 (SD = .07), .82 (SD = .10), and .88 (SD = .05) for clusters 1–3, respectively. Since higher ICC values indicate lower granularity, cluster 1 contains the least granular subjects, while cluster 2 contains the most granular subjects.

In Table 1, the component score matrix for subject 2 is presented. Since subject 2 belongs to cluster 1 (see the partition matrix in Fig. 2), we can derive how this subject feels about each of the 22 target persons in terms of negative affect and jealousy. For instance, it can be read that target person 15 (disliked person 3) elicits the most negative affect in subject 2 (i.e., a score of 1.93 on the first component) and that this subject has the strongest feelings of jealousy (i.e., a score of 2.73 on the second component) toward target person 14 (disliked person 2).

To reconstruct the observed scores in each data block X i , the information in the three types of matrices is combined as follows:

$$ \mathbf{X}_i = \sum\limits_{k=1}^{K} p_{ik}\,\mathbf{F}_i^k\,{\mathbf{B}^k}^{\prime} + \mathbf{E}_i, $$
(1)

where p_{ik} denotes the entries of the partition matrix P, \( \mathbf{F}_i^k \) is the component score matrix for data block i when assigned to cluster k, and E_i (N_i × J) denotes the matrix of residuals. Since data block i is assigned to one cluster only, the index k in \( \mathbf{F}_i^k \) is mostly omitted in the remainder of this article, for reasons of parsimony. For example, to reconstruct the data block for subject 2, we read in the partition matrix (Fig. 2) that this subject is assigned to the first cluster and, subsequently, multiply the component scores in Table 1 by the component loadings of cluster 1 in Fig. 2 (a minimal code sketch is given below). It can be concluded that the separate PCA and SCA-ECP strategies for multiblock data are indeed special cases of the clusterwise SCA-ECP model. On the one hand, when K, the number of clusterwise SCA-ECP clusters, is equal to I, the number of data blocks, the model boils down to separate PCAs with an equal number of components for each data block. On the other hand, when K equals one, all data blocks belong to the same cluster, and the clusterwise SCA-ECP model reduces to a regular SCA-ECP model.
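In MATLAB notation, this reconstruction may be sketched as follows, where P, F, and B hold the partition matrix, the component score matrices, and the cluster loading matrices under our cell array conventions.

```matlab
% Reconstruct data block i from a fitted clusterwise SCA-ECP model (Eq. 1).
k     = find(P(i, :));           % the single cluster to which block i belongs
Xhati = F{i} * B{k}';            % model part: component scores times loadings
Ei    = Xblocks{i} - Xhati;      % residuals E_i
```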

In clusterwise SCA-ECP, the columns of each component score matrix F_i are restricted to have a variance of one, and the correlations between the columns of F_i (i.e., between the cluster-specific components) must be equal for all data blocks that are assigned to the same cluster. With respect to the latter restriction, note that the parameter estimates of an SCA-ECP solution have rotational freedom. Thus, to obtain components that are easier to interpret, the components of a clusterwise SCA-ECP solution can be freely rotated within each cluster without altering the fit of the solution, provided that the corresponding component scores are counterrotated (as illustrated in the sketch below). For instance, the cluster loading matrices in Fig. 2 were obtained by means of an orthogonal normalized varimax rotation (Kaiser, 1958). When an oblique rotation is applied, the cluster-specific components become correlated to some extent. In that case, the loadings should not be read as correlations, but they can still be interpreted as weights that indicate the extent to which each variable is influenced by the respective components.
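The rotational freedom can be verified numerically. The sketch below applies a normalized varimax rotation with the rotatefactors function of the MATLAB Statistics Toolbox and counterrotates the scores; the reconstructed data, and hence the fit, remain unchanged.

```matlab
% Varimax-rotate the loadings of cluster k and counterrotate the scores of a
% block i in that cluster; T is orthogonal, so T * T' equals the identity.
[Brot, T] = rotatefactors(B{k}, 'Method', 'varimax');   % Brot = B{k} * T
Frot = F{i} * T;                                        % counterrotated scores
% Frot * Brot' equals F{i} * B{k}' up to rounding error, so the fit is unchanged.
```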

Steps to take when performing multiblock component analysis

When applying one of the multiblock component methods in practice, three steps have to be taken: checking the data requirements, running the analysis, and selecting the model. In the following subsections, each of these steps will be discussed in more detail.

Checking data requirements

As a first step, one needs to check whether the different data blocks contain a sufficient number of observations, whether the data have been preprocessed adequately, and whether and which data are missing. For the different component models to be identified, the number of observations N i in the data blocks should always be larger than the number of components Q to be fitted. Moreover, when the observations in the data blocks and/or the data blocks themselves are a random sample, this sample needs to be sufficiently large and representative; otherwise, the generalizability of the obtained results is questionable.

With respect to preprocessing, as discussed in the Preprocessing section, it is often advisable to autoscale all data blocks, which is done automatically by the MBCA program. However, autoscaling is not possible when a variable displays no variance within one or more data blocks, which may occur in empirical data. For instance, for our emotion example, it is conceivable that some subjects rate a certain negative emotion to be absent for all target persons. In such cases, one of the following options can be considered. First, one may remove the variables that are invariant for one or more data blocks. Second, one may discard the data blocks for which one or more variables are invariant. When many variables or data blocks are omitted, this leads to a great loss of data, however. Therefore, a third option, which is also provided in the MBCA software, is to replace the invariant scores by zeros, implying that the variables in question have a mean of zero but also a variance of zero in some of the data blocks. This strategy has the disadvantage that the interpretation of the component loadings becomes less straightforward. Specifically, even the loadings on orthogonal components can no longer be interpreted as correlations between the variables and the respective components.

With respect to missing data, when the multiblock data contain missing values, it seems straightforward to autoscale each variable with respect to the nonmissing data entries only. This way, the nonmissing values for each variable will have a mean of zero and a variance of one per data block, regardless of any assumed or imputed values for the missing data. It may also be wise to remove variables that are missing completely within certain data blocks (i.e., an entire column of a data block is missing), since such missingness patterns are rather likely to be NMAR and, hence, to yield biased analysis results.
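The sketch below extends the autoscaling function given earlier to both complications just discussed: means and standard deviations are computed over the nonmissing entries only (coded here as NaN, for illustration), and invariant variables are replaced by zeros, as in the third option described above.

```matlab
% Autoscale one data block when some entries are missing (NaN) and some
% variables may be invariant within the block.
function Xi = autoscale_block(Xi)
for j = 1:size(Xi, 2)
    obs = ~isnan(Xi(:, j));                 % nonmissing entries of variable j
    m   = mean(Xi(obs, j));
    s   = std(Xi(obs, j), 1);               % SD over the nonmissing entries only
    if s > 0
        Xi(obs, j) = (Xi(obs, j) - m) / s;  % autoscale the observed scores
    else
        Xi(obs, j) = 0;                     % invariant variable: replace by zeros
    end
end
end
```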

Running the analysis

The second step consists of performing the multiblock component analysis, with the appropriate number of components Q and number of clusters K in case of clusterwise SCA-ECP. Given (K and) Q, the aim of the analysis is to find a solution that minimizes the following loss function:

$$ L = \sum\limits_{i=1}^{I} \left\| \mathbf{X}_i - \hat{\mathbf{X}}_i \right\|^2, $$
(2)

where \( \hat{\mathbf{X}}_i \) equals \( \sum\limits_{k=1}^{K} p_{ik}\,\mathbf{F}_i^k\,{\mathbf{B}^k}^{\prime} \), \( \mathbf{F}_i \mathbf{B}_i^{\prime} \), and \( \mathbf{F}_i \mathbf{B}^{\prime} \) for clusterwise SCA-ECP, separate PCAs, and SCA-ECP, respectively. In case the data contain missing values, N_i × J binary weight matrices W_i, containing zeros if the corresponding data entries are missing and ones if not, are included in the loss function:

$$ L = \sum\limits_{i=1}^{I} \left\| (\mathbf{X}_i - \hat{\mathbf{X}}_i) * \mathbf{W}_i \right\|^2. $$
(3)

Note that * denotes the Hadamard (i.e., elementwise) product. On the basis of the loss function value L, the percentage of variance accounted for (VAF) can be computed as follows:

$$ VAF(\%) = \frac{\left\| \mathbf{X} \right\|^2 - L}{\left\| \mathbf{X} \right\|^2} \times 100. $$
(4)
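Given the observed and reconstructed data blocks, the loss (Eq. 2) and the VAF (Eq. 4) are straightforward to compute. The sketch below assumes complete data and our cell array conventions, with Xhat holding the reconstructed blocks.

```matlab
% Loss function value (Eq. 2) and percentage of variance accounted for (Eq. 4).
L = 0; ssX = 0;
for i = 1:numel(Xblocks)
    L   = L   + norm(Xblocks{i} - Xhat{i}, 'fro')^2;  % squared residuals
    ssX = ssX + norm(Xblocks{i}, 'fro')^2;            % total sum of squares
end
VAF = (ssX - L) / ssX * 100;
```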

The algorithms for estimating the multiblock component models and for missing data imputation are described in the following subsections.

Algorithms

In this section, we discuss the algorithms for performing separate PCA, SCA-ECP, and clusterwise SCA-ECP analyses. Each of these algorithms is based on a singular value decomposition. However, unlike the separate PCA algorithm, which boils down to the computation of a closed-form solution, the SCA-ECP and clusterwise SCA-ECP algorithms are iterative procedures.

Separate PCAs for each of the data blocks are obtained on the basis of the singular value decomposition of data block X_i into U_i, S_i, and V_i, with \( \mathbf{X}_i = \mathbf{U}_i \mathbf{S}_i \mathbf{V}_i^{\prime} \) (Jolliffe, 2002). Least squares estimators of F_i and B_i are \( \mathbf{F}_i = \sqrt{N_i}\,\mathbf{U}_{i(Q)} \) and \( \mathbf{B}_i = \frac{1}{\sqrt{N_i}}\mathbf{V}_{i(Q)}\mathbf{S}_{i(Q)} \), where \( \mathbf{U}_{i(Q)} \) and \( \mathbf{V}_{i(Q)} \) are the first Q columns of U_i and V_i, respectively, and \( \mathbf{S}_{i(Q)} \) consists of the first Q rows and columns of S_i.
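These estimators translate directly into MATLAB. In the sketch below, Xi is one autoscaled data block and Q is the chosen number of components.

```matlab
% Separate PCA of one data block via the singular value decomposition.
Ni = size(Xi, 1);
[U, S, V] = svd(Xi, 'econ');
Fi = sqrt(Ni) * U(:, 1:Q);                      % component scores, variance one
Bi = (1 / sqrt(Ni)) * V(:, 1:Q) * S(1:Q, 1:Q);  % loadings
% Xi is approximated in the least squares sense by Fi * Bi'.
```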

To estimate the SCA-ECP solution, an alternating least squares (ALS) procedure is used (see Timmerman & Kiers, 2003, for more details) that consists of four steps; a minimal code sketch follows the list:

1. Rationally initialize the loading matrix B: Initialize B by performing a singular value decomposition of the total data matrix X into U, S, and V, with X = U S V′. A rational start of B is then given by B = V_(Q), where V_(Q) contains the first Q columns of V.

2. (Re)estimate the component score matrices F_i: For each data block, decompose X_i B into U_i, S_i, and V_i, with X_i B = U_i S_i V_i′. A least squares estimate of the component scores F_i for the ith data block is then given by \( \mathbf{F}_i = \sqrt{N_i}\,\mathbf{U}_i\mathbf{V}_i^{\prime} \) (ten Berge, 1993).

3. Reestimate the loading matrix B: \( \mathbf{B} = ((\mathbf{F}^{\prime}\mathbf{F})^{-1}\mathbf{F}^{\prime}\mathbf{X})^{\prime} \), where F is the vertical concatenation of the component scores of all data blocks.

4. Repeat steps 2 and 3 until the decrease of the loss function value L in the current iteration is smaller than the convergence criterion, which is 1e-6 by default.
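The four steps can be assembled into a compact MATLAB function. This is a minimal sketch of the published algorithm under our cell array conventions, not the MBCA implementation itself.

```matlab
function [F, B, L] = sca_ecp(Xblocks, Q, epsconv)
% Minimal sketch of the SCA-ECP alternating least squares procedure.
if nargin < 3, epsconv = 1e-6; end
X = vertcat(Xblocks{:});                   % total N x J data matrix
[~, ~, V] = svd(X, 'econ');
B = V(:, 1:Q);                             % step 1: rational start of B
nBlocks = numel(Xblocks);
F = cell(nBlocks, 1);
Lprev = Inf;
while true
    for i = 1:nBlocks                      % step 2: reestimate the scores
        Ni = size(Xblocks{i}, 1);
        [Ui, ~, Vi] = svd(Xblocks{i} * B, 'econ');
        F{i} = sqrt(Ni) * Ui * Vi';
    end
    Fall = vertcat(F{:});                  % step 3: reestimate the loadings
    B = ((Fall' * Fall) \ (Fall' * X))';
    L = 0;                                 % step 4: evaluate the loss (Eq. 2)
    for i = 1:nBlocks
        L = L + norm(Xblocks{i} - F{i} * B', 'fro')^2;
    end
    if Lprev - L < epsconv, break; end
    Lprev = L;
end
end
```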

Clusterwise SCA-ECP solutions are also estimated by means of an ALS procedure (see De Roover, Ceulemans, Timmerman, Vansteelandt, et al., in press, for more details); a code sketch is given after the list:

1. Randomly initialize the partition matrix P: Randomly assign the I data blocks to one of the K clusters, where each cluster has an equal probability of being selected. If one of the clusters is empty, repeat this procedure until all clusters contain at least one data block.

2. Estimate the SCA-ECP model for each cluster: Estimate the F_i and B^k matrices for each cluster k by performing a rationally started SCA-ECP analysis, as described above, on the X_i data blocks assigned to the kth cluster.

3. Reestimate the partition matrix P: Each data block X_i is tentatively assigned to each of the K clusters. On the basis of the loading matrix B^k of cluster k and the data block X_i, a component score matrix for block i in cluster k is computed, and the fit of data block i in cluster k is evaluated. Eventually, the data block is assigned to the cluster for which its fit is maximal. When one of the K clusters is empty after this procedure, the data block with the worst fit in its current cluster is moved to the empty cluster.

4. Repeat steps 2 and 3 until the decrease of the loss function value L in the current iteration is smaller than the convergence criterion, which is 1e-6 by default.

Note that the clusterwise SCA-ECP algorithm may end in a local minimum. Therefore, it is advised to use a multistart procedure (e.g., 25 starts; see De Roover, Ceulemans, Timmerman, Vansteelandt, et al., in press) with different random initializations of the partition matrix P.
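Combining these steps with the sca_ecp function sketched above gives a compact picture of a single random start; empty-cluster repair and the multistart loop are omitted for brevity, so in practice this function would be called repeatedly (e.g., 25 times) and the best solution retained.

```matlab
function [P, F, B, L] = cw_sca_ecp_onestart(Xblocks, K, Q)
% Minimal sketch of one random start of the clusterwise SCA-ECP ALS procedure.
% P is coded here as a vector of cluster indices rather than a binary matrix.
nBlocks = numel(Xblocks);
P = randi(K, nBlocks, 1);                  % step 1: random initial partition
while numel(unique(P)) < K                 % redraw until no cluster is empty
    P = randi(K, nBlocks, 1);
end
Lprev = Inf;
while true
    B = cell(K, 1);                        % step 2: SCA-ECP model per cluster
    for k = 1:K
        [~, B{k}] = sca_ecp(Xblocks(P == k), Q);
    end
    F = cell(nBlocks, 1);                  % step 3: reassign each data block
    fit = zeros(nBlocks, 1);
    for i = 1:nBlocks
        Ni = size(Xblocks{i}, 1);
        best = Inf;
        for k = 1:K
            [Ui, ~, Vi] = svd(Xblocks{i} * B{k}, 'econ');
            Fik = sqrt(Ni) * Ui * Vi';     % least squares scores in cluster k
            Lik = norm(Xblocks{i} - Fik * B{k}', 'fro')^2;
            if Lik < best, best = Lik; P(i) = k; F{i} = Fik; end
        end
        fit(i) = best;
    end
    L = sum(fit);                          % step 4: check convergence
    if Lprev - L < 1e-6, break; end
    Lprev = L;
end
end
```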

Missing data imputation

To perform missing data imputation while fitting multiblock component models, weighted least squares fitting (Kiers, 1997) is used to minimize the weighted loss function (Eq. 3). This iterative procedure, which assumes the missing values to be missing at random (MAR), consists of the following steps (a code sketch is given further below):

1. Set t, the iteration number, to one. Initialize the N × J missing values matrix M^t by sampling its values from a standard normal distribution (random start) or by setting all entries to zero (zero start).

2. Compute the imputed data matrix \( \tilde{\mathbf{X}}^t = \mathbf{W} * \mathbf{X} + \mathbf{W}^c * \mathbf{M}^t \), where W^c is the binary complement of W (i.e., with ones for the missing values and zeros for the nonmissing values).

3. Perform a multiblock component analysis on \( \tilde{\mathbf{X}}^t \) (see the Algorithms section).

4. Set t = t + 1 and \( \mathbf{M}^t = \hat{\mathbf{X}}^t \), where \( \hat{\mathbf{X}}^t \) holds the reconstructed scores from step 3.

5. Repeat steps 2–4 until the decrease of the loss function value L in the current iteration is smaller than the convergence criterion, which is set to 1e-6 times 10% of the data size N × J; the criterion is scaled in this way to keep the computation time for larger data sets under control.

In the MBCA program, the described procedure is performed with five different starts (i.e., one zero start and four random starts) for the missing values matrix M t, and the best solution is retained. Note that the computation time will be considerably longer when missing data imputation is performed.
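The imputation loop itself is short. The sketch below shows a single zero start, with fit_model standing in as a hypothetical placeholder for whichever multiblock component analysis is requested; it is assumed to return the reconstructed (model-implied) data matrix.

```matlab
% Weighted least squares imputation loop (one zero start). X is the N x J data
% matrix and W the binary weight matrix (1 = observed, 0 = missing).
X(W == 0) = 0;                          % clear the missing entries
Wc = 1 - W;                             % complement: ones where data are missing
M = zeros(size(X));                     % step 1: zero start for the imputations
epsconv = 1e-6 * 0.1 * numel(X);        % criterion scaled by the data size N x J
Lprev = Inf;
while true
    Xt = W .* X + Wc .* M;              % step 2: fill in the current estimates
    Xhat = fit_model(Xt);               % step 3: multiblock component analysis
    L = norm(W .* (X - Xhat), 'fro')^2; % weighted loss (Eq. 3)
    if Lprev - L < epsconv, break; end
    Lprev = L;
    M = Xhat;                           % step 4: update the imputed values
end
```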

A simulation study was performed to investigate how the clusterwise SCA-ECP algorithm with missing data imputation performs in terms of goodness of recovery. A detailed description of the simulation study is provided in Appendix 1. From the study, which included missing data generated under different mechanisms, it can be concluded that the clustering of the data blocks, as well as each of the cluster loading matrices, is recovered very well in all simulated conditions. The overall mean computation time in the simulation study amounts to 22 min and 25 s, which is about 260 times longer than the computation time of clusterwise SCA-ECP on the complete data sets.

Model selection

In the previous step, the number of components Q and number of clusters K were assumed to be known. This is often not the case, however. In component analysis, the resulting model selection problem is often solved by fitting component models with different numbers of components and then selecting the model with the best balance between complexity and fit. To this end, a generalization of the well-known scree test (Cattell, 1966) can be used, based on a plot of the VAF (Eq. 4) against the number of components. Using this plot, the “best” number of components is determined by searching for the number of components after which the increase in fit with additional components levels off. The decision may be based on a visual inspection of the scree plot, but a number of automated scree test procedures have been proposed as well (e.g., DIFFIT, Timmerman & Kiers, 2000; CHULL, Ceulemans & Kiers, 2006).

Building on the CHULL procedure, we propose to select the component solution for which the scree ratio

$$ sr_{(Q)} = \frac{VAF_Q - VAF_{Q-1}}{VAF_{Q+1} - VAF_Q} $$
(5)

is maximal, where VAF_Q is the VAF of a solution with Q components. Note that the lowest and highest number of components considered will never be selected, since, for them, the scree ratio (Eq. 5) cannot be calculated. For selecting among separate PCA solutions or SCA-ECP solutions, this scree criterion can readily be applied. For clusterwise SCA-ECP, model selection is more intricate, however, because the number of clusters also needs to be determined (which is analogous to the problem of determining the number of mixture components in mixture models; e.g., McLachlan & Peel, 2000). As a way out, one may use a two-step procedure in which, first, the best number of clusters is determined and, second, the best number of components. More specifically, the first step of this procedure starts by calculating the scree ratio \( sr_{(K|Q)} \) for each value of K, given different values of Q:

$$ sr_{(K|Q)} = \frac{VAF_K - VAF_{K-1}}{VAF_{K+1} - VAF_K}. $$
(6)

Subsequently, for each number of components Q, the best number of clusters K is the number of clusters for which the scree ratio is maximal. The overall best number of clusters K^best is determined as the K-value that has the highest average scree ratio across the different Q-values. The second step aims at selecting the best number of components. To this end, given K^best, the scree ratios are calculated for each number of components Q:

$$ sr_{(Q|K^{best})} = \frac{VAF_Q - VAF_{Q-1}}{VAF_{Q+1} - VAF_Q}. $$
(7)

The best number of components Q^best is the number of components Q for which the scree ratio is maximal.
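In code, the scree criterion amounts to a few lines. In the sketch below, vaf is a vector holding the VAF percentages of the solutions with 1 to Qmax components.

```matlab
% Scree ratios (Eq. 5); the lowest and highest numbers of components receive a
% ratio of zero and can therefore never be selected.
sr = zeros(size(vaf));
for Q = 2:numel(vaf) - 1
    sr(Q) = (vaf(Q) - vaf(Q-1)) / (vaf(Q+1) - vaf(Q));
end
[~, Qbest] = max(sr);                   % best number of components
```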

We applied this procedure for selecting an adequate clusterwise SCA-ECP solution for the emotion data among solutions with one to six clusters and one to six components. Table 2 contains the scree ratios for determining the number of clusters and the number of components. Upon inspection of the \( sr_{(K|Q)} \) ratios in Table 2 (upper part), we conclude that the best number of clusters differs across the solutions with one to six components. Therefore, we computed the average scree ratios across the different numbers of components, which equaled 1.88, 2.01, 1.08, and 1.32 for two to five clusters, respectively, and decided to retain three clusters. The \( sr_{(Q|K^{best})} \) values in Table 2 (lower part) suggest that the best number of components Q^best is two. Hence, we selected the model with three clusters and two components, which was discussed in The clusterwise SCA-ECP model section.

Table 2 Scree ratios for the numbers of clusters K given the numbers of components Q and averaged over the numbers of components (upper part), and for the numbers of components Q given three clusters (lower part), for the emotion data. The maximal scree ratio in each column is highlighted in boldface

To evaluate the model selection procedure, we performed a simulation study, of which details can be found in Appendix 2. The simulation study revealed that the model selection procedure works rather well, with a correctly selected clusterwise SCA-ECP model in 91% of the simulated cases.

Multiblock component analysis program

The most up-to-date versions of the MBCA software can be downloaded from http://ppw.kuleuven.be/okp/software/MBCA/. When clicking on the .exe file (for the stand-alone version) or typing "MultiBlock_Component_Analysis" in the MATLAB command window (for the MATLAB version), the interface of the software program (see Fig. 1) opens, consisting of three panels: "data description and data files," "analysis options," and "output files and options." In this section, first, the functions of each panel of the software interface are clarified. Next, performing the analysis and error handling are described. Finally, the format and content of the output files are discussed.

Data description and data files

In the “data description and data files” panel, the user first loads the data by clicking the appropriate “browse” button and selecting the data file. This file should be an ASCII (.txt) file, in which the data blocks are placed below each other, with the rows representing the observations and the columns representing the variables. The columns may be separated by a semicolon, one or more spaces, or horizontal tabs (see Fig. 3). Missing values should be indicated in the data file by “.,” “/,” “*” or the letter “m” (e.g., in Fig. 3, the missing data values are indicated by “m”). If some data are missing, the user should select the option “missing data, indicated by . . .” in the “missing data imputation” section of the panel and specify the symbol by which the missing values are indicated.

Fig. 3 Screenshot of (from left to right) a data file, a number of rows file, and a labels file. An "m" in the data file indicates a missing value

Next, the user selects a “number of rows” file (also an ASCII file) by using the corresponding “browse” button. The selected file should contain one column of integers, indicating how many observations the consecutive data blocks contain, where the order of the numbers corresponds to the order of the data blocks in the data file (see Fig. 3).

Finally, the user may choose to upload a file with meaningful labels for the data blocks, the observations within the data blocks, and the variables. The labels file should be an ASCII file containing three groups of labels, in the form of strings that are separated by empty lines, in the following order: block labels, object labels, and variable labels. Note that tabs are not allowed in the label strings. If the user does not load a labels file, the option “no (no labels)” in the right-hand part of the panel is selected. In that case, the program will use default labels in the output (e.g., “block1” for the first data block, “block1, obs1” for the first object of the first data block, and “column1” for the first variable).

Analysis options

In the “type of analysis” section of the “analysis options” panel, the user can choose which types of multiblock component analysis need to be performed, on the basis of the expected differences and/or similarities between the underlying structure of the different data blocks (as was explained in the Introduction). The user selects at least one of the methods: clusterwise SCA-ECP, separate PCA per data block, and SCA-ECP.

In the case of clusterwise SCA-ECP analysis, the user specifies the number of clusters in the “complexity of the clustering” section. The maximum number of clusters is 10, unless the data contain fewer than 10 data blocks (in that case, the maximum number of clusters is the number of data blocks). In addition to that, the user chooses one of the following two options: “analysis with the specified number of clusters only” or “analyses with 1 up to the specified number of clusters.” In the latter case, the software generates solutions with one up to the specified number of clusters and specifies which number of clusters should be retained according to the model selection procedure (see the Model Selection section).

In the “complexity of the component structure” section, the user specifies a number of components between 1 and 10. Just as in specifying the number of clusters (for clusterwise SCA-ECP), the user can choose to perform the selected analyses with one up to the specified number of components or with the specified number of components only. In the former case, the model selection procedure (described in the Model selection section) will be applied to suggest what the best number of components is.

Finally, in the “analysis settings” section, the user can indicate how many random starts will be used, with a maximum of 1,000. The default setting is 25 random starts, on the basis of a simulation study by De Roover et al. (in press).

Output files and options

In the panel “output files and options,” the user indicates, by clicking the appropriate “browse” button, the directory in which the output files are to be stored. The user may also specify a meaningful label for the output files, to be able to differentiate among different sets of output files (for instance, for different data sets) and to avoid the output files to be overwritten next time the software program is used. The specified label is used as the first part of the name of each output file, while the last part of the file names refers to the content of the file and is added by the program. It is important to note that the label for the output files should not contain empty spaces.

In the “required output” section, the parameters to be printed in the output files can be selected. More specifically, the user indicates whether output files with unrotated, orthogonally rotated, and/or obliquely rotated loadings are needed and whether the component scores—counterrotated accordingly—are to be printed in those output files as well. Note that the output files often become very large when component scores are printed. For orthogonal rotation of the component matrices, the normalized varimax rotation (Kaiser, 1958) is used, while oblique rotation is performed according to the HKIC criterion (Harris & Kaiser, 1964; Kiers & ten Berge, 1994a).

Analysis

Performing the analyses

After specifying the necessary files and settings, as described in the previous sections, the user clicks the “run analysis” button to start the analysis. The program will start by reading and preprocessing the data; then the requested analyses are performed. During the analysis, the status of the analyses is displayed in the box at the bottom of the software interface, such that the user can monitor the progress. The status information consists of the type of analysis being performed at that time and the number of (clusters and) components being used (see Fig. 1). For clusterwise SCA-ECP analysis, the random start number is included in the status. When analyses with missing data imputation are performed, the start number and iteration number of the imputation process are added to the status as well. When the analysis is done, a screen pops up to notify the user. After clicking the “OK” button, the user can consult the results in the output files stored in the selected output directory.

Error handling

If the files or options are not correctly specified, one or more error screens will appear, with indications of the errors. After clicking “OK,” the analysis stops, and the content of the error messages is displayed in the box at the bottom of the interface. The user can then correct the files or settings and click “run analysis” again.

In some cases, a warning screen may appear. Specifically, a warning is given when missing data imputation is requested but no missing values are found, when missing data imputation is requested and the analyses are expected to take a very long time (i.e., when more than 10% of the data are missing and/or when more than 20 different analyses are requested, where each analysis refers to a particular K and Q value), or when some variables have a variance of zero for one or more data blocks (see the Checking data requirements section). In the latter case, a warning screen appears with the three options for dealing with invariant variables (as described in the Checking data requirements section). For the first two options, the number of data blocks or variables that would have to be removed for the data set at hand is stated between brackets. In addition to these three options, a fourth option is given that refers to a future upgrade of the software program containing a different variant of clusterwise SCA (i.e., clusterwise SCA-P; De Roover, Ceulemans, Timmerman, & Onghena, 2011). Also, a text file with information on which variables are invariant within which data blocks is created in the output directory and opened together with the warning screen. When the user chooses to continue the analysis, the third solution for invariant variables (i.e., replacing the invariant scores by zeros) is applied automatically by the software program. Otherwise, the user can click "no" to stop the analysis and remove data blocks and/or variables to solve the problem.

Output files

The MBCA program creates separate ASCII (.txt) output files for each combination of multiblock component method (separate PCAs, SCA-ECP, and/or clusterwise SCA-ECP) and rotation method (unrotated, orthogonal, and/or oblique; see Fig. 2 for an example). For each used number of (clusters and) components, these output files contain all obtained component loadings and, if requested, the component scores. For separate PCAs, the output is organized per data block. When the solutions are obliquely rotated, the component correlations are added to the output file in question. For separate PCAs, SCA-ECP, and clusterwise SCA-ECP, these correlations are computed for each data block, across all data blocks, and across all data blocks within a cluster, respectively. In the clusterwise SCA-ECP output files (e.g., Fig. 2), the partition matrices are printed as well.

In addition to the ASCII output files, the software program creates an output overview (.mht) file. For data with missing values, this file contains the percentage of missing values per data block and the total percentage of missing data. The file also displays the overall fit values for each of the performed analyses. When analyses are performed for at least four different numbers of clusters and/or components, the overview file shows the results of the model selection procedures for each component method. Specifically, the overview file suggests how many components and, if applicable, how many clusters should be retained. Sometimes, for clusterwise SCA-ECP, no suggestion can be made with respect to the number of clusters that should be used—for instance, because only two or three numbers of clusters are used. In that case, the best number of components is indicated for each number of clusters separately.

To further facilitate model selection, the output overview provides a scree plot (e.g., Fig. 4) in which the percentage of explained variance is shown as a function of the number of components for each number of clusters separately. When separate PCAs or SCA-ECP analyses are performed, an additional scree line is added to the scree plot. Moreover, all the computed scree ratios are printed. Note that for clusterwise SCA-ECP analyses, a table of scree ratios is provided for each number of clusters, given the different numbers of components and averaged over the numbers of components (e.g., Table 2), as well as a similar table for the numbers of components given the different numbers of clusters. On the basis of these tables, the user can select additional solutions for further consideration. Of course, the interpretability of the different solutions should also be taken into account.

Fig. 4 Percentage of explained variance for separate PCA, SCA-ECP, and clusterwise SCA-ECP solutions for the emotion data, with the number of components varying from one to six and the number of clusters for clusterwise SCA-ECP varying from one to six

Finally, the output overview provides information on the fit of the different data blocks within all obtained solutions. This information can be consulted to detect data blocks that are aberrant (i.e., fitting poorly) within a certain model.

Conclusion

Behavioral research questions may concern the correlational structure of multivariate multiblock data. To explore this structure, a regular PCA or EFA is inappropriate, because this would mix up the between-block differences in means and in correlational structures. In this article, we gave an overview of more sophisticated factor analysis and component analysis techniques that have been proposed for investigating structural differences and similarities between blocks of data. We focused on multiblock component analysis, because this is a flexible approach that proved its usefulness in empirical practice. Moreover, for clusterwise SCA-ECP, which is the most general multiblock component model, no counterpart exists in factor analysis. An example from emotion research illustrated the value of this approach. To facilitate the use of multiblock component analysis, we introduced the MBCA program and provided guidelines on how to perform a multiblock component analysis in practice.