Single-case or single-subject experimental designs (SSEDs) are used when one is interested in the effect of a treatment for one specific subject, a person or another entity. In the most basic design, the time series design, the subject is observed several times before the treatment, during the so-called baseline phase, and several times during or after the treatment. Because it is difficult to generalize the results from such an experiment to other subjects, the experiment can be replicated within or across studies. Next, the results of several single-case studies can be combined using meta-analytic techniques. There is a plethora of research describing how to perform a meta-analysis of group comparison studies, in which study subjects are typically measured only once or a few times (e.g., Cooper, 2010; Lipsey & Wilson, 2001). In contrast, the procedures necessary to conduct a meta-analysis of SSEDs are not well documented, and it is not straightforward how SSEDs should be meta-analyzed. This is because SSED data differ from group comparison study data: SSEDs entail a far smaller number of subjects, for whom many repeated measures are taken. The resulting small-sample, interrupted time series data may involve cyclical patterns or serial dependencies (West & Hepworth, 1991).

Reviews of meta-analyses of SSEDs indicate that a multitude of methods are used (Beretvas & Chung, 2008; Maggin, O'Keeffe, & Johnson, 2011), each method having its own advantages and weaknesses. Most studies on the methodology of SSED meta-analysis have focused on the question of which effect size should be used to describe the treatment effect in each study being synthesized (e.g., Campbell, 2004; Parker, Vannest, & Davis, 2011; Wolery, Busick, Reichow, & Barton, 2010). There have been many proposals: nonparametric approaches (e.g., the percentage of nonoverlapping data statistic; Scruggs, Mastropieri, & Casto, 1987), parametric approaches (e.g., the standardized mean difference; Gingerich, 1984), and regression-based methods (Allison & Gorman, 1993; Center, Skiba, & Casey, 1985–1986; Van den Noortgate & Onghena, 2003a, 2003b). Maggin et al. (2011b) compared the weaknesses and advantages of several methods (both parametric and nonparametric), and on the basis of this overview, use of the hierarchical linear or multilevel model seemed most promising. Use of a multilevel model is consistent with the logic of visual analysis, can control for threats to interpretation (e.g., autocorrelation), and has attractive statistical properties (e.g., being able to capture differences between subjects and/or studies in the magnitude of the effect).

In this study, we will examine the effectiveness of using a multilevel meta-analysis to synthesize effect sizes from a set of SSED results. Van den Noortgate and Onghena (2008) illustrated this approach using a reanalysis of the study results combined in the meta-analysis of Shogren, Faggella-Luby, Bae, and Wehmeyer (2004). However, an illustration using real data does not show whether the multilevel approach results in proper parameter and standard error estimates for the effect sizes and variance components. Simulation research makes it possible to investigate this question, because the population parameters are specified in advance and, therefore, known. First, we will discuss the multilevel approach to meta-analysis of SSEDs; next, we will present our simulation study.

Multilevel meta-analysis of SSEDs

Meta-analysis is a set of statistical methods for combining the results of various studies addressing the same research question (Glass, 1976). To combine these results, study results are typically first converted to a common standardized effect size metric. The advantage of using effect sizes is that the raw data do not need to be available for every study. Effect sizes may already be reported in a study, or they can be calculated on the basis of available test statistics. One possible way to calculate effect sizes for SSEDs is to make use of a regression model. Center et al. (1985–1986) proposed using the following regression model to analyze data from an SSED study:

$$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 D_i + \beta_3 T_i D_i + e_i $$
(1)

where \( Y_i \) is the score of the subject at the ith point in time, \( D_i \) is a dummy variable that equals 0 in the baseline phase and 1 in the treatment phase, and \( T_i \) is a time-related variable that equals 0 on the first day of the treatment phase. Therefore, \( \beta_0 \) is the baseline intercept, and \( \beta_1 \) is the linear trend during the baseline. \( \beta_2 \) is the treatment effect on the intercept (i.e., the immediate treatment effect at the start of the treatment phase), and \( \beta_3 \) is the effect of the treatment on the time trend. Van den Noortgate and Onghena (2003a) proposed using the last two regression coefficients as effect size measures: \( \beta_2 \) for the immediate treatment effect and \( \beta_3 \) for the treatment effect on the time trend.

If we can obtain, for each case, estimates for these two effect sizes, either by using Eq. 1 on the raw data or on the basis of reported summary statistics or test statistics, then the resulting effect sizes can be combined in two separate meta-analyses: one for the immediate treatment effect and one for the treatment effect on the time trend. There are several ways to do this. A three-level model presented by Van den Noortgate and Onghena (2003b, 2008) makes it possible to model variability in effect sizes at each of the three levels. At the first level, the effect size estimates of the immediate treatment effect for case j from study k may be modeled to randomly deviate from the unknown population effect size:

$$ b_{2jk} = \beta_{2jk} + r_{2jk} \quad \text{with } r_{2jk} \sim N\left(0, \sigma_{r_{2jk}}^2\right) $$
(2)

with \( b_{2jk} \) the ordinary least squares (OLS) estimate of \( \beta_{2jk} \). The random deviations of the observed regression coefficients are assumed to be normally distributed with a variance that depends on the coefficient and the subject. Because we have only one estimate per case for a coefficient, this variance cannot be estimated in the three-level analysis. However, the sampling variance of the regression coefficients, \( \sigma_{r_{2jk}}^2 \), can be estimated in the original OLS regression analysis used to estimate the regression coefficient or can be derived from summary or test statistics reported in the primary SSED study. Because these variances are estimated before the actual meta-analysis is performed, the three-level meta-analysis (as well as other typical meta-analyses) can be regarded as a “variance known problem” (Raudenbush & Bryk, 2002).
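To make this first step concrete, the following sketch (illustrative only; not the code used in the original studies) shows how the case-specific coefficients \( b_{2jk} \) and \( b_{3jk} \) and their sampling variances might be obtained from the raw data of one case by OLS, assuming that the scores, the time variable (equal to 0 at the start of the treatment), and the phase dummy are available as numeric arrays:

```python
import numpy as np

def case_effects(y, t, d):
    """OLS fit of Eq. 1 for a single case.  Returns the estimates of
    beta_2 (immediate treatment effect) and beta_3 (effect on the time
    trend) together with their sampling variances, taken from the usual
    OLS covariance matrix MSE * (X'X)^{-1}."""
    X = np.column_stack([np.ones(len(t)), t, d, t * d])   # columns: 1, T, D, T*D
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    dfe = len(y) - X.shape[1]                              # residual degrees of freedom
    mse = resid @ resid / dfe                              # estimate of sigma_e^2
    cov_b = mse * np.linalg.inv(X.T @ X)                   # sampling covariance of the coefficients
    return {"b2": b[2], "b3": b[3],
            "var_b2": cov_b[2, 2], "var_b3": cov_b[3, 3],
            "rmse": float(np.sqrt(mse))}
```

The per-case sampling variances returned here play the role of the "known" variances \( \sigma_{r_{2jk}}^2 \) that enter Eq. 2.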

The population effect sizes \( \beta_{2jk} \) from study k can be modeled as varying over subjects around a study-specific mean effect \( \theta_{20k} \) (second level) as follows:

$$ \beta_{2jk} = \theta_{20k} + u_{2jk} \quad \text{with } u_{2jk} \sim N\left(0, \sigma_{u_{2jk}}^2\right) $$
(3)

and the study-specific mean effects can vary between studies (third level):

$$ \theta_{20k} = \gamma_{200} + v_{20k} \quad \text{with } v_{20k} \sim N\left(0, \sigma_{v_{20k}}^2\right) $$
(4)

The same fundamental three-level model could also be used to model variability in the treatment’s effect on the time trend, using the following level one, two, and three equations,

$$ b_{3jk} = \beta_{3jk} + r_{3jk} \quad \text{with } r_{3jk} \sim N\left(0, \sigma_{r_{3jk}}^2\right) $$
(5)
$$ \beta_{3jk} = \theta_{30k} + u_{3jk} \quad \text{with } u_{3jk} \sim N\left(0, \sigma_{u_{3jk}}^2\right) $$
(6)
$$ \theta_{30k} = \gamma_{300} + v_{30k} \quad \text{with } v_{30k} \sim N\left(0, \sigma_{v_{30k}}^2\right), $$
(7)

respectively.

When the scale used is not the same for all subjects, the effect sizes \( b_{2jk} \) and \( b_{3jk} \) are not comparable across subjects and, therefore, cannot be combined. For example, if the dependent variable for a first subject can range from 0 to 10 and for a second subject from 0 to 100, the expected unstandardized effect size of the second subject may be 10 times larger than the effect size of the first subject. Unstandardized effect sizes are appropriate only when the dependent variable is measured in the same way for all subjects. In practice, this situation is rare, unless studies are exact replications of each other. In the example, the expected residual standard deviation \( \sigma_e \) will also be 10 times larger for the second subject. Therefore, effect sizes can be made comparable over subjects by standardizing \( b_{2jk} \) and \( b_{3jk} \), that is, by dividing them by the estimated residual standard deviation \( \widehat{\sigma}_e \) (Van den Noortgate & Onghena, 2003b, 2008). This residual standard deviation can be estimated by fitting the OLS regression model in Eq. 1 separately for each subject and taking the square root of the mean squared error (the RMSE) of each separate regression. When there are many measurements for each subject, this standard deviation can be estimated reasonably well. When there are only a few measurements per subject, however, it may be poorly estimated, resulting in poor estimates of the standardized effect size. Therefore, in this study, we are primarily interested in the performance of a multilevel meta-analysis when there are not many measurements for each subject.
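As an illustration of this standardization step (again a sketch under our own naming conventions, not the authors' code), the estimated coefficients from the case-specific OLS fit can be divided by the RMSE of that fit; dividing the sampling variances by the squared RMSE, as done here, is an approximation that ignores the uncertainty in the RMSE itself:

```python
def standardize_effects(b2, b3, rmse, var_b2, var_b3):
    """Standardize the case-specific regression coefficients by dividing
    them by the estimated residual standard deviation (the RMSE of the
    OLS fit of Eq. 1), so that effects from cases measured on different
    scales become comparable.  Rescaling the sampling variances by
    1/rmse**2 neglects the estimation error in the RMSE (approximation)."""
    return {"d2": b2 / rmse, "d3": b3 / rmse,
            "var_d2": var_b2 / rmse**2, "var_d3": var_b3 / rmse**2}
```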

During a single-subject experiment, subsequent measurements can be influenced by common (random) factors, with the result that measurements close to each other in time may be correlated. This phenomenon of autocorrelation can be included in the previous multilevel model by specifying an autoregressive covariance structure for the first-level errors (i.e., the e i from Eq. 1). In the present study, however, we will focus on the basic model with no autocorrelation.

The parameters typically estimated in a multilevel analysis are the fixed effects regression coefficients (e.g., γ200 referring to the average immediate treatment effect over cases and studies and γ300 referring to the average treatment effect on the linear trend in Eqs. 4 and 7, respectively). Although this multilevel approach seems well suited for single-case data, in practice, it is seldom used. One reason might be that little is known about the functioning of this model for SSED meta-analysis. This study assesses the performance of this method in more detail, using a set of realistic situations.

Method

To evaluate the performance of the three-level approach, we used a simulation study consisting of several steps. In a first step, raw data were simulated. In a second step, the unstandardized regression coefficients of Eq. 1 were estimated for each subject, using OLS estimation. Next, these estimates were divided by the residual within-phase standard deviation to obtain the corresponding standardized effect sizes. Finally, the unstandardized and the standardized effect sizes were separately analyzed using the three-level meta-analytic approach (using Eqs. 2–7), and the results were compared with the parameter values used to generate the data. Although meta-analyses with unstandardized effect sizes will rarely be performed in practice, they are analyzed in this study because, if problems arise, we can then determine whether these problems are due to the standardization of the effects or to the multilevel meta-analysis itself.

To simulate the raw data, the following measurement occasion (level one) model was used:

$$ Y_{ijk} = \beta_{0jk} + \beta_{1jk} T_{ijk} + \beta_{2jk} D_{ijk} + \beta_{3jk} T_{ijk} D_{ijk} + e_{ijk} \quad \text{with } e_{ijk} \sim N\left(0, \sigma_e^2\right) $$
(8)

with occasions nested within individuals at level two:

$$ \left\{ \begin{aligned} \beta_{0jk} &= \theta_{00k} + u_{0jk} \\ \beta_{1jk} &= \theta_{10k} + u_{1jk} \\ \beta_{2jk} &= \theta_{20k} + u_{2jk} \\ \beta_{3jk} &= \theta_{30k} + u_{3jk} \end{aligned} \right. \quad \text{with } \begin{bmatrix} u_{0jk} \\ u_{1jk} \\ u_{2jk} \\ u_{3jk} \end{bmatrix} \sim N\left(0, \Sigma_u\right), $$
(9)

and within studies at level three:

$$ \left\{ \begin{aligned} \theta_{00k} &= \gamma_{000} + v_{00k} \\ \theta_{10k} &= \gamma_{100} + v_{10k} \\ \theta_{20k} &= \gamma_{200} + v_{20k} \\ \theta_{30k} &= \gamma_{300} + v_{30k} \end{aligned} \right. \quad \text{with } \begin{bmatrix} v_{00k} \\ v_{10k} \\ v_{20k} \\ v_{30k} \end{bmatrix} \sim N\left(0, \Sigma_v\right) $$
(10)

Note that we simulated data on the same scale for each subject and study. However, this is not a limitation of this simulation study. If we had simulated data using different scales, this effect would be neutralized by the standardization (e.g., if for a specific subject we had multiplied each score by 5, both the estimated regression coefficients and the estimated residual standard deviation would be 5 times larger, so the estimated standardized coefficients would remain unchanged). By simulating data on the same scale, however, it is possible to evaluate, at the same time, the use of the multilevel model for SSED meta-analysis in situations in which standardization is required and in situations in which it is not.

The effect sizes were synthesized using the restricted maximum likelihood estimation procedure implemented in SAS PROC MIXED (Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 2006), with the Satterthwaite approach to estimating degrees of freedom. For two-level analyses of multiple-baseline data, this approach has been shown to provide relatively accurate confidence intervals for estimates of the average treatment effect (Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009), albeit using a model without linear trends to simulate and analyze the data. On the basis of these results, we expect that use of this estimation procedure for three-level analyses would also result in relatively accurate confidence intervals for the overall treatment effects.
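The model can in principle be fitted with any mixed-model software; we used SAS PROC MIXED. Purely for illustration, the sketch below fits the variance-known three-level model of Eqs. 2–4 (effect sizes nested within cases nested within studies) by restricted maximum likelihood with numpy and scipy. It returns the overall effect, its standard error, and the between- and within-study variance estimates, but it does not reproduce PROC MIXED or the Satterthwaite degrees-of-freedom adjustment, and all function and variable names are our own:

```python
import numpy as np
from scipy.optimize import minimize

def reml_three_level(effects, samp_var, study):
    """Variance-known three-level meta-analysis (Eqs. 2-4):
    b_jk = gamma + v_k + u_jk + r_jk, with Var(r_jk) = samp_var known.
    Estimates gamma, the between-study variance (of v_k), and the
    within-study variance (of u_jk) by REML.  Illustrative sketch only."""
    effects, samp_var, study = map(np.asarray, (effects, samp_var, study))
    study_ids = np.unique(study)

    def neg2_reml(log_var):
        s2_v, s2_u = np.exp(log_var)
        xvx = xvy = logdet = 0.0
        pieces = []
        for k in study_ids:
            m = study == k
            V = s2_v + np.diag(s2_u + samp_var[m])   # sigma_v^2 * (matrix of ones) + diag(sigma_u^2 + known var)
            Vi = np.linalg.inv(V)
            one = np.ones(m.sum())
            xvx += one @ Vi @ one
            xvy += one @ Vi @ effects[m]
            logdet += np.linalg.slogdet(V)[1]
            pieces.append((m, Vi))
        gamma = xvy / xvx
        quad = sum((effects[m] - gamma) @ Vi @ (effects[m] - gamma) for m, Vi in pieces)
        return logdet + np.log(xvx) + quad            # -2 * REML log-likelihood (up to a constant)

    fit = minimize(neg2_reml, x0=np.log([0.5, 0.5]), method="Nelder-Mead")
    s2_v, s2_u = np.exp(fit.x)

    # overall effect and its standard error at the REML variance estimates
    xvx = xvy = 0.0
    for k in study_ids:
        m = study == k
        Vi = np.linalg.inv(s2_v + np.diag(s2_u + samp_var[m]))
        one = np.ones(m.sum())
        xvx += one @ Vi @ one
        xvy += one @ Vi @ effects[m]
    return {"gamma": xvy / xvx, "se_gamma": float(np.sqrt(1.0 / xvx)),
            "sigma2_between_study": s2_v, "sigma2_within_study": s2_u}
```

A model of this kind is fitted separately to the immediate-effect estimates (Eqs. 2–4) and to the trend-effect estimates (Eqs. 5–7).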

The γ200 coefficient represents the shift in level that occurs due to the treatment. Data were generated assuming either no effect (γ200 = 0) or an effect of 2 times the within-phase standard deviation (γ200 = 2). For the overall effect on the slope, γ300, we chose values of 0 (no effect) and 0.2 times the within-phase standard deviation. These values were based on reanalyses of meta-analyses (Alen, Grietens, & Van den Noortgate, 2009; Denis, Van den Noortgate, & Maes, 2011; Kokina & Kern, 2010; Shogren et al., 2004; Wang, Cui, & Parrila, 2011). The baseline regression coefficients, γ000 and γ100, were set to zero.

The number of studies (K) in each simulated meta-analysis was manipulated to be 10 or 30. A review of 39 social science single-case meta-analyses (Farmer, Owens, Ferron, & Allsopp, 2010) showed that the number of studies included in a meta-analysis ranged from 3 to 117, with 60 % of the meta-analyses including fewer than 30 studies.

We simulated studies with a multiple-baseline design. The number of subjects per study (J) was 3, 4, or 7. These values are based on recommendations by Barlow and Hersen (1984) and Kazdin and Kopel (1975), a survey of multiple-baseline studies (Ferron et al., 2010), the review by Farmer et al. (2010), and a survey of single-case studies by Shadish and Sullivan (2011).

The series lengths (I) consisted of 10, 20, or 40 observations. The survey of Ferron et al. (2010) found average series lengths that ranged from 7 to 58, with a median of 24, and the survey of Shadish and Sullivan (2011) found that the number of data points per case ranged from 2 to 160, with median and mode equal to 20. A meta-analysis of 85 single-case studies (Swanson & Sachse-Lee, 2000) found that 25 studies had fewer than 11 treatment sessions, 37 studies had between 11 and 29 treatment sessions, and 23 studies had more than 29 treatment sessions.

The intervention introductions were staggered across subjects within studies. For each combination of the number of subjects and number of data points, the moment at which the treatment started is given in Table 1. For example, when there were 3 subjects and 10 measurement occasions per subject, the treatment started on the fourth measurement occasion for the first subject, on the sixth for the second subject, and on the eighth for the third subject; for all 3 subjects, the treatment lasted until the tenth measurement occasion.

Table 1 The number of the measurement occasion at which the treatment started
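To make the data-generating steps concrete, the following sketch (following Eqs. 8–10 with diagonal covariance matrices; not the authors' simulation code) generates one meta-analytic data set; the default start points 4, 6, and 8 correspond to the 3-subject, 10-occasion example described above:

```python
import numpy as np

rng = np.random.default_rng(2012)

def simulate_meta(K=10, J=3, I=10, starts=(4, 6, 8),
                  gamma=(0.0, 0.0, 2.0, 0.2),
                  sigma2_v=(2.0, 0.2, 2.0, 0.2),
                  sigma2_u=(2.0, 0.2, 2.0, 0.2),
                  sigma2_e=1.0):
    """Generate one meta-analytic data set of K studies with J cases of
    I occasions each (Eqs. 8-10, diagonal Sigma_u and Sigma_v).  'starts'
    gives the occasion (1-based) at which the treatment begins for each
    of the J cases and must have length J."""
    rows = []
    for k in range(K):
        theta = np.asarray(gamma) + rng.normal(0.0, np.sqrt(sigma2_v))   # Eq. 10: study-specific coefficients
        for j in range(J):
            beta = theta + rng.normal(0.0, np.sqrt(sigma2_u))            # Eq. 9: case-specific coefficients
            occ = np.arange(1, I + 1)
            d = (occ >= starts[j]).astype(float)     # 0 = baseline phase, 1 = treatment phase
            t = occ - starts[j]                       # time variable, 0 at the start of the treatment
            y = (beta[0] + beta[1] * t + beta[2] * d + beta[3] * t * d
                 + rng.normal(0.0, np.sqrt(sigma2_e), I))                # Eq. 8: observed scores
            rows.append({"study": k, "case": j, "occasion": occ, "t": t, "d": d, "y": y})
    return rows
```

Each simulated case can then be passed to the per-case OLS step sketched earlier, and the resulting effect sizes to the meta-analytic step.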

Elements of the within-study variance matrix, \( \Sigma_u \), were manipulated to have conditions with relatively small and relatively large amounts of within-study variance. For simplicity, covariances between regression coefficients were set to zero at the subject and study levels. Therefore, \( \Sigma_u \) is a diagonal matrix, \( \Sigma_u = \operatorname{diag}\left(\sigma_{u_0}^2, \sigma_{u_1}^2, \sigma_{u_2}^2, \sigma_{u_3}^2\right) \). A review of several reanalyses indicated that the variance between subjects is sometimes less than the within-person variance (Ferron et al., 2009; Van den Noortgate & Onghena, 2003a) and sometimes greater than the within-person variance (Van den Noortgate & Onghena, 2008). If the within-person (level one) variance is set to 1.0, setting the four diagonal elements of \( \Sigma_u \) to values of 2, 0.2, 2, and 0.2 (for the variances in the baseline intercept, baseline slope, treatment effect on the intercept, and treatment effect on the slope residuals, respectively) represents a relatively large amount of within-study variability, whereas setting them to 0.5, 0.05, 0.5, and 0.05, respectively, represents a relatively small amount of within-study variability. Reanalyses of real data sets (Denis et al., 2011; Kokina & Kern, 2010; Shogren et al., 2004) indicated that the variance of the effect corresponding to γ200 is sometimes much larger than the variance of the effect corresponding to γ300. Therefore, we also set the four diagonal elements of \( \Sigma_u \) to values of 8, 0.08, 8, and 0.08.

Elements of the between-study variance matrix, \( \Sigma_v \), were also manipulated. On the basis of reanalyses of meta-analyses (Alen et al., 2009; Denis et al., 2011; Heyvaert, Maes, Van den Noortgate, Kuppens, & Onghena, in press; Kokina & Kern, 2010; Shogren et al., 2004; Wang et al., 2011), we set \( \Sigma_v = \operatorname{diag}\left(\sigma_{v_0}^2, \sigma_{v_1}^2, \sigma_{v_2}^2, \sigma_{v_3}^2\right) \) equal to diag(2, 0.2, 2, 0.2), diag(0.5, 0.05, 0.5, 0.05), or diag(8, 0.08, 8, 0.08).

Crossing the levels of the seven factors leads to a 3 × 3 × 2 × 2 × 2 × 3 × 3 factorial design yielding 648 experimental conditions. For each condition, 2,000 data sets were simulated and analyzed, with a total of 1,296,000 data sets.

Results

We will successively discuss the estimates of the two fixed effects used to describe an intervention’s effect (on the intercept and on the slope), the mean squared error, the estimation of the corresponding standard errors, the confidence interval coverage, the power, and the estimates of the variances. Because it is impossible to discuss the 648 conditions separately, we explored variation between conditions using an ANOVA, modeling main effects and two-way interaction effects, and discuss below only the effects for which the ANOVA showed clear evidence (p < .001). This procedure was primarily used to identify the most important patterns in the results.

Overall effect size estimates

In the simulation study, the mean population effect on the intercept was 0 or 2, and the effect on the time trend was 0 or 0.2. In each meta-analysis, we estimated these mean effects. These estimates will likely deviate from this mean population effect, because of random variation at each of the three levels.

In Fig. 1, the distribution of the deviations is given for γ200 equal to 0 or 2 and the number of measurements equal to 10, 20, or 40 when the unstandardized and the standardized regression coefficients are analyzed. For the unstandardized effect size estimates, close to no bias was identified.

Fig. 1 Distribution of the deviations of the estimated mean effect on the intercept of the treatment from its population value (γ200) for both the unstandardized and standardized effect sizes, for the γ300 = 0.2, K = 10, J = 4, \( \sigma_{v_2}^2 = 2 \), and \( \sigma_{u_2}^2 = 2 \) conditions

The results differ for the standardized effect sizes. When γ200 equaled 0, the estimated bias was close to zero. When γ200 equaled 2, the estimated bias of the standardized effect sizes was 0.310 when there were only 10 measurement occasions, 0.110 when there were 20 measurements, and 0.044 when there were 40 measurements. We also sorted all conditions by their estimated bias, and the 100 conditions with the largest bias all had γ200 = 2 and I = 10. The condition with the most bias was γ200 = 2, γ300 = 0, K = 10, J = 4, I = 10, \( \sigma_{v_2}^2 = 8 \), and \( \sigma_{u_2}^2 = 0.5 \); in this condition, the bias equaled 0.350. The number of studies, the number of subjects, and the true values of the between- and within-study variances were each found to have no substantial effect on the estimated bias.

The results for the estimates of γ300 were similar to those found for γ200. There was positive bias when standardized effect sizes were used if γ300 = 0.2 and I = 10. If γ200 = 2, γ300 = 0.2, K = 10, J = 4, I = 10, \( \sigma_{v_2}^2 = 2 \), and \( \sigma_{u_2}^2 = 2 \), the bias was 0.030, with a minimum and maximum deviation of −1.071 and 0.883, respectively, and with lower and upper quartiles of −0.115 and 0.178. When a meta-analysis on standardized effect sizes was performed, the relative biases of the two estimates were more or less the same: The relative bias of the estimate of both γ200 and γ300 was 0.155. The standard deviation of the relative deviations, on the other hand, differed: 0.324 for the estimates of γ200 and 1.090 for the estimates of γ300.

These results indicate that the positive bias when γ200 and γ300 are larger than zero and the number of measurements equals 10 results from the standardization of the effects. In this simulation study, however, it is not clear whether this positive bias is caused by calculating a weighted average of the individual regression coefficients or whether these individual regression coefficients themselves are already biased. That is why we also checked the distribution of the deviations of the individual regression coefficients \( \beta_{2jk} \) from their true values for γ200 = 2, γ300 = 0.2, K = 10, J = 4, I = 10, \( \sigma_{v_2}^2 = 2 \), and \( \sigma_{u_2}^2 = 2 \). In this condition, there were 80,000 estimates of \( \beta_{2jk} \). Table 2 shows that there is almost no bias when the effect sizes are unstandardized and that there is a positive bias of 0.302 when the effect sizes are standardized. This indicates that the bias of the estimated overall effects is mainly due to the standardization of the individual regression coefficients and not a result of calculating a weighted average.

Table 2 Distribution of the deviations of the individual regression coefficient estimates of \( \beta_{2jk} \), for the γ200 = 2, γ300 = 0.2, K = 10, J = 4, I = 10, \( \sigma_{v_2}^2 = 2 \), and \( \sigma_{u_2}^2 = 2 \) conditions

Mean squared error

The mean squared error (MSE) is the average squared deviation of the estimates from the true value; smaller values are therefore preferred. As expected, the MSE is smaller when the number of studies, the number of cases, or the number of measurement occasions increases. On the basis of the ANOVA, the number of studies and the between-study variance seem to have the most important influence on the MSE. This influence is illustrated in Table 3 for both the unstandardized and the standardized effect sizes.

Table 3 Mean squared error (MSE) of γ200 and γ300, for the γ200 = 2, γ300 = 0.2, J = 4, I = 20, and \( \sigma_{u_2}^2 = 2 \) conditions

Standard error estimates

In addition to assessing parameter estimation for each of the two fixed effects (on the intercept and on the slope), we also evaluated estimation of the standard errors of these effects. These standard errors can be used to construct confidence intervals and to perform tests of the statistical significance of the effects. By definition, the standard error equals the standard deviation of the sampling distribution of the estimated effects. In this simulation study, we performed 2,000 meta-analyses for each condition, which results in 2,000 estimates of the effect. Because of this reasonably large number of estimates, we can regard the standard deviation of the estimates as a good estimate of the standard deviation of the estimator’s sampling distribution and can, therefore, use this standard deviation to evaluate the quality of the standard error estimates.

The median of the standard error estimates was found to be almost equal to the standard deviation of the estimates of the effect for both the unstandardized and the standardized effects. As was expected, the standard error decreased when the number of studies, the number of measurement occasions, or the number of subjects increased or in conditions with lower between-study or within-study variance values.

The results were similar for both the standardized and the unstandardized effect sizes. The standard error of the unstandardized estimate of γ200 was slightly underestimated in 83.3 % of the conditions, and the relative bias over all conditions was −0.022. The difference between the median of the standard error estimates and the standard deviation of the effect size estimates was greatest for γ200 = 0, γ300 = 0, K = 10, J = 3, I = 10, \( \sigma_{v_2}^2 = 8 \), and \( \sigma_{u_2}^2 = 8 \). In this situation, the median of the standard error equaled 1.020, and the standard deviation of the estimates equaled 1.103. For the standardized effects, the standard error was slightly underestimated in 86.4 % of the conditions, and the relative bias over all conditions was −0.025. The difference between the median of the standard error and the standard deviation of the estimates was greatest in the same condition as for the unstandardized data, with the median of the standard error equal to 1.198 and the standard deviation of the estimates equal to 1.311. On the basis of the ANOVA, there seemed to be only a small effect of the number of studies on the difference between the median of the standard error and the standard deviation of the estimates. The results were also similar for the estimates of the standard error of γ300.

Confidence interval coverage

Another way to evaluate estimation of the effect sizes and of their corresponding standard errors is by calculating the proportion of replications in which the confidence interval contained the population effect size. The lower and upper limits of the confidence interval around the estimated effect are constructed by multiplying the estimated standard error by the appropriate critical value of the z distribution and subtracting this product from, and adding it to, the point estimate of the effect. For a 95 % confidence interval, we expected the coverage proportion to be around 95 %. Because we simulated 2,000 data sets for each condition, the coverage proportions could be estimated accurately (more specifically, with a standard error of \( \sqrt{(0.95 \times 0.05)/2000} \approx 0.005 \)); we therefore expected the coverage proportion to range from 94.04 % to 95.96 % (with α = .05).
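For concreteness, the coverage proportion and its Monte Carlo uncertainty can be computed as in the following sketch (illustrative; the names are our own), given the per-replication point estimates and standard errors:

```python
import numpy as np
from scipy.stats import norm

def coverage(estimates, ses, true_value, level=0.95):
    """Proportion of confidence intervals (estimate +/- z * SE) that
    contain the true value, and the Monte Carlo standard error of that
    proportion based on the nominal level (e.g., sqrt(.95*.05/2000))."""
    z = norm.ppf(0.5 + level / 2)                     # 1.96 for a 95 % interval
    est, se = np.asarray(estimates), np.asarray(ses)
    hit = (est - z * se <= true_value) & (true_value <= est + z * se)
    return hit.mean(), float(np.sqrt(level * (1 - level) / len(hit)))
```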

The coverage proportions for the unstandardized estimate of γ200 ranged from 93.45 % to 97.10 %, and in 91.36 % of the conditions, the coverage proportion lay between 94.04 % and 95.96 %. The coverage proportions for the unstandardized estimate of γ300 ranged from 93.60 % to 96.85 %. In 92.90 % of the conditions, the coverage proportion lay between 94.04 % and 95.96 %.

The results for the standardized estimates of γ200 differed. In 77.47 % of the conditions, the coverage proportions lay between 94.04 % and 95.96 %; in the other conditions, however, the coverage proportions were often too low, with a minimum of 70.65 % when γ200 = 2, γ300 = 0.2, K = 30, J = 7, I = 10, \( \sigma_{v_2}^2 = 0.5 \), and \( \sigma_{u_2}^2 = 0.5 \). There appears to be a problem with coverage when the effect is larger than zero and the number of measurement occasions is small (I = 10), and the problem gets worse when the number of studies is large (K = 30) and when the between-study \( \left(\sigma_{v_2}^2\right) \) and within-study \( \left(\sigma_{u_2}^2\right) \) variances are rather small (Table 4). Figure 1 already showed that there was positive bias when the number of measurement occasions is small and the effect is larger than zero. When the number of studies increases, the confidence interval becomes narrower, and the lower coverage proportions make it clear that this confidence interval varies around a biased estimator.

Table 4 Mean coverage proportion of the 95 % confidence interval of γ200 for both the unstandardized (U) and standardized (S) effect sizes for γ300 = 0.2 and J = 4 (upper section of the table) and of γ300 for both the unstandardized (U) and standardized (S) effect sizes for γ200 = 2 and J = 4 (lower section of the table)

The coverage proportion of the 95 % confidence interval for the standardized estimates of γ300 ranged from 92.8 % to 97.1 %. In 88.27 % of the conditions, the mean coverage proportion lay between 94.04 % and 95.96 %. We already mentioned that the relative bias of the estimates of γ200 and γ300 was more or less the same but that the standard deviation of the relative deviations from the true value was larger for the estimates of γ300. This results (again in relative terms) in larger estimated standard errors and wider confidence intervals, and therefore in a better coverage proportion than for the standardized estimates of γ200.

Power

In our study, we estimated the actual Type I error rate for conditions where the null hypothesis was true (i.e., γ200 = 0 or γ300 = 0), by calculating the proportion of data sets for which the null hypothesis was rejected with α equal to .05. For γ200, this proportion is given in Table 5, in the fourth through seventh columns, for the unstandardized estimates. The results for the standardized estimates were very similar.

Table 5 Power of γ200 for γ300 = 0.2, for the unstandardized effect sizes

When the null hypothesis is false, we want the proportion of correct rejections of the null hypothesis to be as high as possible and, preferably, above .80. The estimated power for γ200 = 2 and α = .05 is given in the last four columns of Table 5. On the basis of the ANOVA, the number of studies, the within-study variance, the between-study variance, and all the interactions between these factors seemed to influence power. For conditions in which the number of studies being meta-analyzed was 30, the power was high. On the other hand, when there were only 10 studies and a lot of between-study variability, the power was found to be much lower.

Table 6 contains the power for γ300 for the unstandardized effects. The estimated power was lower than that for γ200. The power was above .80 only when K = 30 and I > 10, except when the number of cases equaled 7. Here again, the results for the estimates of the standardized effects were very similar.

Table 6 Power of γ300 for γ200 = 2, for the unstandardized effect sizes

Variance estimates

In the meta-analyses, the between-study and within-study residual variances were estimated for both the effect on the intercept and the effect on the slope. We will examine the relative deviations, which are the deviations from the true value divided by the value of the population parameter. The same problem arises for the estimates of all four variances and for the meta-analyses of both the unstandardized and the standardized effect sizes: The distribution of the relative deviations is positively skewed, with a long right tail. For example, for the meta-analyses on the standardized effect sizes, the relative deviation was larger than 100 % for 13.39 % of the estimates of the between-study variance of the effect on the intercept, for 12.52 % of the estimates of the between-study variance of the effect on the slope, for 25.07 % of the estimates of the within-study variance of the effect on the intercept, and for 26.94 % of the estimates of the within-study variance of the effect on the slope. Table 7 shows the mean and median of the relative deviations of the variance estimates.

Table 7 Mean and median of the relative deviations of the variance estimates for the γ200 = 2, γ300 = 0.2, K = 10, J = 4, \( \sigma_{v_2}^2 = 2 \), and \( \sigma_{u_2}^2 = 2 \) conditions

Table 7 indicates that the relative bias of the variance estimates was larger when the meta-analysis was performed on standardized effect sizes. The number of measurements also had a large effect on the bias of the four estimated variances. The within-study variance exhibited the highest bias when the number of measurement occasions was 10. In general, the median relative deviation of the estimates of the within-study variance was larger than that found for the between-study variance estimates. The median of the relative deviation of the within-study variance of γ200 was especially large when the between-study variance was large and the within-study variance was small.

Discussion

In this study, we examined whether single-case studies can be combined using a multilevel meta-analysis, with unstandardized or standardized regression coefficients as effect sizes. Several realistic conditions were simulated and analyzed.

The multilevel approach works well when unstandardized effect sizes are used. The approach is also suitable for standardized effect sizes when there are many studies (30 or more), when there are a lot of measurement occasions for each subject (20 or more), and when the studies are rather homogeneous (which corresponds with a small amount of between-study variance). In these situations, the effects are well estimated, the mean squared error is small, the coverage proportion of the 95 % confidence interval is around 95 %, and the power of each effect is above 80 %.

However, when these criteria are not met, problems may occur. In this study, it became clear that difficulties are encountered in particular when there are only 10 measurement occasions for each subject. In this situation, the overall effect estimate is biased as a result of biased estimates of the individual standardized regression coefficients. Because standardizing is often needed in order to make study results comparable, further research should focus on this problem of standardization when the number of measurements is rather small. Possible solutions include the use of iterative bootstrap procedures (Goldstein, 1996) or correcting the individual regression coefficients for bias (Hedges, 1981; see the correction factor shown below). Another solution might be to use the correction procedures for biased standardized regression coefficients that were proposed by Yuan and Chan (2011), but these would have to be adapted, because those procedures were developed for regression coefficients that are standardized in the traditional way, namely, by multiplying them by the standard deviation of the independent variable and then dividing by the standard deviation of the dependent variable. In this study, we suggest standardizing on the basis of the standard deviation of the dependent variable only, and only insofar as this dependent variable is not explained by the predictors (i.e., we propose to use the residual standard deviation).
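For reference (and only as a possible direction, not something evaluated in this study), the small-sample correction of Hedges (1981) multiplies a standardized mean difference by the approximate factor

$$ c(m) \approx 1 - \frac{3}{4m - 1}, $$

where m is the number of degrees of freedom used to estimate the standard deviation. Whether this factor can be applied unchanged to regression coefficients standardized by the residual standard deviation, as proposed here, remains to be investigated.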

An important question for a researcher who wants to conduct a multilevel meta-analysis of SSEDs concerns the power under different scenarios. The results are the same for the unstandardized and the standardized effect sizes. When the immediate effect of the treatment is estimated, the power is acceptable when the studies are homogeneous, regardless of the number of studies, cases, or measurement occasions. When the studies’ effect sizes are more heterogeneous, a power of 80 % or more can be reached only when many studies (e.g., 30) are being meta-analyzed. The effect of the treatment on the slope, in contrast, is estimated with enough power only when the number of studies is 30 or more and the number of measurement occasions per subject is 20 or more.

The major advantage of multilevel models is that they result not only in an overall estimate of the effect, but also in estimates of the between-study and within-study variances. These estimates, however, are sometimes seriously biased. The estimates are worse for meta-analyses of standardized effects, and the estimates of the within-study variance are especially biased when the number of measurements is rather small.

We should also note that the conclusions are, in principle, limited to the conditions that were simulated. The data were balanced and sampled from a normal distribution, there was no correlation between consecutive observations, the random effects did not covary, and the trajectories were linear. In practice, however, data will probably not be balanced, autocorrelation will likely occur when observations are taken in quick succession, the underlying distribution may differ, the effects may be correlated, and the baseline and/or intervention phase trends might be nonlinear. The purpose of this study was to discover which parameters really matter and to identify the conditions in which problems occur. Our aim was to get a preliminary insight into the empirical functioning of the multilevel model for SSED meta-analysis, but in future research, we will also want to investigate more complex situations.

In this study, we performed two separate meta-analyses for the effects \( b_{2jk} \) and \( b_{3jk} \), whereby we assumed that these two effects are independent of each other. This assumption may not be realistic in applied settings. If a covariance at the second and/or third level can be expected, a multivariate meta-analysis might be more powerful (Kalaian & Raudenbush, 1996). In a pilot study, we already explored the performance of the multivariate approach, with simulated effect sizes that did not covary at the second and third levels and showed sampling covariance only at the first level. The results were similar to those of the present study. In future research, we want to explore the operating characteristics of the multivariate multilevel model for meta-analytic data of SSEDs in situations where there is nonzero covariance between the parameters. We expect that the gains of such a multivariate model (e.g., higher power and accuracy) increase as the covariance increases.

We also assumed that the repeated measures were uncorrelated. This is probably too strong an assumption in some real situations, because a typical characteristic of single-case data is that the measurements are taken in rapid succession. Shadish and Sullivan (2011) showed that the size of the autocorrelation varies substantially over studies. In previous research in which a two-level model was used to analyze SSEDs (Ferron et al., 2009), it was found that not modeling an existing autocorrelation results in biased parameter estimates. In this study, we modeled the level one errors as \( \sigma^2 I \), but many other covariance structures are possible, of which the first-order autoregressive structure is often used to model autocorrelation. Thus, future research should explore scenarios in which the repeated measures are autocorrelated, assess the impact of failing to model this autocorrelation, and evaluate recovery of model parameters, given the small number of data points typically encountered in single-case design research.

Can single-case studies be combined with a multilevel meta-analysis? The answer is yes when it is not necessary to standardize the effect sizes. And even if effect sizes must be standardized because the studies’ outcomes are on different scales, there are no real problems as long as the studies’ effects are reasonably homogeneous or when there are many measurements per individual and many studies being meta-analyzed. But the method does not work well for standardized effect sizes when there are only a few measurements for each subject; this situation calls for an adaptation of the method and additional research into alternative estimation procedures.