Introduction

A large proportion of falls in older adults occurs during locomotion [1–3]. These falls are often attributed to a decreased quality of gait, due to age-related peripheral [4] and central [5] impairments. Gait variability and local dynamic stability have received much attention as fall-related measures of gait quality [6, 7], and several studies have confirmed that these parameters are, indeed, related to fall risk [8–13]. Although the ability of such measures to predict actual fall risk ultimately remains to be shown, their use as outcome variables in intervention studies might allow faster iterative development of fall prevention programs, because assessing actual fall risk by gathering fall incidence data requires a long follow-up period. While reliability of gait variability and stability estimates can to some extent be improved by using treadmill walking to collect data from a large number of strides [14–17], a recent study indicated that reliability between sessions is still only moderate [18]. The statistical consequences of limited test–retest reliability can be overcome by adjusting the measurement strategy, but previous reports do not allow inferences on optimal measurement strategies. In studies investigating differences in gait quality between conditions in a population, the optimal measurement strategy, in terms of the number of subjects and the number of measurements per subject, depends on the variance of the gait parameters between and within subjects.

The first and main aim of this study was to estimate between- and within-subject variance components of gait variability and stability measures in treadmill walking, to allow estimation of the number of subjects necessary to obtain sufficient statistical power in studies that are aimed at detecting relevant differences between conditions in a repeated-measures design using subjects as their own controls. The second aim was to determine how the number of measurement days or measurements per day (i.e., the within-subject data collection strategy) influences the required numbers of subjects to detect differences between conditions with sufficient statistical power.

Materials and methods

Subjects

Sixteen older subjects [n female = 9, n male = 7, mean age 65.6 (SD 5.9) years, mean weight 77.5 (SD 15.3) kg, mean height 1.74 (SD 0.09) m], without physical impairments interfering with their walking ability, participated in this study. All subjects gave informed written consent. The ethics committee of the Faculty of Human Movement Sciences, VU University Amsterdam approved the experimental protocol in accordance with the Declaration of Helsinki.

Study design

Time series of 5 min of treadmill walking at 3.0 km h−1 were collected during four trials (two trials on each of 2 days). In between the walking trials, subjects performed a 15-min trial of perturbed walking at 3.0 km h−1 for another study. Subjects were allowed to rest as long as needed in between walking trials. The median number of days in between the two measurement days was 5 (range 1–21). Subjects were asked to perform their normal activities on the day before each measurement day.

Procedure

Upon arrival at the laboratory, each subject was first informed about the measurement procedure and then familiarized with treadmill walking. Subjects were allowed to practice treadmill walking for any amount of time. In general, subjects were comfortable with treadmill walking within 5 min. Subjects were instrumented with clusters of three LEDs on the trunk, at the level of T6, and on both feet. An optoelectronic system (Optotrak, Northern Digital Inc., Waterloo, Ontario) measured the LED positions at 50 samples s−1.

Gait measures

The extracted gait variability measures were variability of medio-lateral trunk center of mass velocity (VARml), stride-time-variability (VARST) and step-width-variability (VARSW) of the final 150 strides of each trial (approximately the final 2–3 min). VARml was calculated as the mean of the standard deviations of medio-lateral trunk velocities at each increment of normalized time (0–100 %) of the measured strides. Trunk center of mass position was estimated based on the position of the LED-cluster attached to the trunk, trunk circumference and the position of several bony landmarks relative to the cluster [19]. The data were low-pass filtered (20 Hz, second-order lowpass Butterworth), for gait variability measures only, before 3-point differentiation to obtain trunk velocities. VARST was calculated as the standard deviation of the final 150 stride times. Stride time was calculated as the time between consecutive foot contacts of the same foot, which were determined as the local minima of the vertical position of the feet cluster markers. Step width was calculated as the maximal perpendicular distance relative to the walking direction between the lateral malleoli for each step. VARSW was calculated as the standard deviation of the final 300 steps.
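The two kinds of variability measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of the sample standard deviation (`ddof=1`), the 101-point normalized time grid, and the linear interpolation used for time normalization are all assumptions that the text does not specify.

```python
import numpy as np

def stride_time_variability(foot_contact_times, n_strides=150):
    """SD of the final n_strides stride times, given contact times (s) of one foot."""
    stride_times = np.diff(np.asarray(foot_contact_times, dtype=float))
    # ddof=1 (sample SD) is an assumption of this sketch
    return float(np.std(stride_times[-n_strides:], ddof=1))

def time_normalized_variability(signal, contact_idx, n_strides=150, n_points=101):
    """Mean over normalized time (0-100 %) of the across-stride SD of a signal."""
    signal = np.asarray(signal, dtype=float)
    # n_strides strides are delimited by n_strides + 1 foot contacts
    contact_idx = np.asarray(contact_idx)[-(n_strides + 1):]
    grid = np.linspace(0.0, 1.0, n_points)
    strides = []
    for start, stop in zip(contact_idx[:-1], contact_idx[1:]):
        segment = signal[start:stop + 1]
        phase = np.linspace(0.0, 1.0, len(segment))
        # Resample each stride onto the common normalized time grid
        strides.append(np.interp(grid, phase, segment))
    sd_per_time_point = np.std(np.array(strides), axis=0, ddof=1)
    return float(np.mean(sd_per_time_point))
```

Under these assumptions, the first function corresponds to VARST when fed the contact times of one foot, and the second corresponds to VARml when `signal` is the medio-lateral trunk velocity and `contact_idx` holds the sample indices of consecutive contacts of the same foot.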

Gait stability was quantified using local divergence exponents (LDE) [20]. LDEs describe how small initial differences in kinematics progress over the course of a step. The method for calculating the LDE has been described previously in more detail [16, 20]. In the present study, we used a reconstructed state-space based on a single time-series of medio-lateral trunk velocity and a state-space reconstructed from trunk kinematics in six degrees of freedom, to obtain LDEml and LDEtrunk, respectively. Parameters for state space reconstruction were based on data-driven estimates of the appropriate time-delay using the average mutual information procedure and the required number of embedded dimensions using the global false nearest neighbor analysis. LDEml was determined from a 5-dimensional state-space from embedded medio-lateral trunk velocity time-series, with a delay of 10 samples. LDEtrunk was based on a 12-dimensional state space reconstructed by combining the 3-dimensional linear and angular velocities of the trunk and their time delayed copies. The embedding delay for this 12-dimensional state-space was 25 samples. Rosenstein’s algorithm was used to calculate the LDE [21] from the state space reconstructions. In short, for each time point in state-space, a nearest neighbor was found and the Euclidean distance between these points in state-space was tracked, resulting in a number of time–distance curves equal to the number of time points in state space. The divergence curve was then calculated as the mean of the natural log of the time–distance curves. Finally, the LDE was determined as the slope of the linear fit through the first 50 samples (time needed for one step on average) of the divergence curve, corresponding to the initial period of rapid exponential divergence. Thus, the LDE indicates the rate of logarithmic divergence as a result of differences in initial conditions over the time needed for one step. A positive LDE indicates local instability.
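The steps summarized above (delay embedding, nearest-neighbor search, divergence curve, linear fit) can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study: the function names are hypothetical, the `min_separation` constraint that keeps temporally close points from being chosen as neighbors is assumed here and not stated in the text, and the slope is returned per sample, so it must be rescaled to obtain a rate per step. The brute-force distance matrix needs memory proportional to the square of the number of points, so it is suitable only for short series.

```python
import numpy as np

def delay_embed(x, dim, delay):
    """Time-delay embedding: row t is [x(t), x(t+delay), ..., x(t+(dim-1)*delay)]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * delay
    return np.column_stack([x[i * delay: i * delay + n] for i in range(dim)])

def local_divergence_exponent(states, fit_samples=50, min_separation=50):
    """Return (slope per sample, divergence curve) for a state-space trajectory."""
    states = np.asarray(states, dtype=float)
    n = len(states) - fit_samples           # points that can be tracked for the whole window
    ref = states[:n]
    # Brute-force pairwise Euclidean distances, used only for the neighbor search
    sq = np.sum(ref ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * ref @ ref.T, 0.0))
    # Assumption of this sketch: exclude points that are close in time
    idx = np.arange(n)
    dist[np.abs(idx[:, None] - idx[None, :]) < min_separation] = np.inf
    nearest = np.argmin(dist, axis=1)
    # Time-distance curve for every point and its nearest neighbor
    steps = np.arange(fit_samples)
    diffs = states[idx[:, None] + steps] - states[nearest[:, None] + steps]
    curves = np.linalg.norm(diffs, axis=2)
    with np.errstate(divide="ignore"):
        log_curves = np.log(curves)
    log_curves[~np.isfinite(log_curves)] = np.nan   # ignore exact zero distances
    # Divergence curve: mean of the natural log of the time-distance curves
    divergence = np.nanmean(log_curves, axis=0)
    # LDE: slope of a linear fit through the first fit_samples samples
    slope = np.polyfit(steps, divergence, 1)[0]
    return slope, divergence
```

With the settings reported for LDEml, a call would look like `local_divergence_exponent(delay_embed(ml_velocity, dim=5, delay=10))`, where `ml_velocity` is a placeholder name for the medio-lateral trunk velocity time series.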

Statistical analysis

As pointed out in the introduction, power calculations in gait studies require information about between-subjects and within-subjects variance components of the gait measures of interest, the latter including variances between measurement days and between trials within a day. All gait measures were obtained, as described above, in two separate trials on each of two different days for each subject. The parent data set, thus, consisted of 64 values for each gait measure (16 subjects × 2 days × 2 trials). These 64 values provided the basis for the analyses of variance and power, performed for each separate gait measure. A nested random model was used to estimate variance components [22], by solving expected mean squares of the two-way (subject, day) ANOVA corresponding to this model. This assumes that no systematic sources of variance (fixed effects) are present in the data. To check the validity of this assumption, a repeated-measures ANOVA was performed to test for effects of day (first vs second) and trial (first vs second, within day) on each of the gait measures. Neither day, trial nor their interaction had any systematic effect (p > 0.05, absolute differences <5 %).
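For a balanced design such as the one described above, the variance components of a nested random model (trials within days within subjects) can be obtained by equating the observed mean squares to their expected values. The sketch below illustrates that calculation; the function name and the array layout are assumptions, and truncating negative estimates at zero is a common convention that the text does not confirm.

```python
import numpy as np

def nested_variance_components(y):
    """Estimate (mean, s2_bs, s2_bd, s2_wd) from y with shape (subjects, days, trials)."""
    y = np.asarray(y, dtype=float)
    a, b, n = y.shape                      # subjects, days per subject, trials per day
    grand = y.mean()
    subj_mean = y.mean(axis=(1, 2))
    day_mean = y.mean(axis=2)
    # Mean squares of the nested ANOVA
    ms_subj = b * n * np.sum((subj_mean - grand) ** 2) / (a - 1)
    ms_day = n * np.sum((day_mean - subj_mean[:, None]) ** 2) / (a * (b - 1))
    ms_trial = np.sum((y - day_mean[:, :, None]) ** 2) / (a * b * (n - 1))
    # Expected mean squares:
    #   E[ms_trial] = s2_wd
    #   E[ms_day]   = s2_wd + n * s2_bd
    #   E[ms_subj]  = s2_wd + n * s2_bd + b * n * s2_bs
    s2_wd = ms_trial
    s2_bd = max((ms_day - ms_trial) / n, 0.0)        # truncation at 0 is an assumption
    s2_bs = max((ms_subj - ms_day) / (b * n), 0.0)
    return grand, s2_bs, s2_bd, s2_wd
```

For the parent data set, `y` would have shape (16, 2, 2) for each gait measure.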

The estimates obtained from the parent data were the overall mean (m) and three variance components: variance between subjects (\( s_{\text{BS}}^{2} \)), variance between days within subjects (\( s_{\text{BD}}^{2} \)), and variance between trials within days within subjects (\( s_{\text{WD}}^{2} \)). These parameters can be used to estimate the number of subjects required to obtain sufficient power for different measurement strategies as outlined in the “Appendix”. For all analyses, the desired level of significance was set to 0.05 and power was set to 0.80. Additional assumptions needed regard the correlation (ρ) between measurements in the two compared conditions (e.g., before and after an intervention) at the level of individuals, i.e., the predictability of the result in one condition from that in the other for any particular subject. As far as we know, such values have not been reported for gait measures in the literature. Therefore, we explored a range of values of ρ (0.3–0.6–0.9) as possible scenarios.

Based on these settings, we estimated the required number of subjects, \( n_{\text{s}} \), to detect effects of 10 and 30 % of the mean of the reference condition for repeated-measures (paired) designs, under the scenario that only one trial was performed by each subject in each condition. The detectable effect sizes were arbitrarily chosen, but are in the order of magnitude reported in the literature for comparisons between fallers and non-fallers [8–10, 23–25].

To answer the second research question, we evaluated how a change in the number of measurement days or trials per day would influence the required number of subjects at a maintained statistical power. One or 2 measurement days and 1–3 trials per day were selected as realistic measurement strategies in clinical gait studies.
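How the variance components, the correlation ρ, and the measurement strategy combine into a required number of subjects can be illustrated with a generic calculation. The sketch below is not a reproduction of the equations in the “Appendix”, which are not shown here. It assumes a two-sided paired comparison with a normal approximation, within-subject errors that are independent between the two conditions, and a correlation ρ that applies to the error-free subject values. The function name and all numerical values in the usage note are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_subjects(mean, s2_bs, s2_bd, s2_wd, effect_fraction, rho,
                      n_days=1, n_trials=1, alpha=0.05, power=0.80):
    """Approximate number of subjects for a paired design (normal approximation)."""
    delta = effect_fraction * mean                     # difference to be detected
    # Variance of a subject's mean over n_days days with n_trials trials per day
    within = s2_bd / n_days + s2_wd / (n_days * n_trials)
    # Variance of the paired difference: the correlated true-value part plus
    # independent measurement error in each of the two conditions (assumption)
    var_diff = 2.0 * s2_bs * (1.0 - rho) + 2.0 * within
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * var_diff / delta ** 2)
```

With the hypothetical inputs `mean=1.0, s2_bs=0.04, s2_bd=0.01, s2_wd=0.01, effect_fraction=0.1, rho=0.6`, adding a second measurement day lowers the result more than adding a second trial on the same day, because an extra day divides both within-subject terms whereas an extra trial divides only the within-day term.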

To estimate the prediction intervals of the calculated distribution parameters in the parent data set (m, \( s_{\text{BS}}^{2} \), \( s_{\text{BD}}^{2} \), \( s_{\text{WD}}^{2} \)), and of the required numbers of subjects, we used a bootstrap technique [26, 27]. In short, 16 subjects were randomly drawn with replacement from the original 16 subjects, keeping the results from the four trials of each of the 16 selected subjects. Thus, one resampled bootstrap data set contained the same number of subjects and trials as the parent data set. For the resampled data set, the mean and variance components (m, \( s_{\text{BS}}^{2} \), \( s_{\text{BD}}^{2} \), \( s_{\text{WD}}^{2} \)) as well as \( n_{\text{s}} \) were estimated for all combinations of number of days and number of trials. This procedure was repeated for 5000 bootstrap data sets, and bias-corrected 95 % prediction intervals for each of the estimated parameters were obtained from the distribution of the 5000 determinations as a measure of estimation uncertainty [28]. All statistical analyses were done in R 2.13 [29].
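The subject-level resampling described above can be sketched as follows. Note the deliberate simplification: this sketch returns a plain percentile interval, not the bias-corrected interval used in the study, and the function name, the `statistic` callback, and the fixed random seed are assumptions made for illustration.

```python
import numpy as np

def bootstrap_interval(y, statistic, n_boot=5000, level=0.95, seed=0):
    """Percentile interval of statistic(y*), where y* resamples subjects (axis 0)."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n_subjects = y.shape[0]
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        # Draw subjects with replacement, keeping all trials of each drawn subject
        picked = rng.integers(0, n_subjects, size=n_subjects)
        estimates[i] = statistic(y[picked])
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(estimates, [tail, 100.0 - tail])
    return float(lower), float(upper)
```

Here `y` is assumed to have shape (subjects, days, trials), and `statistic` is any function that maps such an array to a single number, for example one variance component or a required number of subjects.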

Results

All three variance components, key factors for estimating the required numbers of subjects in any particular data collection strategy, were substantial (see Table 1). For the gait variability measures VARST, VARSW, and VARml, between-subject variance was larger than within-subject variance. For LDE measures, the sum of the two within-subject variance components was similar to the between-subjects variance, and between-days variance was two to three times larger than within-day variance. All variance components had wide 95 % prediction intervals.

Table 1 Distribution parameters of gait measures

The numbers of subjects required to obtain sufficient statistical power in studies collecting data from one trial on 1 day in each of the two compared conditions ranged from 7 to 13 for highly correlated (ρ = 0.9) data with a large effect (30 %), up to 78–192 for data with a low correlation (ρ = 0.3) and with a small effect (10 %; Table 2).

Table 2 Required numbers of subjects to detect differences of 10 and 30 % of the reference group mean value for repeated-measures (paired) research designs with different values of correlations between measurements within subjects (ρ)

The effect of changing the measurement strategy on the required number of subjects is illustrated for VARST in Fig. 1. Similar effects of changing the measurement strategy were obtained for the other gait measures. The largest decrease in the required numbers of subjects occurred when an additional measurement day was added. Conducting more trials on the same day did result in fewer required subjects, but it was generally less effective than increasing the number of measurement days, in particular when increasing the number of trials from two to three.

Fig. 1

The required number of subjects to detect differences in stride time variability, VARST, between two conditions using different repeated-measures designs. The required numbers of subjects (each measured in both conditions) to detect a 10 % (filled circles, left axis) or 30 % (unfilled circles, right axis) change of VARST in paired designs with ρ = 0.3, 0.6, 0.9 (b, c, d, respectively). Solid and dashed lines indicate measurement strategies of 1 and 2 measurement days (\( n_{\text{d}} = 1 \) and \( n_{\text{d}} = 2 \)), respectively. Results for one measurement day and one trial per day are identical to those shown in Table 2. Error bars show 95 % prediction intervals according to the bootstrap procedure.

Discussion

The main objective of this paper was to assess the numbers of subjects required to obtain sufficient statistical power (80 %) for detecting specified differences in gait measures between two conditions using subjects as their own controls, i.e., a repeated-measures design. In this study, we set the differences to 10 and 30 % of the mean value in the reference condition based on results reported in literature. These differences are in line with suggested meaningful changes reported by Brach et al. [30], i.e., 0.01 s for stance time and swing time variability and 0.25 cm for step length variability. These changes correspond to approximately 10 and 30 %, respectively, of the baseline mean value of these gait measures. However, more research on clinically relevant change in gait variability is warranted. To the best of our knowledge, there is no literature on meaningful or relevant changes of LDE. While we have exemplified calculation procedures and effects on study sizes using the 10 and 30 % differences, any other expected effects can be addressed using the data and equations presented in the paper and “Appendix”.

Regarding effects of physical training on gait variability, one small study [31] reported a large effect (35 %) and one large study a small (4 %) and non-significant effect [31]. To the best of our knowledge, no reports are available on effects of physical training on gait LDE. A meta-analysis on training effects on standing balance reported a small effect size, i.e., 11 % [32]. The results of the present study demonstrate that when expected differences are small, as illustrated by a 10 % change of the group mean, the required number of subjects is large (Table 2). Since a change of 10 % or even less in gait measures between conditions might be clinically relevant [30], it is advisable to measure a large number of subjects and to report both significant and non-significant results of several gait measures to allow future meta-analyses.

The dominant cause of the need for large study sizes is the large gross between-subjects variance of gait measures, which in turn depends on the between-subjects variance and the variance associated with estimating a mean value of a gait measure in each subject. The latter affects the uncertainty associated with gait studies in its own right and also decreases the effective correlation between pairs of measurements (cf. “Appendix”). Like the clinically relevant effect sizes, the correlations between pairs of measurements before and after intervention, which quantify the predictability of the intervention result for any subject, are largely unknown. Van Schooten et al. [33] found correlations between conditions ranging from 0.55 to 0.97 for gait variability measures and LDE (personal communication). Hak et al. [34] found that the predictability of gait variability and stability measures varied with the effect size, small effects showing correlations from 0.33 to 0.79 and large effects showing correlations between −0.28 and 0.56 (personal communication). A conservative estimate of the correlation may therefore be justified. We tested different sizes of the “true”, error-free correlation between measurements in the pre- and post-intervention conditions in our analyses. From Fig. 1, it is clear that the correlation had a large influence on the required numbers of subjects. The error-free correlation is effectively reduced by the substantial within-subjects error associated with determining gait measures (see “Appendix”).

In the present study, we used treadmill walking at a fixed gait speed. Treadmill walking was used to allow collecting data from a large number of strides, to improve precision of estimates of gait variability [14, 15] and stability [16, 17]. In clinical practice, gait data are often collected in overground walking, using optoelectronic methods or electronic walkways, which limit data collection to a few strides. This increases within-subject variance and thus decreases statistical power to detect differences between groups and conditions. Data on larger numbers of strides can be collected in overground walking when using inertial sensors [35, 36], but the number of consecutive strides is usually still limited by spatial constraints. Therefore, as an alternative to collecting a large number of consecutive strides, the number of trials can be increased [37, 38]. It should be kept in mind that treadmill walking in itself affects gait variability and stability [39] and this may limit generalizability of the present results to overground walking, although statistical precision of stability estimates appears similar between overground [36, 38] and treadmill walking [18]. The fixed gait speed used may have affected the between- and within-subjects variance components. However, since we did not establish preferred gait speeds, and since there is no consensus on the nature of the relationship between gait speed on the one hand and gait variability [40–44] and LDE [40, 41, 45–47] on the other hand, it is impossible to estimate the effect of gait speed on the results. Thus, generalization to studies using preferred speed should be done with care.

For VARST and LDEml and LDEtrunk, the between-days variance was higher than the within-day variance, but the between-days variance was also substantial for the other gait measures. Since subjects were exposed to similar conditions on both measurement days, the large between-day variances imply that other factors might influence the gait measures on a particular day. It could be that healthy subjects have a broad array of variability and LDE within which, for example, balance and agility are sufficient, and thus not further controlled. This could imply that a more challenging gait assessment, i.e., using mechanical and/or cognitive challenges to bring gait more toward the boundary of stable gait, is required to assess gait quality. The requirement to maintain global stability in such conditions might reduce the redundancy of gait performance and consequently reduce within-subject variance. In addition, more challenging test conditions, whether mechanical or cognitive, may increase effect sizes, much like these conditions often increase between-group differences in stability and variability [e.g. 48, 49]. However, decreased between-group differences under more challenging conditions have also been described [e.g., 50] and consequently the effect of using more challenging test conditions on statistical power of measurement strategies requires further study.

Our analysis of the effects of changing the number of measurement days per subject and trials per day clearly demonstrated that the former is more effective in reducing the number of required subjects than the latter, but that both have an effect. The large increase in statistical power when measuring subjects on multiple days is an effect of the generally large between-days variance, while within-day variances were, in general, smaller. It should be noted, though, that it will always be more beneficial to allocate multiple measurements to different days than to collect them on the same day, since this will more effectively reduce the gross between-subject variance (“Appendix”, Eq. 4).

Within-subject variance components as well as between-subject variance may depend on the subject group studied. The present study involved healthy and relatively young (mean age 65 years) older adults. The results can, thus, not be generalized to patient populations or to older and potentially more frail adults.

Calculations of LDE allow for many different choices of the number of embedding dimensions and time-delays when constructing the state-space. While it is most common to use a fixed dimensionality (5D or 12D) of the state-space, different approaches to estimate these parameters have also been used [51]. Furthermore, the region of the divergence curve used to estimate the slope also needs to be selected. We did not investigate the effects of these choices on statistical power of LDE in gait studies. However, a study on the effects of these choices on the reliability of LDE estimates demonstrated that a fixed state-space reconstruction is generally more reliable than an individualized approach [36].

The prediction intervals of variance components (Table 1) and thus of the required number of subjects (Table 2) were wide, in the latter case particularly when investigating small differences between conditions. Wide prediction intervals of variance components are in line with reports from a few studies assessing postures and muscle activity in occupational settings [27, 52]. These wide prediction intervals complicate the determination of the required numbers of subjects. It has been suggested to base the study size on the 80th percentile of the distribution of the required number of subjects (cf. Table 2) rather than on the point estimate, which is in general downward (“optimistically”) biased [53]. The wide prediction intervals also imply that a pilot study with a small number of subjects is not likely to result in reliable data for power calculations. An unreliable power analysis could lead to underpowered studies and hence a waste of time, effort, and money in executing a study that will probably be inconclusive, but it could also result in overpowered studies, which would, indeed, have a high probability of resulting in statistically significant findings, but also consume unnecessarily large resources in reaching these results.

Conclusions

The results of the present study indicate that studies attempting to detect small changes in gait variability and stability between conditions measured in the same subjects (i.e., a repeated-measures design) need a large sample of subjects, generally well over 50, to obtain sufficient statistical power. To increase statistical power, increasing the number of measurement days is more effective than increasing the number of trials within a day. The presented results are important when interpreting studies that report small and non-significant effects.