Statistical Methods For Assessing Measurement Error (Reliability) in Variables Relevant to Sports Medicine

  • Review Article
  • Sports Medicine

Abstract

Minimal measurement error (reliability) during the collection of interval- and ratio-type data is critically important to sports medicine research. The main components of measurement error are systematic bias (e.g. general learning or fatigue effects on the tests) and random error due to biological or mechanical variation. Both error components should be meaningfully quantified for the sports physician to relate the described error to judgements regarding ‘analytical goals’ (the requirements of the measurement tool for effective practical use) rather than the statistical significance of any reliability indicators.

Methods based on correlation coefficients and regression provide an indication of ‘relative reliability’. Since these methods are highly influenced by the range of measured values, researchers should be cautious in: (i) concluding acceptable relative reliability even if a correlation is above 0.9; (ii) extrapolating the results of a test-retest correlation to a new sample of individuals involved in an experiment; and (iii) comparing test-retest correlations between different reliability studies.
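The dependence of a test-retest correlation on the range of measured values can be illustrated with a minimal sketch. All numbers below are hypothetical and invented for illustration: the same measurement errors are applied to a heterogeneous sample and to a homogeneous sample, so the absolute error is identical while the correlation is not.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def sd_of_differences(x, y):
    """Sample standard deviation of the test-retest differences."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    md = sum(d) / n
    return sqrt(sum((v - md) ** 2 for v in d) / (n - 1))

# Hypothetical measurement errors, reused unchanged for both samples.
errors_test = [2, -3, 1, -1, 3, -2, 0, 1]
errors_retest = [-1, 2, -2, 3, -3, 1, 2, -2]

# Hypothetical "true" scores: one wide-ranging sample, one narrow one.
wide_true = [100, 120, 140, 160, 180, 200, 220, 240]
narrow_true = [165, 167, 169, 171, 173, 175, 177, 179]

def measure(true_scores):
    """Add the fixed errors to the true scores to get test and retest data."""
    test = [t + e for t, e in zip(true_scores, errors_test)]
    retest = [t + e for t, e in zip(true_scores, errors_retest)]
    return test, retest

wide_test, wide_retest = measure(wide_true)
narrow_test, narrow_retest = measure(narrow_true)

r_wide = pearson_r(wide_test, wide_retest)
r_narrow = pearson_r(narrow_test, narrow_retest)
sd_wide = sd_of_differences(wide_test, wide_retest)
sd_narrow = sd_of_differences(narrow_test, narrow_retest)

print(f"wide sample:   r = {r_wide:.3f}, SD of differences = {sd_wide:.2f}")
print(f"narrow sample: r = {r_narrow:.3f}, SD of differences = {sd_narrow:.2f}")
```

In this sketch the wide sample yields a correlation well above 0.9 and the narrow sample a markedly lower one, although the standard deviation of the differences is the same in both. This is why a high correlation alone need not indicate acceptable reliability.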

Methods used to describe ‘absolute reliability’ include the standard error of measurement (SEM), coefficient of variation (CV) and limits of agreement (LOA). These statistics are more appropriate for comparing reliability between different measurement tools in different studies. They can be calculated from ANOVA procedures in multiple-retest studies, help predict the magnitude of a ‘real’ change in individual athletes, and be used to estimate statistical power for a repeated-measures experiment.
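As a rough sketch of how these three statistics relate, the following computes them for a two-trial test-retest design. The data are hypothetical. The SEM is taken here as the standard deviation of the differences divided by √2, which is one common two-trial calculation and an assumption of this sketch; other calculation methods exist and can give different values.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical test-retest scores (arbitrary units), invented for illustration.
test = [172, 185, 160, 198, 176, 169, 190, 181]
retest = [175, 183, 164, 196, 180, 170, 187, 185]

# Test-retest differences for each individual.
diffs = [b - a for a, b in zip(test, retest)]

# Systematic bias: mean difference between retest and test.
bias = mean(diffs)

# Random error: sample standard deviation of the differences.
sd_diff = stdev(diffs)

# 95% limits of agreement: bias +/- 1.96 * SD of the differences.
loa_lower = bias - 1.96 * sd_diff
loa_upper = bias + 1.96 * sd_diff

# SEM for two trials (assumed formula): SD of differences / sqrt(2).
sem = sd_diff / sqrt(2)

# CV: SEM expressed as a percentage of the grand mean.
grand_mean = mean(test + retest)
cv_percent = 100 * sem / grand_mean

print(f"bias = {bias:.2f}")
print(f"95% LOA = {loa_lower:.2f} to {loa_upper:.2f}")
print(f"SEM = {sem:.2f}, CV = {cv_percent:.2f}%")
```

Under these assumptions the half-width of the LOA equals 1.96 × √2 × SEM, roughly 2.77 times the SEM, so the LOA describe a much wider band of error than the SEM or CV for the same data.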

These methods vary considerably in the way they are calculated and their use also assumes the presence (CV) or absence (SEM) of heteroscedasticity. Most methods of calculating SEM and CV represent approximately 68% of the error that is actually present in the repeated measurements for the ‘average’ individual in the sample. LOA represent the test-retest differences for 95% of a population. The associated Bland-Altman plot shows the measurement error schematically and helps to identify the presence of heteroscedasticity. If there is evidence of heteroscedasticity or non-normality, one should logarithmically transform the data and quote the bias and random error as ratios. This allows simple comparisons of reliability across different measurement tools.
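A minimal sketch of the heteroscedasticity check and the log-transformed (ratio) limits of agreement follows. The data are hypothetical, and the check used here (correlating the absolute differences with the individual means) is one simple option assumed for illustration; inspecting the Bland-Altman plot remains the primary tool.

```python
from math import exp, log, sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical positive, ratio-scale data in which error grows with magnitude.
test = [50, 80, 110, 150, 200, 260, 330, 400]
retest = [52, 76, 117, 141, 210, 250, 353, 372]

# Bland-Altman quantities: individual means and absolute differences.
means = [(a + b) / 2 for a, b in zip(test, retest)]
abs_diffs = [abs(b - a) for a, b in zip(test, retest)]

# A positive correlation suggests that error increases with the measured value.
hetero_r = pearson_r(means, abs_diffs)

# Log-transform, then compute bias and random error on the log scale.
log_diffs = [log(b) - log(a) for a, b in zip(test, retest)]
log_bias = mean(log_diffs)
log_sd = stdev(log_diffs)

# Antilog to express bias and random error as dimensionless ratios.
ratio_bias = exp(log_bias)
ratio_random = exp(1.96 * log_sd)

# Ratio limits of agreement: bias divided and multiplied by the random-error ratio.
ratio_loa_lower = ratio_bias / ratio_random
ratio_loa_upper = ratio_bias * ratio_random

print(f"heteroscedasticity correlation = {hetero_r:.2f}")
print(f"ratio bias = {ratio_bias:.3f}, random error ratio = {ratio_random:.3f}")
print(f"95% ratio LOA = {ratio_loa_lower:.3f} to {ratio_loa_upper:.3f}")
```

Because the ratio limits are dimensionless, they can be compared directly across measurement tools that use different units.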

We recommend that sports clinicians and researchers cite and interpret several statistical methods when assessing reliability. We encourage the inclusion of the LOA method, especially the exploration of heteroscedasticity that is inherent in this analysis. We also stress the importance of relating the results of any reliability statistic to ‘analytical goals’ in sports medicine.

Author information

Corresponding author

Correspondence to Greg Atkinson.

About this article

Cite this article

Atkinson, G., Nevill, A.M. Statistical Methods For Assessing Measurement Error (Reliability) in Variables Relevant to Sports Medicine. Sports Med 26, 217–238 (1998). https://doi.org/10.2165/00007256-199826040-00002

  • DOI: https://doi.org/10.2165/00007256-199826040-00002
