Original Articles and Reviews

Assessing Test-Retest Reliability of Psychological Measures

Persistent Methodological Problems

Published Online: https://doi.org/10.1027/1016-9040/a000298

Abstract. Psychological research and clinical practice rely heavily on psychometric testing for measuring psychological constructs that represent symptoms of psychopathology, individual difference characteristics, or cognitive profiles. Test-retest reliability assessment is crucial in the development of psychometric tools, helping to ensure that measurement variation is due to replicable differences between people regardless of time, target behavior, or user profile. While psychological studies testing the reliability of measurement tools are pervasive in the literature, many still discuss and assess this form of reliability inappropriately with regard to the specified aims of the study or the intended use of the tool. The current paper outlines important factors to consider in test-retest reliability analyses, common errors, and some initial methods for conducting and reporting reliability analyses to avoid such errors. The paper aims to highlight a persistently problematic area in psychological assessment, to illustrate the real-world impact that these problems can have on measurement validity, and to offer relatively simple methods for improving the validity and practical use of reliability statistics.
