Assessing Test-Retest Reliability of Psychological Measures
Persistent Methodological Problems
Abstract
Psychological research and clinical practice rely heavily on psychometric testing for measuring psychological constructs that represent symptoms of psychopathology, individual difference characteristics, or cognitive profiles. Test-retest reliability assessment is crucial in the development of psychometric tools, helping to ensure that measurement variation is due to replicable differences between people regardless of time, target behavior, or user profile. While psychological studies testing the reliability of measurement tools are pervasive in the literature, many still discuss and assess this form of reliability inappropriately with regard to the specified aims of the study or the intended use of the tool. The current paper outlines important factors to consider in test-retest reliability analyses, common errors, and some initial methods for conducting and reporting reliability analyses to avoid such errors. The paper aims to highlight a persistently problematic area in psychological assessment, to illustrate the real-world impact that these problems can have on measurement validity, and to offer relatively simple methods for improving the validity and practical use of reliability statistics.