Re-training writing raters online: How does it compare with face-to-face training?
Section snippets
Background
Although there have been substantial advances in automated rating of writing in recent years (Jamieson, 2005), it is still the norm in writing assessment to use human raters. Unfortunately, their judgements are prone to various sources of bias and error which can ultimately compromise the quality of the ratings. A number of studies using a range of psychometric methods have identified various rater effects (Myford and Wolfe, 2003, 2004) which need to be addressed if an …
The assessment instrument
The Diagnostic English Language Needs Assessment (DELNA) is an initiative funded by the university to identify the academic English needs of undergraduate students following their admission to a degree programme. Those who are found to be at risk are offered suitable English language support. DELNA consists of a screening and a diagnostic component. The main purpose of the screening, which is made up of vocabulary and text editing tasks, is to identify students who are highly proficient users of …
Results
We will present the results of the study by working systematically through the research questions, covering first the outcomes of the FACETS analysis and then the qualitative data from the questionnaires and interviews.
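For readers unfamiliar with what a FACETS analysis produces: FACETS (Linacre, 2006) estimates a many-facet Rasch model in which each rating is decomposed into an examinee ability, a rater severity, and rating-scale thresholds, so that raters' relative harshness can be compared on a common logit scale. The sketch below illustrates that core idea on simulated data. It is a minimal, assumption-laden illustration (invented data, thresholds held fixed, a bare-bones joint maximum-likelihood routine, and made-up names such as `category_probs`), not the program's algorithm or the study's actual analysis.

```python
# Minimal sketch of the many-facet Rasch model (MFRM) underlying a
# FACETS-style analysis. Everything here is illustrative: the data are
# simulated and the estimator is a bare-bones gradient-ascent JMLE routine.
import numpy as np

rng = np.random.default_rng(0)

n_examinees, n_raters, K = 30, 6, 5             # ratings on a 0..5 scale
true_theta = rng.normal(0.0, 1.0, n_examinees)  # examinee ability
true_alpha = rng.normal(0.0, 0.5, n_raters)     # rater severity
tau = np.linspace(-1.5, 1.5, K)                 # rating-scale thresholds

def category_probs(theta, alpha, tau):
    """P(X = k), k = 0..K, under the rating-scale MFRM:
    log[P(k) / P(k-1)] = theta - alpha - tau_k."""
    logits = np.concatenate(([0.0], np.cumsum(theta - alpha - tau)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Fully crossed design: every rater scores every examinee once.
ratings = np.array([[rng.choice(K + 1, p=category_probs(t, a, tau))
                     for a in true_alpha] for t in true_theta])

# Joint maximum-likelihood estimation by gradient ascent. For adjacent-
# category models the score function is simply observed minus expected.
# (The thresholds tau are held at their generating values for brevity.)
theta = np.zeros(n_examinees)
alpha = np.zeros(n_raters)
ks = np.arange(K + 1)
for _ in range(2000):
    g_theta = np.zeros(n_examinees)
    g_alpha = np.zeros(n_raters)
    for n in range(n_examinees):
        for j in range(n_raters):
            expected = ks @ category_probs(theta[n], alpha[j], tau)
            resid = ratings[n, j] - expected
            g_theta[n] += resid      # d loglik / d theta
            g_alpha[j] -= resid      # severity enters with a minus sign
    theta += 0.05 * g_theta / n_raters
    alpha += 0.05 * g_alpha / n_examinees
    alpha -= alpha.mean()            # centre severities (identifiability)

print("true severities:     ", np.round(true_alpha - true_alpha.mean(), 2))
print("estimated severities:", np.round(alpha, 2))
```

In this framing, the "severity" discussed in the findings corresponds to the spread of the estimated alpha values: successful training should pull raters' severities closer together on the logit scale.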
Discussion and conclusion
The findings indicate that, in terms of severity, both forms of training were successful in bringing the raters closer together in their ratings. There was an indication that the online training might have been slightly more successful. Both groups rated consistently before and after the training. Afterwards, the online group might have become slightly more consistent, whilst the face-to-face group rated with slightly more variation. On the individual level, only one rater moved outside the …
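The "consistency" referred to above is conventionally quantified in many-facet Rasch analyses with fit statistics. Continuing the illustrative sketch from the Results section (same caveats apply, and `rater_infit` is an invented name), the function below computes the standard information-weighted (infit) mean-square for each rater:

```python
# Continuing the sketch above: an information-weighted (infit) mean-square
# per rater, the usual MFRM index of rating consistency. Textbook formula,
# not the study's actual output.
def rater_infit(ratings, theta, alpha, tau):
    """Infit MS_j = sum_n (x_nj - E_nj)^2 / sum_n Var(X_nj).
    Values near 1 indicate ratings consistent with model expectations."""
    ks = np.arange(len(tau) + 1)
    sq_resid = np.zeros(len(alpha))
    model_var = np.zeros(len(alpha))
    for n in range(len(theta)):
        for j in range(len(alpha)):
            p = category_probs(theta[n], alpha[j], tau)
            e = ks @ p                         # expected rating
            sq_resid[j] += (ratings[n, j] - e) ** 2
            model_var[j] += (ks - e) ** 2 @ p  # model variance of the rating
    return sq_resid / model_var

print("rater infit MS:", np.round(rater_infit(ratings, theta, alpha, tau), 2))
```

Cut-off conventions vary by author, but infit values near 1 are generally read as ratings that vary about as much as the model expects, values well above 1 as noisy or inconsistent rating, and values well below 1 as overly predictable rating. A group whose infit values tighten toward 1 after training would support the "more consistent" reading above.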
References (19)
- Hamilton, J., et al. Teachers' perceptions of on-line rater training and monitoring. System (2001).
- Congdon, P. J., & McQueen, J. The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement (2000).
- Elder, C., et al. Evaluating rater responses to an online rater training program. Language Testing (2007).
- Reliability statistics. Rasch Measurement: Transactions of the Rasch Measurement SIG (1992).
- Jamieson, J. Trends in computer-based second language assessment. Annual Review of Applied Linguistics (2005).
- Evaluating the efficacy of rater self-training (1993).
- Landy, F. J., & Farr, J. L. The measurement of work performance: Methods, theory, and application (1983).
- Linacre, J. M. Facets Rasch measurement computer program (2006).
- Lumley, T., & McNamara, T. F. Rater characteristics and rater bias: Implications for training. Language Testing (1995).
Cited by (82)
- Halo effects in rating data: Assessing speech fluency. Research Methods in Applied Linguistics (2023).
- Using a logic model to evaluate rater training for EAP writing assessment. Journal of English for Academic Purposes (2022).
- Individualized feedback to raters in language assessment: Impacts on rater effects. Assessing Writing (2022).
  Citation excerpt: Likewise, they used another term, "differential rater leniency," to describe the case that a rater tends to assign higher scores to one or more particular groups of examinees than model expectations. Knoch et al. (2007) combined these two terms and extended their application from one particular group of examinees to an aspect of one facet. They used "bias effect" to describe the case that a rater tends to assign scores that are relatively low or high in terms of an aspect of one facet.
- Validating a rubric for assessing integrated writing in an EAP context. Assessing Writing (2022).
  Citation excerpt: One of the key elements in an evidence-based approach is the elicitation and interpretation of rater perceptions about evaluation criteria. Studies drawing on qualitative methods, such as interviews and think-aloud protocols, have shown that rater perceptions play an important role in the standardization of scoring rubrics (Knoch et al., 2007). For example, Cumming, Kantor, and Powers (2001) found that while raters attended to rhetoric and content when scoring integrated writing tasks, they focused more on language use when scoring independent tasks.
- Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2023).