Re-training writing raters online: How does it compare with face-to-face training?
Section snippets
Background
Although there have been substantial advances in automated rating of writing in recent years (Jamieson, 2005), it is still the norm in writing assessment to use human raters. Unfortunately, their judgements are prone to various sources of bias and error which can ultimately compromise the quality of the ratings. A number of studies using a range of psychometric methods have identified various rater effects (Myford and Wolfe, 2003, 2004) which need to be addressed if an …
The assessment instrument
The Diagnostic English Language Needs Assessment (DELNA) is an initiative funded by the university to identify the academic English needs of undergraduate students following their admission to a degree programme. Those who are found to be at risk are offered suitable English language support. DELNA consists of a screening and a diagnostic component. The main purpose of the screening, which is made up of vocabulary and text editing tasks, is to identify students who are highly proficient users of …
Results
We will present the results of the study by working systematically through the research questions, covering first the outcomes of the FACETS analysis and then the qualitative data from the questionnaires and interviews.
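For readers unfamiliar with what a FACETS analysis produces: FACETS (Linacre, 2006) estimates a many-facet Rasch model in which each rating is decomposed into an examinee ability, a rater severity, and rating-scale thresholds, so that raters' relative harshness can be compared on a common logit scale. The sketch below illustrates that core idea on simulated data. It is a minimal, assumption-laden illustration (invented data, thresholds held fixed, a bare-bones joint maximum-likelihood routine, and made-up names such as `category_probs`), not the program's algorithm or the study's actual analysis.

```python
# Minimal sketch of the many-facet Rasch model (MFRM) underlying a
# FACETS-style analysis. Everything here is illustrative: the data are
# simulated and the estimator is a bare-bones gradient-ascent JMLE routine.
import numpy as np

rng = np.random.default_rng(0)

n_examinees, n_raters, K = 30, 6, 5             # ratings on a 0..5 scale
true_theta = rng.normal(0.0, 1.0, n_examinees)  # examinee ability
true_alpha = rng.normal(0.0, 0.5, n_raters)     # rater severity
tau = np.linspace(-1.5, 1.5, K)                 # rating-scale thresholds

def category_probs(theta, alpha, tau):
    """P(X = k), k = 0..K, under the rating-scale MFRM:
    log[P(k) / P(k-1)] = theta - alpha - tau_k."""
    logits = np.concatenate(([0.0], np.cumsum(theta - alpha - tau)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Fully crossed design: every rater scores every examinee once.
ratings = np.array([[rng.choice(K + 1, p=category_probs(t, a, tau))
                     for a in true_alpha] for t in true_theta])

# Joint maximum-likelihood estimation by gradient ascent. For adjacent-
# category models the score function is simply observed minus expected.
# (The thresholds tau are held at their generating values for brevity.)
theta = np.zeros(n_examinees)
alpha = np.zeros(n_raters)
ks = np.arange(K + 1)
for _ in range(2000):
    g_theta = np.zeros(n_examinees)
    g_alpha = np.zeros(n_raters)
    for n in range(n_examinees):
        for j in range(n_raters):
            expected = ks @ category_probs(theta[n], alpha[j], tau)
            resid = ratings[n, j] - expected
            g_theta[n] += resid      # d loglik / d theta
            g_alpha[j] -= resid      # severity enters with a minus sign
    theta += 0.05 * g_theta / n_raters
    alpha += 0.05 * g_alpha / n_examinees
    alpha -= alpha.mean()            # centre severities (identifiability)

print("true severities:     ", np.round(true_alpha - true_alpha.mean(), 2))
print("estimated severities:", np.round(alpha, 2))
```

In this framing, the "severity" discussed in the findings corresponds to the spread of the estimated alpha values: successful training should pull raters' severities closer together on the logit scale.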
Discussion and conclusion
The findings indicate that, in terms of severity, both forms of training were successful in bringing the raters closer together in their ratings. There was an indication that the online training might have been slightly more successful. Both groups rated consistently before and after the training. Afterwards, the online group might have become slightly more consistent, whilst the face-to-face group rated with slightly more variation. On the individual level, only one rater moved outside the …
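The "consistency" referred to above is conventionally quantified in many-facet Rasch analyses with fit statistics. Continuing the illustrative sketch from the Results section (same caveats apply, and `rater_infit` is an invented name), the function below computes the standard information-weighted (infit) mean-square for each rater:

```python
# Continuing the sketch above: an information-weighted (infit) mean-square
# per rater, the usual MFRM index of rating consistency. Textbook formula,
# not the study's actual output.
def rater_infit(ratings, theta, alpha, tau):
    """Infit MS_j = sum_n (x_nj - E_nj)^2 / sum_n Var(X_nj).
    Values near 1 indicate ratings consistent with model expectations."""
    ks = np.arange(len(tau) + 1)
    sq_resid = np.zeros(len(alpha))
    model_var = np.zeros(len(alpha))
    for n in range(len(theta)):
        for j in range(len(alpha)):
            p = category_probs(theta[n], alpha[j], tau)
            e = ks @ p                         # expected rating
            sq_resid[j] += (ratings[n, j] - e) ** 2
            model_var[j] += (ks - e) ** 2 @ p  # model variance of the rating
    return sq_resid / model_var

print("rater infit MS:", np.round(rater_infit(ratings, theta, alpha, tau), 2))
```

Cut-off conventions vary by author, but infit values near 1 are generally read as ratings that vary about as much as the model expects, values well above 1 as noisy or inconsistent rating, and values well below 1 as overly predictable rating. A group whose infit values tighten toward 1 after training would support the "more consistent" reading above.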
References (19)
- Hamilton, J., et al. Teachers' perceptions of on-line rater training and monitoring. System (2001).
- Congdon, P. J., & McQueen, J. The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement (2000).
- Elder, C., et al. Evaluating rater responses to an online rater training program. Language Testing (2007).
- Reliability statistics. Rasch Measurement: Transactions of the Rasch Measurement SIG (1992).
- Jamieson, J. Trends in computer-based second language assessment. Annual Review of Applied Linguistics (2005).
- Evaluating the efficacy of rater self-training (1993).
- Landy, F. J., & Farr, J. L. The measurement of work performance: Methods, theory, and application (1983).
- Linacre, J. M. Facets Rasch measurement computer program (2006).
- Lumley, T., & McNamara, T. F. Rater characteristics and rater bias: Implications for training. Language Testing (1995).
Cited by (82)
- Halo effects in rating data: Assessing speech fluency. Research Methods in Applied Linguistics (2023).
- Using a logic model to evaluate rater training for EAP writing assessment. Journal of English for Academic Purposes (2022).
- Individualized feedback to raters in language assessment: Impacts on rater effects. Assessing Writing (2022).
  Citation excerpt: Likewise, they used another term, "differential rater leniency," to describe the case that a rater tends to assign higher scores to one or more particular groups of examinees than model expectations. Knoch et al. (2007) combined these two terms and extended their application from one particular group of examinees to an aspect of one facet. They used "bias effect" to describe the case that a rater tends to assign scores that are relatively low or high in terms of an aspect of one facet.
- Validating a rubric for assessing integrated writing in an EAP context. Assessing Writing (2022).
  Citation excerpt: One of the key elements in an evidence-based approach is the elicitation and interpretation of rater perceptions about evaluation criteria. Studies drawing on qualitative methods, such as interviews and think-aloud protocols, have shown that rater perceptions play an important role in the standardization of scoring rubrics (Knoch et al., 2007). For example, Cumming, Kantor, and Powers (2001) found that while raters attended to rhetoric and content when scoring integrated writing tasks, they focused more on language use when scoring independent tasks.
- Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments (2023).