Rater calibration when observational assessment occurs at large scale: Degree of calibration and characteristics of raters associated with calibration

https://doi.org/10.1016/j.ecresq.2011.12.006

Abstract

Observational assessment is used to study program and teacher effectiveness across large numbers of classrooms, but training a workforce of raters that can assign reliable scores when observations are used in large-scale contexts can be challenging and expensive. Limited data are available to speak to the feasibility of training large numbers of raters to calibrate to an observation tool, or the characteristics of raters associated with calibration. This study reports on the success of rater calibration across 2093 raters trained by the Office of Head Start (OHS) in 2008–2009 on the Classroom Assessment Scoring System (CLASS), and for a subsample of 704 raters, characteristics that predict their calibration. Findings indicate that it is possible to train large numbers of raters to calibrate to an observation tool, and rater beliefs about teachers and children predicted the degree of calibration. Implications for large-scale observational assessments are discussed.

Highlights

► Observation is being used to assess program and teacher quality at large scale.
► Data are needed on feasibility of training large numbers of observers.
► Efforts by Office of Head Start demonstrate large-scale rater training possible.
► Rater characteristics associated with calibration are explored.

Section snippets

Current context for classroom observation

Information on best practices for observational assessment is much needed as observation is increasingly being used to describe and evaluate teacher performance and classroom and program quality in early childhood contexts and K-12 settings. Initially, these measures were used as a part of research on quality and effectiveness. For example, observation measures were included in several large-scale research studies of early childhood education settings, including the National Institute of Child

Challenges to coordinating observational assessment

Two significant concerns for coordinators of large-scale observational assessments include the feasibility of training large numbers of staff in a timely and effective manner and hiring staff who are capable of observing in objective ways. Despite many large-scale efforts to observe classrooms, limited data have been collected which can inform responses to these challenges. Each concern is described in more detail below.

When observation occurs at scale

The challenges described above are magnified when observational assessment is used in large-scale contexts. Implementation becomes a major challenge, as there are typically constraints placed on time and resources such that project coordinators want raters to be trained quickly and cheaply, yet effectively. When raters are trained for a single research project in an academic institution, having 20% of raters fail an initial calibration assessment translates into additional support required for

The current study

The current project examines the feasibility of training a large group of raters and the characteristics of raters associated with calibration by capitalizing on data collected through the Office of Head Start (OHS) in 2008–2009. The Improving Head Start for School Readiness Act of 2007 required OHS to include a valid and reliable observational tool for assessing program quality (U.S. Department of Health and Human Services, Administration for Children and Families, & Office of Head Start, 2008

Participants

All Head Start grantees and delegate agencies were invited by Head Start Regional Offices to send staff to participate in regionally based, three-day CLASS trainings. At least one staff member from every grantee and delegate agency was eligible to participate in these trainings. Additional staff members were eligible based on the number of children served by each program. Head Start programs serving fewer than 500 children were allowed to send one staff member to these trainings. Programs

Describing calibration

Descriptive statistics for the five calibration metrics are presented in Table 1. Because information on calibration was available across all participants, both the full sample (N = 2093) and the survey subsample (N = 704) are listed. The majority of raters passed the calibration assessment according to preset criteria on their first attempt; 71% of raters assigned at least 80% of codes within-1 of master-codes. Adjacent calibration rates varied by dimension (see Table 1), with the poorest
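The pass rule described above (at least 80% of a rater's codes falling within one scale point of the master codes) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' scoring software; the function names, the sample scores, and the assumption of a 1–7 CLASS-style scale are hypothetical, while the within-1 ("adjacent") agreement rule and the 80% threshold come from the text.

```python
def adjacent_agreement(rater_codes, master_codes):
    """Fraction of codes within one scale point of the master codes."""
    hits = sum(abs(r - m) <= 1 for r, m in zip(rater_codes, master_codes))
    return hits / len(master_codes)


def passes_calibration(rater_codes, master_codes, threshold=0.80):
    """Preset pass criterion: at least `threshold` adjacent agreement."""
    return adjacent_agreement(rater_codes, master_codes) >= threshold


# Hypothetical example: ten dimension scores on a 1-7 scale.
master = [5, 4, 6, 3, 5, 4, 6, 2, 5, 4]
rater = [5, 5, 4, 3, 6, 4, 6, 4, 5, 3]

rate = adjacent_agreement(rater, master)   # 0.8: 8 of 10 codes within-1
passed = passes_calibration(rater, master)  # True at the 80% threshold
```

Note that exact agreement would be stricter (here only 5 of 10 codes match the master codes exactly), which is why large-scale calibration criteria of this kind are often stated in terms of adjacent rather than exact agreement.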

Discussion

Observational assessment can be used to identify teacher practices that are associated with children's academic, social, and behavioral outcomes (Burchinal et al., 2008, Burchinal et al., 2010, Curby et al., 2009, Howes et al., 2008, Mashburn et al., 2008, Rimm-Kaufman et al., 2009) and to provide material for feedback and professional development (Dickinson and Caswell, 2007, Pianta and Allen, 2008, Pianta and Hamre, 2009). For these reasons, observation tools are now being incorporated into

Conclusion

This study has addressed issues relevant to rater calibration in large-scale contexts in a number of ways. First, we have learned that a scaled-up "train-the-trainer" approach to calibrating raters can work: most, but not all, raters calibrate after just two days of training. Second, rater beliefs about teaching and children predict the degree of calibration after initial training.

It is difficult to assess the degree to which the findings can be generalized to large-scale

References (59)

  • J. Cohen, Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit, Psychological Bulletin (1968)
  • G. Colvin et al., Using observational data to provide performance feedback to teachers: A high school case study, Preventing School Failure (2009)
  • T.W. Curby et al., The relations of observed pre-K classroom quality profiles to children's achievement and social competence, Early Education & Development (2009)
  • J.T. Downer et al., Beliefs about intentional instruction [Survey] (2010)
  • J.L. Fleiss et al., The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability, Educational and Psychological Measurement (1973)
  • Frank Porter Graham Child Development Institute, Levels of training on the environment rating scales (2009)
  • J.W. Graham, Missing data analysis: Making it work in the real world, Annual Review of Psychology (2009)
  • T. Halle et al., Quality in early childhood care and education settings: A compendium of measures (2007)
  • B.K. Hamre et al., Can instructional and emotional support in the first-grade classroom make a difference for children at risk of school failure?, Child Development (2005)
  • B.K. Hamre et al., Building a science of classrooms: Application of the CLASS framework in over 4,000 US early childhood and elementary classrooms (2008)
  • T. Harms et al., The Early Childhood Environment Rating Scale (1980)
  • T. Harms et al., The Early Childhood Environment Rating Scale (1998)
  • J.M. Hintze, Psychometrics of direct observation, School Psychology Review (2005)
  • R.L. Johnson et al., Assessing performance: Designing, scoring, and validating performance tasks (2008)
  • J.R. Landis et al., The measurement of observer agreement for categorical data, Biometrics (1977)
  • LoCasale-Crouch, J., Downer, J. T., & Hamre, B.K. (2011). Assessing teacher beliefs about intentional instruction....
  • A.J. Mashburn et al., Measures of classroom quality in prekindergarten and children's development of academic, language, and social skills, Child Development (2008)
  • K.W. Merrell, Behavioral, social, and emotional assessment of children and adolescents (1999)
  • MET Project, Working with teachers to develop fair and reliable measures of effective teaching (2010)