Rater calibration when observational assessment occurs at large scale: Degree of calibration and characteristics of raters associated with calibration
Highlights
► Observation is being used to assess program and teacher quality at large scale.
► Data are needed on the feasibility of training large numbers of observers.
► Efforts by the Office of Head Start demonstrate large-scale rater training is possible.
► Rater characteristics associated with calibration are explored.
Section snippets
Current context for classroom observation
Information on best practices for observational assessment is much needed, as observation is increasingly being used to describe and evaluate teacher performance and classroom and program quality in early childhood contexts and K-12 settings. Initially, these measures were used as part of research on quality and effectiveness. For example, observation measures were included in several large-scale research studies of early childhood education settings, including the National Institute of Child …
Challenges to coordinating observational assessment
Two significant concerns for coordinators of large-scale observational assessments are the feasibility of training large numbers of staff in a timely and effective manner and hiring staff who are capable of observing objectively. Despite many large-scale efforts to observe classrooms, limited data have been collected that can inform responses to these challenges. Each concern is described in more detail below.
When observation occurs at scale
The challenges described above are magnified when observational assessment is used in large-scale contexts. Implementation becomes a major challenge: time and resource constraints typically mean that project coordinators want raters trained quickly and cheaply, yet effectively. When raters are trained for a single research project in an academic institution, having 20% of raters fail an initial calibration assessment translates into additional support required for …
The current study
The current project examines the feasibility of training a large group of raters, and the characteristics of raters associated with calibration, by capitalizing on data collected through the Office of Head Start (OHS) in 2008–2009. The Improving Head Start for School Readiness Act of 2007 required OHS to include a valid and reliable observational tool for assessing program quality (U.S. Department of Health and Human Services, Administration for Children and Families, & Office of Head Start, 2008 …
Participants
All Head Start grantees and delegate agencies were invited by Head Start Regional Offices to send staff to participate in regionally based, three-day CLASS trainings. At least one staff member from every grantee and delegate agency was eligible to participate in these trainings, with additional staff members eligible based on the number of children served by each program. Head Start programs serving fewer than 500 children were allowed to send one staff member to these trainings. Programs …
Describing calibration
Descriptive statistics for the five calibration metrics are presented in Table 1. Because information on calibration was available for all participants, both the full sample (N = 2093) and the survey subsample (N = 704) are listed. The majority of raters passed the calibration assessment according to preset criteria on their first attempt; 71% of raters assigned at least 80% of codes within one point of the master codes. Adjacent calibration rates varied by dimension (see Table 1), with the poorest …
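The pass criterion reported above (at least 80% of codes falling within one scale point of the master codes) can be sketched as a simple check. The function names and sample scores below are illustrative only, not part of the OHS protocol, and assume 7-point CLASS-style ratings:

```python
def within_one_agreement(rater_codes, master_codes):
    """Proportion of a rater's codes falling within one scale point
    of the corresponding master codes."""
    if len(rater_codes) != len(master_codes):
        raise ValueError("code lists must be the same length")
    hits = sum(abs(r - m) <= 1 for r, m in zip(rater_codes, master_codes))
    return hits / len(rater_codes)


def passes_calibration(rater_codes, master_codes, threshold=0.80):
    """Apply the pass criterion described in the text: at least 80%
    of codes within one point of the master codes."""
    return within_one_agreement(rater_codes, master_codes) >= threshold


# Hypothetical rater scored against master codes on five segments:
rater = [5, 4, 6, 3, 5]
master = [5, 6, 5, 3, 4]
print(within_one_agreement(rater, master))  # 0.8
print(passes_calibration(rater, master))    # True
```

In practice the study also reports dimension-level adjacent agreement and other metrics (e.g., weighted kappa), which weight larger disagreements more heavily than this simple within-one count.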
Discussion
Observational assessment can be used to identify teacher practices that are associated with children's academic, social, and behavioral outcomes (Burchinal et al., 2008; Burchinal et al., 2010; Curby et al., 2009; Howes et al., 2008; Mashburn et al., 2008; Rimm-Kaufman et al., 2009) and to provide material for feedback and professional development (Dickinson & Caswell, 2007; Pianta & Allen, 2008; Pianta & Hamre, 2009). For these reasons, observation tools are now being incorporated into …
Conclusion
This study has addressed issues relevant to rater calibration in large-scale contexts in a number of ways. First, we have learned that a scaled-up "train-the-trainer" approach to calibrating raters can work: most, though not all, raters calibrate after just two days of training. Second, rater beliefs about teaching and children predict the degree of calibration after initial training.
It is difficult to assess the degree to which the findings can be generalized to large-scale …
References (59)
Caregivers in day-care centers: Does training matter? Journal of Applied Developmental Psychology (1989)
et al. Threshold analysis of association between child care quality and child outcomes for low-income children in pre-kindergarten programs. Early Childhood Research Quarterly (2010)
et al. Building support for language and early literacy in preschool classrooms through in-service professional development: Effects of the Literacy Environment Enrichment Program (LEEP). Early Childhood Research Quarterly (2007)
et al. Ready to learn? Children's pre-academic achievement in pre-Kindergarten programs. Early Childhood Research Quarterly (2008)
et al. Observed classroom quality profiles in state-funded pre-kindergarten programs and associations with teacher, program, and classroom characteristics. Early Childhood Research Quarterly (2007)
Reading high stakes writing samples: My life as a reader. Assessing Writing (2003)
et al. Standards for educational and psychological testing (1999)
et al. The state of preschool 2008: State preschool yearbook (2008)
et al. Predicting child outcomes at the end of kindergarten from the quality of pre-kindergarten teacher–child interactions and instruction. Applied Developmental Science (2008)
et al. What is pre-kindergarten? Characteristics of public pre-kindergarten programs. Applied Developmental Science (2005)
Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin
Using observational data to provide performance feedback to teachers: A high school case study. Preventing School Failure
The relations of observed pre-K classroom quality profiles to children's achievement and social competence. Early Education & Development
Beliefs about intentional instruction [Survey]
The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement
Levels of training on the environment rating scales
Missing data analysis: Making it work in the real world. Annual Review of Psychology
Quality in early childhood care and education settings: A compendium of measures
Can instructional and emotional support in the first-grade classroom make a difference for children at risk of school failure? Child Development
Building a science of classrooms: Application of the CLASS framework in over 4,000 US early childhood and elementary classrooms
The Early Childhood Environment Rating Scale
Psychometrics of direct observation. School Psychology Review
Assessing performance: Designing, scoring, and validating performance tasks
The measurement of observer agreement for categorical data. Biometrics
Measures of classroom quality in prekindergarten and children's development of academic, language, and social skills. Child Development
Behavioral, social, and emotional assessment of children and adolescents
Working with teachers to develop fair and reliable measures of effective teaching