Abstract
“Validity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (Messick, 1994, p. 2).
Endnotes
This work draws in part on the authors’ work on the National Research Council’s Committee on the Foundations of Assessment. The first author received support under the Educational Research and Development Centers Program, PR/Award Number R305B60002, as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The second author received support from the National Science Foundation under grant No. ESI-9910154. The findings and opinions expressed in this report do not reflect the positions or policies of the National Research Council, the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, the National Science Foundation, or the U.S. Department of Education.
References
Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
Almond, R.G., & Mislevy, R.J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–237.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modeling and intelligent tutoring. Artificial Intelligence, 42, 7–49.
Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5). Retrieved from http://epaa.asu.edu/epaa/v9n5.html.
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Brennan, R.L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.
Brennan, R.L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Bryk, A.S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Dayton, C.M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage.
Dibello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood based classification techniques. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Embretson, S.E. (1998). A cognitive design systems approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543–553.
Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269–294.
Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1–23.
Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
Gelman, A., Carlin, J., Stern, H., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman & Hall.
Greeno, J.G., Collins, A.M., & Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner, & R.C. Calfee (Eds.), Handbook of educational psychology (pp. 15–146). New York: Macmillan.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale, NJ: Lawrence Erlbaum.
Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement test items. Journal of Educational Measurement, 26, 301–321.
Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 147–200). Phoenix, AZ: American Council on Education/Oryx Press.
Hambleton, R.K., & Slater, S.C. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10, 19–39.
Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer, & H.I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Jöreskog, K.G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.
Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151–160.
Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4), 21–27, 31.
Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, & J.A. Clausen (Eds.), Measurement and prediction (pp. 362–412). Princeton, NJ: Princeton University Press.
Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42–56.
Linacre, J.M. (1989). Many-facet Rasch measurement. Doctoral dissertation, University of Chicago.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Martin, J.D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141–165). Hillsdale, NJ: Erlbaum.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13–103). New York: American Council on Education/Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era. NAEP Report 83-1. Princeton, NJ: National Assessment of Educational Progress.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the roles of task model variables in assessment design. In S. Irvine, & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points for improving educational assessment. In B. Means, & G. Haertel (Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335–374.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessment. Applied Measurement in Education.
Myford, C.M., & Mislevy, R.J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning. Bransford, J.D., Brown, A.L., & Cocking, R.R. (Eds.). Washington, DC: National Academy Press.
National Research Council (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). Washington, DC: National Academy Press.
O’Neil, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with Differential Item Functioning. In P.W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Erlbaum.
Patz, R.J., & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 221–262). New York: American Council on Education/Macmillan.
Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105, 58–82.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).
Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412.
Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157–252.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34, (No. 4, Part 2).
Samejima, F. (1973). Homogeneous case of the continuous response level. Psychometrika, 38, 203–219.
Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
SEPUP (1995). Issues, evidence, and you: Teacher’s guide. Berkeley: Lawrence Hall of Science.
Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271–295.
Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto, (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.
Traub, R.E., & Rowley, G.L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517–545.
van der Linden, W.J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195–202.
van der Linden, W.J., & Hambleton, R.K. (1997). Handbook of modern item response theory. New York: Springer.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195–201.
Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow, & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75–107). Hillsdale, NJ: Erlbaum.
Willingham, W.W., & Cole, N.S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.
Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In G. Grant (Ed.), Review of Research in Education, Vol. 17 (pp. 31–74). Washington, DC: American Educational Research Association.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
© 2003 Kluwer Academic Publishers
Cite this chapter
Mislevy, R.J., Wilson, M.R., Ercikan, K., Chudowsky, N. (2003). Psychometric Principles in Student Assessment. In: Kellaghan, T., Stufflebeam, D.L. (eds) International Handbook of Educational Evaluation. Kluwer International Handbooks of Education, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0309-4_31
DOI: https://doi.org/10.1007/978-94-010-0309-4_31
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-0849-8
Online ISBN: 978-94-010-0309-4