Abstract
“Validity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (Messick, 1994, p. 2).
Endnotes
This work draws in part on the authors’ work on the National Research Council’s Committee on the Foundations of Assessment. The first author received support under the Educational Research and Development Centers Program, PR/Award Number R305B60002, as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The second author received support from the National Science Foundation under grant No. ESI-9910154. The findings and opinions expressed in this report do not reflect the positions or policies of the National Research Council, the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, the National Science Foundation, or the U.S. Department of Education.
References
Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
Almond, R.G., & Mislevy, R.J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–237.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modeling and intelligent tutoring. Artificial Intelligence, 42, 7–49.
Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5). Retrieved from http://epaa.asu.edu/epaa/v9n5.html.
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Brennan, R.L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.
Brennan, R.L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Bryk, A.S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Dayton, C.M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage.
Dibello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood based classification techniques. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Embretson, S.E. (1998). A cognitive design systems approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543–553.
Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269–294.
Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1–23.
Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
Gelman, A., Carlin, J., Stern, H., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman & Hall.
Greeno, J.G., Collins, A.M., & Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner, & R.C. Calfee (Eds.), Handbook of educational psychology (pp. 15–146). New York: Macmillan.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale, NJ: Lawrence Erlbaum.
Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement test items. Journal of Educational Measurement, 26, 301–321.
Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 147–200). Phoenix, AZ: American Council on Education/Oryx Press.
Hambleton, R.K., & Slater, S.C. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10, 19–39.
Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer, & H.I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Jöreskog, K.G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.
Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151–160.
Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4), 21–27, 31.
Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, & J.A. Clausen (Eds.), Measurement and prediction (pp. 362–412). Princeton, NJ: Princeton University Press.
Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42–56.
Linacre, J.M. (1989). Many-facet Rasch measurement. Doctoral dissertation, University of Chicago.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Martin, J.D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141–165). Hillsdale, NJ: Erlbaum.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13–103). New York: American Council on Education/Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era. NAEP Report 83-1. Princeton, NJ: National Assessment of Educational Progress.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the roles of task model variables in assessment design. In S. Irvine, & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points for improving educational assessment. In B. Means, & G. Haertel (Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335–374.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessment. Applied Measurement in Education.
Myford, C.M., & Mislevy, R.J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning. Bransford, J.D., Brown, A.L., & Cocking, R.R. (Eds.). Washington, DC: National Academy Press.
National Research Council (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). Washington, DC: National Academy Press.
O’Neil, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with Differential Item Functioning. In P.W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Erlbaum.
Patz, R.J., & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 221–262). New York: American Council on Education/Macmillan.
Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105, 58–82.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).
Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412.
Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157–252.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34, (No. 4, Part 2).
Samejima, F. (1973). Homogeneous case of the continuous response level. Psychometrika, 38, 203–219.
Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
SEPUP (1995). Issues, evidence, and you: Teacher’s guide. Berkeley: Lawrence Hall of Science.
Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271–295.
Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto, (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.
Traub, R.E., & Rowley, G.L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517–545.
van der Linden, W.J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195–202.
van der Linden, W.J., & Hambleton, R.K. (1997). Handbook of modern item response theory. New York: Springer.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195–201.
Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow, & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75–107). Hillsdale, NJ: Erlbaum.
Willingham, W.W., & Cole, N.S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.
Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In G. Grant (Ed.), Review of Research in Education, Vol. 17 (pp. 31–74). Washington, DC: American Educational Research Association.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
© 2003 Kluwer Academic Publishers
Cite this chapter
Mislevy, R.J., Wilson, M.R., Ercikan, K., Chudowsky, N. (2003). Psychometric Principles in Student Assessment. In: Kellaghan, T., Stufflebeam, D.L. (eds) International Handbook of Educational Evaluation. Kluwer International Handbooks of Education, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0309-4_31
DOI: https://doi.org/10.1007/978-94-010-0309-4_31
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-0849-8
Online ISBN: 978-94-010-0309-4