Overview and Framework for Data and Information Quality Research

Published: 01 June 2009

Abstract

Awareness of data and information quality issues has grown rapidly in light of the critical role played by the quality of information in our data-intensive, knowledge-based economy. Research in the past two decades has produced a large body of data quality knowledge and has expanded our ability to solve many data and information quality problems. In this article, we present an overview of the evolution and current landscape of data and information quality research. We introduce a framework to characterize the research along two dimensions: topics and methods. Representative papers are cited for purposes of illustrating the issues addressed and the methods used. We also identify and discuss challenges to be addressed in future research.

      • Published in

        Journal of Data and Information Quality, Volume 1, Issue 1
        June 2009
        94 pages
        ISSN:1936-1955
        EISSN:1936-1963
        DOI:10.1145/1515693
        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 June 2009
        • Accepted: 1 March 2009
        • Revised: 1 February 2009
        • Received: 1 February 2008

        Qualifiers

        • research-article
        • Research
        • Refereed
