Abstract
Awareness of data and information quality issues has grown rapidly in light of the critical role played by the quality of information in our data-intensive, knowledge-based economy. Research in the past two decades has produced a large body of data quality knowledge and has expanded our ability to solve many data and information quality problems. In this article, we present an overview of the evolution and current landscape of data and information quality research. We introduce a framework to characterize the research along two dimensions: topics and methods. Representative papers are cited for purposes of illustrating the issues addressed and the methods used. We also identify and discuss challenges to be addressed in future research.
Index Terms
- Overview and Framework for Data and Information Quality Research