research-article

The Effects and Interactions of Data Quality and Problem Complexity on Classification

Published: 01 February 2011

Abstract

Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.


      • Published in

        Journal of Data and Information Quality, Volume 2, Issue 2
        February 2011
        102 pages
        ISSN: 1936-1955
        EISSN: 1936-1963
        DOI: 10.1145/1891879

        Copyright © 2011 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 February 2011
        • Accepted: 1 November 2010
        • Revised: 1 September 2010
        • Received: 1 December 2008

        Qualifiers

        • research-article
        • Research
        • Refereed
