Abstract
Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.
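As a rough illustration of what degrading generated data along two of these dimensions could look like, the sketch below injects accuracy and completeness defects into synthetic records and then measures the resulting quality levels. It is not the procedure used in the study: the record layout, the function names (`generate_records`, `degrade`, `completeness`, `accuracy`), the decision rule for the class label, and the defect rates are all assumptions made for illustration.

```python
import random

KEYS = ("x1", "x2")  # hypothetical attribute names


def generate_records(n, rng):
    """Generate n clean records: two numeric attributes and a class label."""
    records = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        label = int(x1 + x2 > 1.0)  # simple, noise-free rule (an assumption)
        records.append({"x1": x1, "x2": x2, "label": label})
    return records


def degrade(records, error_rate, missing_rate, rng):
    """Return a copy in which attribute values are blanked or corrupted.

    Each attribute value independently becomes missing with probability
    missing_rate, or wrong with probability error_rate.
    """
    degraded = []
    for rec in records:
        new = dict(rec)
        for key in KEYS:
            r = rng.random()
            if r < missing_rate:
                new[key] = None              # completeness defect
            elif r < missing_rate + error_rate:
                new[key] = rec[key] + 1.0    # accuracy defect: shifted value
        degraded.append(new)
    return degraded


def completeness(records):
    """Share of attribute values that are present (not None)."""
    total = len(records) * len(KEYS)
    present = sum(rec[k] is not None for rec in records for k in KEYS)
    return present / total if total else 1.0


def accuracy(clean, degraded):
    """Share of present attribute values that equal the clean value."""
    pairs = [(c[k], d[k]) for c, d in zip(clean, degraded)
             for k in KEYS if d[k] is not None]
    if not pairs:
        return 1.0
    return sum(c == d for c, d in pairs) / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = generate_records(1000, rng)
    dirty = degrade(clean, error_rate=0.1, missing_rate=0.2, rng=rng)
    print("completeness:", round(completeness(dirty), 3))
    print("accuracy:", round(accuracy(clean, dirty), 3))
```

A classifier trained on `dirty` and scored against the clean labels would then give one data point on how a given quality level affects classification outcomes; consistency and timeliness defects would need their own injection rules.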
Index Terms
- The Effects and Interactions of Data Quality and Problem Complexity on Classification