DOI: 10.1145/1376916.1376940
tutorial

Dependencies revisited for improving data quality

Published: 09 June 2008

ABSTRACT

Dependency theory is almost as old as relational databases themselves, and has traditionally been used to improve the quality of schema, among other things. Recently there has been renewed interest in dependencies for improving the quality of data. The increasing demand for data quality technology has also motivated revisions of classical dependencies, to capture more inconsistencies in real-life data, and to match, repair and query the inconsistent data. This paper aims to provide an overview of recent advances in revising classical dependencies for improving data quality.
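
As a minimal illustration of how a dependency can flag inconsistent data, the sketch below checks a single classical functional dependency, zip → city, against a small table. The records, the attribute names, and the helper `fd_violations` are hypothetical and are not taken from the paper; the revised dependencies surveyed in the tutorial are more expressive than this plain check.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    `rows` is a list of dicts; `lhs` and `rhs` are lists of attribute names.
    A non-empty result means the functional dependency lhs -> rhs is violated.
    """
    # Group rows by their values on the left-hand-side attributes.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)

    # A group violates the dependency if it holds more than one
    # distinct combination of right-hand-side values.
    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(m[a] for a in rhs) for m in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations


# Hypothetical customer records: Bob and Cat share a zip code
# but are recorded with different cities.
customers = [
    {"name": "Ann", "zip": "07974", "city": "Murray Hill"},
    {"name": "Bob", "zip": "EH8 9AB", "city": "Edinburgh"},
    {"name": "Cat", "zip": "EH8 9AB", "city": "London"},
]

conflicts = fd_violations(customers, lhs=["zip"], rhs=["city"])
for key, members in conflicts.items():
    print(key, [m["name"] for m in members])
```

The check only detects that the two records conflict; it does not say which city is correct. Deciding how to repair the data is a separate step.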


      Reviews

      David Gary Hill

      Poor data quality, especially in relational databases (such as incorrectly priced products in retail databases), has proven to be very expensive. Helping to improve the quality of data through semi-automated software processes is, therefore, a worthwhile goal. This paper explores the reasons for using dependency theory to aid in the goal of improving data quality. The treatment of the subject is a formal mathematical analysis, so the readers most likely to benefit from the paper are researchers who focus on data quality and software specialists looking for insight into how to code algorithms that can assist in improving data quality. The paper provides an overview of recent advances in using classical dependency theory, a theory that has been around nearly as long as relational databases, for improving data quality. The first task is to capture as many inconsistencies in real-life data as possible. The second task is to deal with those errors by matching, repairing, and querying inconsistent data. The repair process cannot necessarily resolve all errors. However, a semi-automated repair process may improve the data quality enough to be worthwhile. All in all, Fan's depth of analysis is useful in providing a mathematically rigorous exploration of the value of dependency theory for improving data quality. Furthermore, this paper serves as a starting point for future research.

      Online Computing Reviews Service

      • Published in

        PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2008
        330 pages
        ISBN:9781605581521
        DOI:10.1145/1376916

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States




        Acceptance Rates

        PODS '08 paper acceptance rate: 28 of 159 submissions (18%). Overall acceptance rate: 642 of 2,707 submissions (24%).
