Abstract
The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints, including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs from data, that is, DCs that are "almost" satisfied. Approximation allows us to discover more accurate constraints in inconsistent databases and to detect rules that are generally correct but may have a few exceptions. It also helps avoid overfitting and yields constraints that are more general, more natural, and less contrived. We introduce the algorithm ADCMiner for mining approximate DCs. An important feature of this algorithm is that it does not assume any specific approximation function for DCs, but rather allows for arbitrary approximation functions that satisfy some natural axioms that we define in the paper. We also show how our algorithm can be combined with sampling to return highly accurate results considerably faster.
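To make the notion of an "almost" satisfied DC concrete, the following is a minimal sketch and not the ADCMiner algorithm. It assumes one possible approximation function, the fraction of ordered tuple pairs that violate the constraint, which is only one of many functions the abstract's framework could admit. The toy relation, the example DC, and the names `violates`, `violation_rate`, and `approx_satisfied` are hypothetical and chosen for illustration.

```python
from itertools import permutations

# Hypothetical toy relation (assumption, not data from the paper).
rows = [
    {"state": "NY", "salary": 50_000, "tax": 0.10},
    {"state": "NY", "salary": 60_000, "tax": 0.12},
    {"state": "NY", "salary": 70_000, "tax": 0.11},  # the one exception
    {"state": "CA", "salary": 55_000, "tax": 0.09},
]

def violates(t, s):
    """Example DC: no two tuples in the same state where the one with the
    higher salary has the lower tax rate. Returns True if (t, s) breaks it."""
    return (
        t["state"] == s["state"]
        and t["salary"] > s["salary"]
        and t["tax"] < s["tax"]
    )

def violation_rate(rows, violates):
    """Assumed approximation function: share of ordered pairs of distinct
    tuples that violate the DC."""
    pairs = list(permutations(rows, 2))
    bad = sum(1 for t, s in pairs if violates(t, s))
    return bad / len(pairs)

def approx_satisfied(rows, violates, epsilon):
    """The DC is approximately satisfied if its violation rate is at most epsilon."""
    return violation_rate(rows, violates) <= epsilon

rate = violation_rate(rows, violates)
print(f"violation rate: {rate:.4f}")
print("exact DC holds:", approx_satisfied(rows, violates, 0.0))
print("approximate DC holds (epsilon=0.1):", approx_satisfied(rows, violates, 0.1))
```

In this toy relation a single pair of tuples breaks the rule, so an exact miner would reject the DC while a miner with a tolerance of 0.1 would keep it. This is the kind of generally correct rule with a few exceptions that the abstract describes.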