ABSTRACT
Classical approaches to data cleaning have relied on integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and of crowdsourcing marketplaces provides yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
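To make the workflow in the abstract concrete, the following is a minimal toy sketch, not KATARA's actual algorithm. It assumes a KB stored as (subject, predicate, object) triples and a two-column table already aligned to a single KB relation. The names `KB`, `annotate`, and `top_k_repairs`, the `hasCapital` predicate, and the cost function (number of cells changed) are all hypothetical choices made for illustration, and the crowd step is omitted entirely.

```python
# Hypothetical toy KB of (subject, predicate, object) triples.
KB = {
    ("Italy", "hasCapital", "Rome"),
    ("Spain", "hasCapital", "Madrid"),
    ("France", "hasCapital", "Paris"),
}


def annotate(row, predicate):
    """Return True if the row (subject, object) is supported by the KB."""
    return (row[0], predicate, row[1]) in KB


def top_k_repairs(row, predicate, k):
    """Rank KB facts for the predicate by how many cells of the row
    would have to change, and return the k cheapest (subject, object) pairs.
    This cost model is an assumption, not taken from the paper."""
    candidates = []
    for s, p, o in KB:
        if p != predicate:
            continue
        # Each mismatching cell contributes 1 to the repair cost.
        cost = (s != row[0]) + (o != row[1])
        candidates.append((cost, (s, o)))
    # Sort by cost, breaking ties alphabetically for determinism.
    candidates.sort()
    return [pair for _, pair in candidates[:k]]


table = [("Italy", "Rome"), ("Italy", "Madrid")]
for row in table:
    if annotate(row, "hasCapital"):
        print(row, "validated by KB")
    else:
        print(row, "possible repairs:", top_k_repairs(row, "hasCapital", 2))
```

In this sketch the second row is flagged because the KB has no matching fact, and the two cheapest repairs each change one cell. In a full system, ambiguous cases like this one are where the crowd would presumably be consulted.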