research-article

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Authors:
Xu Chu

University of Waterloo, Waterloo, Canada

University of Waterloo, Waterloo, Canada
View Profile

,
John Morcos

University of Waterloo, Waterloo, Canada

University of Waterloo, Waterloo, Canada
View Profile

,
Ihab F. Ilyas

University of Waterloo, Waterloo, Canada

University of Waterloo, Waterloo, Canada
View Profile

,
Mourad Ouzzani

Qatar Computing Research Institute, Doha, Qatar

Qatar Computing Research Institute, Doha, Qatar
View Profile

,
Paolo Papotti

Qatar Computing Research Institute, Doha, Qatar

Qatar Computing Research Institute, Doha, Qatar
View Profile

,
Nan Tang

Qatar Computing Research Institute, Doha, Qatar

Qatar Computing Research Institute, Doha, Qatar
View Profile

,
Yin Ye

Qatar Computing Research Institute, Doha, Qatar

Qatar Computing Research Institute, Doha, Qatar
View Profile

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataMay 2015Pages 1247–1261https://doi.org/10.1145/2723372.2749431

Published:27 May 2015Publication History

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1247–1261

ABSTRACT

Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases KBs both general-purpose and within enterprises, and crowdsourcing marketplaces are providing yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.

References

S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google ScholarDigital Library
P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005. Google ScholarDigital Library
G. Bouma. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, pages 31--40, 2009.Google Scholar
S. Buchholz and J. Latorre. Crowdsourcing preference tests, and how to detect cheating. 2011.Google ScholarCross Ref
A. Calì, G. Gottlob, and A. Pieris. Advanced processing for ontological queries. PVLDB, 2010. Google ScholarDigital Library
F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 2008. Google ScholarDigital Library
F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, 2011. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 2013. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, 2013.Google ScholarDigital Library
K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Comput. Linguist., 16(1):22--29, Mar. 1990. Google ScholarDigital Library
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007. Google ScholarDigital Library
M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, 2013. Google ScholarDigital Library
D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 2013. Google ScholarDigital Library
O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: a report from the trenches. In SIGMOD Conference, 2013. Google ScholarDigital Library
X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014. Google ScholarDigital Library
X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From data fusion to knowledge fusion. PVLDB, 2014. Google ScholarDigital Library
W. Fan. Dependencies revisited for improving data quality. In PODS, 2008. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012. Google ScholarDigital Library
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC Data-Cleaning Framework. PVLDB, 2013. Google ScholarDigital Library
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell., 194, 2013. Google ScholarDigital Library
I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. VLDB J., 13(3), 2004. Google ScholarDigital Library
I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004. Google ScholarDigital Library
M. Interlandi and N. Tang. Proof positive and negative data cleaning. In ICDE, 2015.Google ScholarCross Ref
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13--24, 2011.Google ScholarCross Ref
R. Lange and X. Lange. Quality control in crowdsourcing: An objective measurement approach to identifying and correcting rater effects in the social evaluation of products and services. In AAAI, 2012.Google Scholar
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 6(2):167--195, 2015.Google ScholarCross Ref
G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. PVLDB, 3(1), 2010. Google ScholarDigital Library
C. D. Manning, P. Raghavan, and H. Schütze. Scoring, term weighting and the vector space model. Introduction to Information Retrieval, 100, 2008.Google ScholarCross Ref
C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, 2010. Google ScholarDigital Library
M. Morsey, J. Lehmann, S. Auer, and A. N. Ngomo. Dbpedia SPARQL benchmark - performance assessment with real queries on real data. In ISWC, 2011. Google ScholarDigital Library
J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In COLING, 2010. Google ScholarDigital Library
A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10, 2008. Google ScholarDigital Library
L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In SIGMOD, 2012. Google ScholarDigital Library
V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, 2001. Google ScholarDigital Library
S. Song, H. Cheng, J. X. Yu, and L. Chen. Repairing vertex labels under neighborhood constraints. PVLDB, 7(11), 2014. Google ScholarDigital Library
F. M. Suchanek, S. Abiteboul, and P. Senellart. Paris: Probabilistic alignment of relations, instances, and schema. PVLDB, 2011. Google ScholarDigital Library
B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, 2013.Google ScholarDigital Library
P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. PVLDB, 2011. Google ScholarDigital Library
M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, 2014.Google ScholarCross Ref
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 2012. Google ScholarDigital Library
J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, 2014. Google ScholarDigital Library
M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, 2013. Google ScholarDigital Library
M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011. Google ScholarDigital Library
C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 6, 2013. Google ScholarDigital Library

Index Terms

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing
1. Information systems
  1. Information systems applications

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse
DESE '11: Proceedings of the 2011 Developments in E-systems Engineering

Data quality is a critical factor for the success of data warehousing projects. Improving the quality of data is important in data warehouse, because it is used in the process of decision support, which requires accurate data. There are many errors and ...
Read More
KATARA: reliable data cleaning with knowledge bases and crowdsourcing
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even in the presence of external sources, mainly knowledge bases and ...
Read More
A Comparative Study of Data Cleaning Tools

In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
crowdsourcing
data cleaning
data quality
knowledge base
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 201
  Total Citations
  View Citations
- 1,647
  Total Downloads
- Downloads (Last 12 months)147
- Downloads (Last 6 weeks)26
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse

KATARA: reliable data cleaning with knowledge bases and crowdsourcing

A Comparative Study of Data Cleaning Tools