DOI: 10.1145/2723372.2749431
research-article

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Published: 27 May 2015

ABSTRACT

Classical approaches to data cleaning have relied on integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and of crowdsourcing marketplaces provides yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
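The three steps named in the abstract (align the table with the KB, mark data as validated or not, and propose top-k repairs) can be illustrated with a toy sketch. This is not KATARA's algorithm: the triple-set KB, the two-column table, the function names, and the "number of changed cells" repair cost are all assumptions made for illustration, and the crowd step that would resolve ambiguous cases is omitted entirely.

```python
from collections import Counter

# Hypothetical toy KB of (subject, predicate, object) triples.
KB = {
    ("Italy", "hasCapital", "Rome"),
    ("Spain", "hasCapital", "Madrid"),
    ("France", "hasCapital", "Paris"),
    ("Italy", "hasLargestCity", "Rome"),
}

# Hypothetical two-column table; the last row contains an error.
ROWS = [("Italy", "Rome"), ("Spain", "Madrid"), ("France", "Lyon")]


def best_predicate(rows, kb):
    """Pick the predicate linking column 0 to column 1 for the most distinct rows."""
    pairs = set(rows)
    counts = Counter(p for (s, p, o) in kb if (s, o) in pairs)
    return counts.most_common(1)[0][0] if counts else None


def annotate(rows, kb, predicate):
    """Pair each row with True if the KB contains the matching triple."""
    return [(row, (row[0], predicate, row[1]) in kb) for row in rows]


def top_k_repairs(row, kb, predicate, k=2):
    """Rank KB facts for the predicate by how many cells of the row must change."""
    candidates = []
    for s, p, o in kb:
        if p != predicate:
            continue
        cost = (s != row[0]) + (o != row[1])  # assumed cost: changed cells
        candidates.append((cost, (s, o)))
    candidates.sort()  # cheapest first; ties broken alphabetically
    return candidates[:k]


predicate = best_predicate(ROWS, KB)
for row, validated in annotate(ROWS, KB, predicate):
    if not validated:
        print(row, "->", top_k_repairs(row, KB, predicate))
```

In this sketch a row the KB does not confirm is simply treated as a repair candidate. A real system also has to handle the case where the KB is incomplete rather than the data wrong, which is where the abstract's crowd component would come in.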


Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
