A Framework to Evaluate the Quality of Integrated Datasets

Published: 10 February 2023

Abstract

Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, that quantifies how representative one dataset is of another, giving an indication of how good the integration process is. The paper motivates and introduces the measure and provides extensive experimental evaluations that show the effectiveness and efficiency of the approach.

Published in

ACM SIGAPP Applied Computing Review, Volume 22, Issue 4
December 2022
42 pages

Copyright © 2023. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States
