Abstract
Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, which quantifies how representative one dataset is of another, giving an indication of how good the integration process is. The paper motivates and introduces the measure and provides extensive experimental evaluations that show the effectiveness and the efficiency of the approach.
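The abstract does not define the measure, so the following is only an illustrative sketch of what a word-frequency-based representativeness score could look like, not the measure proposed in the paper. It assumes records are plain strings, tokenizes on word characters, and scores the share of source word occurrences that are also present in the integrated dataset, with counts clipped so repeated words are not over-credited. The function names `word_counts` and `representativeness` are hypothetical.

```python
import re
from collections import Counter


def word_counts(records):
    """Count lowercase word tokens across an iterable of text records."""
    counts = Counter()
    for record in records:
        counts.update(re.findall(r"\w+", record.lower()))
    return counts


def representativeness(source_records, integrated_records):
    """Hypothetical score in [0, 1]: fraction of source word occurrences
    that also appear in the integrated records (clipped word counts).
    This is an assumption for illustration, not the paper's definition."""
    src = word_counts(source_records)
    integ = word_counts(integrated_records)
    total = sum(src.values())
    if total == 0:
        return 0.0
    # A Counter returns 0 for missing words, so absent words add nothing.
    covered = sum(min(count, integ[word]) for word, count in src.items())
    return covered / total


source = ["Apple iPhone 12", "Samsung Galaxy S21"]
integrated = ["apple iphone 12"]
score = representativeness(source, integrated)
```

A score near 1 would suggest that most of the source vocabulary survived integration, while a low score would flag a dataset worth inspecting by hand first.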
Index Terms
- A Framework to Evaluate the Quality of Integrated Datasets