Abstract
An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID for identifying entities in their respective domains. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated, while they are missing for the rest. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the descriptions that contain identifiers as training data. The task can be approached by learning a binary classifier that distinguishes pairs of entity descriptions referring to the same real-world entity from pairs referring to different entities. It can also be modeled as a multi-class classification problem by learning classifiers that identify descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, provided that enough training data is available for both objectives. To gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products.
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes than RNN-based models.
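The dual-objective setup described above can be sketched in PyTorch. The snippet below is a minimal illustration, not the authors' implementation: a tiny bag-of-embeddings encoder stands in for BERT so the example is self-contained, and the head names, the loss-weighting parameter `alpha`, and all dimensions are assumptions chosen for illustration. What it shows is the structural idea: a shared encoder feeds both a binary match/non-match head over the pair and a multi-class head that predicts the entity identifier of each description, and both losses update the shared parameters.

```python
import torch
import torch.nn as nn

class JointMatcher(nn.Module):
    """Sketch of a dual-objective matcher: a shared encoder feeds
    (1) a binary match head over the pair and (2) a multi-class head
    predicting the entity identifier of each description. The
    EmbeddingBag encoder is a toy stand-in for BERT."""

    def __init__(self, vocab_size: int, hidden: int, num_entities: int):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)      # stand-in encoder
        self.match_head = nn.Linear(2 * hidden, 2)            # match / non-match
        self.entity_head = nn.Linear(hidden, num_entities)    # entity-id classes

    def forward(self, left_tokens, right_tokens):
        h_left = self.embed(left_tokens)                      # (batch, hidden)
        h_right = self.embed(right_tokens)
        match_logits = self.match_head(torch.cat([h_left, h_right], dim=-1))
        return match_logits, self.entity_head(h_left), self.entity_head(h_right)

def joint_loss(model, left, right, match_labels, left_ids, right_ids, alpha=0.5):
    """Combined objective: binary matching loss plus entity-id
    classification loss for both descriptions in the pair."""
    ce = nn.CrossEntropyLoss()
    m_logits, e_left, e_right = model(left, right)
    return alpha * ce(m_logits, match_labels) + (1 - alpha) * (
        ce(e_left, left_ids) + ce(e_right, right_ids)
    )

# Toy batch: 4 description pairs, token ids from a vocab of 100, 10 known entities.
torch.manual_seed(0)
model = JointMatcher(vocab_size=100, hidden=16, num_entities=10)
left = torch.randint(0, 100, (4, 6))
right = torch.randint(0, 100, (4, 6))
loss = joint_loss(model, left, right,
                  match_labels=torch.tensor([1, 0, 1, 0]),
                  left_ids=torch.tensor([3, 1, 7, 2]),
                  right_ids=torch.tensor([3, 5, 7, 9]))
loss.backward()  # both objectives contribute gradients to the shared encoder
```

Because the entity-id head can only be trained for entities that appear in the training data, this structure also suggests why the approach helps for seen products but not for unseen ones.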
Index Terms
- Dual-objective fine-tuning of BERT for entity matching