
Dual-objective fine-tuning of BERT for entity matching

Published: 01 June 2021

Abstract

An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in their respective domains. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated, while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions referring to the same real-world entity from pairs referring to different entities. The task can also be modeled as a multi-class classification problem by learning classifiers that identify descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. To gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
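To make the dual-objective setup concrete, below is a minimal sketch in PyTorch with the Hugging Face transformers library. It assumes a shared BERT encoder with a binary matching head applied to the pair encoding and a multi-class identifier head applied to each description separately, trained with a weighted sum of the two losses. The class and parameter names (JointBertModel, num_entities, alpha), the separate-encoding choice, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of dual-objective (binary match + entity-identifier) fine-tuning.
# Names and architecture details are assumptions based on the abstract.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class JointBertModel(nn.Module):
    """BERT encoder with two heads: a binary match/non-match head for the
    description pair and a multi-class head predicting the entity identifier
    of each individual description."""

    def __init__(self, num_entities: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.match_head = nn.Linear(hidden, 2)              # pair-level match decision
        self.entity_head = nn.Linear(hidden, num_entities)  # per-description identifier

    def forward(self, pair_inputs, left_inputs, right_inputs):
        # [CLS] representations of "left [SEP] right" and of each description alone
        pair_cls = self.encoder(**pair_inputs).last_hidden_state[:, 0]
        left_cls = self.encoder(**left_inputs).last_hidden_state[:, 0]
        right_cls = self.encoder(**right_inputs).last_hidden_state[:, 0]
        return (self.match_head(pair_cls),
                self.entity_head(left_cls),
                self.entity_head(right_cls))


def joint_loss(match_logits, left_logits, right_logits,
               match_labels, left_ids, right_ids, alpha: float = 0.5):
    """Weighted sum of the binary matching loss and the two identifier losses
    (the weighting scheme is an assumption, not taken from the paper)."""
    ce = nn.CrossEntropyLoss()
    binary = ce(match_logits, match_labels)
    multi = ce(left_logits, left_ids) + ce(right_logits, right_ids)
    return alpha * binary + (1 - alpha) * multi


# Illustrative usage: tokenize the pair jointly and each description separately.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
left_text, right_text = ["dell xps 13 9310"], ["dell xps 13 2020 model"]
pair = tok(left_text, right_text, return_tensors="pt", padding=True)
left = tok(left_text, return_tensors="pt", padding=True)
right = tok(right_text, return_tensors="pt", padding=True)
model = JointBertModel(num_entities=1000)
match_logits, left_logits, right_logits = model(pair, left, right)
```

Note that the identifier head can only predict entities that appear with identifiers in the training data, so on unseen products only the binary matching head yields a usable decision; this is consistent with the abstract's observation that the method requires enough training data for both objectives and underperforms for unseen products.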



Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 10 (June 2021), 219 pages
ISSN: 2150-8097
Publisher: VLDB Endowment