Efficient Approximate Entity Matching Using Jaro-Winkler Distance

Wang, Yaoshu; Qin, Jianbin; Wang, Wei

doi:10.1007/978-3-319-68783-4_16

Yaoshu Wang²⁴,
Jianbin Qin²⁴ &
Wei Wang²⁴

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10569))

Included in the following conference series:

International Conference on Web Information Systems Engineering

1741 Accesses
16 Citations

Abstract

Jaro-Winkler distance is a measurement to measure the similarity between two strings. Since Jaro-Winkler distance performs well in matching personal and entity names, it is widely used in the areas of record linkage, entity linking, information extraction. Given a query string q, Jaro-Winkler distance similarity search finds all strings in a dataset D whose Jaro-Winkler distance similarity with q is no more than a given threshold \(\tau \). With the growth of the dataset size, to efficiently perform Jaro-Winkler distance similarity search becomes challenge problem. In this paper, we propose an index-based method that relies on a filter-and-verify framework to support efficient Jaro-Winkler distance similarity search on a large dataset. We leverage e-variants methods to build the index structure and pigeonhole principle to perform the search. The experiment results clearly demonstrate the efficiency of our methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Notes

1.
https://github.com/AKSW/LIMES-dev.

References

Agirre, E., Barrena, A., Soroa. A.: Studying the Wikipedia hyperlink graph for relatedness and disambiguation. CoRR, abs/1503.01655 (2015)
Google Scholar
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Article Google Scholar
Christen, P.: Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In: SIGKDD, KDD 2008, New York, NY, USA, pp. 1065–1068. ACM (2008)
Google Scholar
Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), 9–10 August 2003, Acapulco, Mexico, pp. 73–78 (2003)
Google Scholar
Dreßler, K., Ngomo, A.N.: On the efficient execution of bounded jaro-winkler distances. Semant. Web 8(2), 185–196 (2017)
Article Google Scholar
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Article Google Scholar
Feng, J., Wang, J., Li, G.: Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21(4), 437–461 (2012)
Article Google Scholar
Galárraga, L., Heitz, G., Murphy, K., Suchanek, F.M.: Canonicalizing open knowledge bases. In: CIKM ’2014, New York, NY, USA, pp. 1679–1688. ACM (2014)
Google Scholar
Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: EMNLP 2011, Stroudsburg, PA, USA, pp. 782–792. Association for Computational Linguistics (2011)
Google Scholar
Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. Proc. VLDB Endow. 5(3), 253–264 (2011)
Article Google Scholar
Liu, Y., Shen, W., Yuan, X.: Deola: a system for linking author entities in web document with DBLP. In: CIKM (2016)
Google Scholar
Prokoshyna, N., Szlichta, J., Chiang, F., Miller, R.J., Srivastava, D.: Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9(4), 300–311 (2015)
Article Google Scholar
Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD 2011, New York, NY, USA, pp. 1033–1044. ACM (2011)
Google Scholar
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: KDD 2008, pp. 990–998 (2008)
Google Scholar
Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: SIGMOD 2009, New York, NY, USA, pp. 759–770. ACM (2009)
Google Scholar
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1(1), 933–944 (2008)
Article MathSciNet Google Scholar
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)
Article Google Scholar

Download references

Acknowledgement

This work was supported by ARC DP DP130103401 and DP170103710.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Univeristy of New South Wales, Sydney, Australia
Yaoshu Wang, Jianbin Qin & Wei Wang

Authors

Yaoshu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jianbin Qin
View author publications
You can also search for this author in PubMed Google Scholar
Wei Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yaoshu Wang .

Editor information

Editors and Affiliations

University of Sydney, Darlington, NSW, Australia
Athman Bouguettaya
Zhejiang University, Hangzhou, China
Yunjun Gao
Institute of Computing for Physics and Technology, Protvino, Russia
Andrey Klimenko
Nanyang Technological University, Singapore, Singapore
Lu Chen
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Xiangliang Zhang
Institute of Computing for Physics and Technology, Protvino, Russia
Fedor Dzerzhinskiy
Shanghai Jiao Tong University, Minhang Qu, China
Weijia Jia
Institute of Computing for Physics and Technology, Protvino, Russia
Stanislav V. Klimenko
City University of Hong Kong, Kowloon, Hong Kong
Qing Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, Y., Qin, J., Wang, W. (2017). Efficient Approximate Entity Matching Using Jaro-Winkler Distance. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. WISE 2017. Lecture Notes in Computer Science(), vol 10569. Springer, Cham. https://doi.org/10.1007/978-3-319-68783-4_16

Download citation

DOI: https://doi.org/10.1007/978-3-319-68783-4_16
Published: 04 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68782-7
Online ISBN: 978-3-319-68783-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics