Abstract
Jaro-Winkler distance is a measure of the similarity between two strings. Because it performs well in matching personal and entity names, it is widely used in record linkage, entity linking, and information extraction. Given a query string q, Jaro-Winkler similarity search finds all strings in a dataset D whose Jaro-Winkler distance to q is no more than a given threshold \(\tau\). As datasets grow, performing this search efficiently becomes a challenging problem. In this paper, we propose an index-based method that relies on a filter-and-verify framework to support efficient Jaro-Winkler similarity search on large datasets. We leverage e-variants methods to build the index structure and the pigeonhole principle to perform the search. Experimental results demonstrate the efficiency of our methods.
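To make the search problem concrete, the following is a minimal Python sketch of the standard Jaro-Winkler similarity and a brute-force threshold search. It is a baseline for illustration only, not the index-based method proposed in the paper. The function names, the convention that distance equals 1 minus similarity, and the defaults (prefix scale 0.1, prefix length capped at 4) are assumptions taken from the common textbook definition, not from this abstract.

```python
def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_hit = [False] * ls
    t_hit = [False] * lt
    m = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_m = [c for c, h in zip(s, s_hit) if h]
    t_m = [c for c, h in zip(t, t_hit) if h]
    tr = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (m / ls + m / lt + (m - tr) / m) / 3.0


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


def jw_distance(s, t):
    # Assumption: distance is defined as 1 - similarity.
    return 1.0 - jaro_winkler(s, t)


def brute_force_search(query, dataset, tau):
    """Verify every string in the dataset; an index would filter first."""
    return [s for s in dataset if jw_distance(query, s) <= tau]


if __name__ == "__main__":
    names = ["MARHTA", "JONES"]
    print(brute_force_search("MARTHA", names, 0.1))
```

This baseline computes the distance against every string in D, so its cost grows linearly with the dataset size. A filter-and-verify approach aims to run the expensive verification step only on a small candidate set.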
Acknowledgement
This work was supported by ARC DP DP130103401 and DP170103710.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wang, Y., Qin, J., Wang, W. (2017). Efficient Approximate Entity Matching Using Jaro-Winkler Distance. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. WISE 2017. Lecture Notes in Computer Science, vol 10569. Springer, Cham. https://doi.org/10.1007/978-3-319-68783-4_16
DOI: https://doi.org/10.1007/978-3-319-68783-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68782-7
Online ISBN: 978-3-319-68783-4
eBook Packages: Computer Science, Computer Science (R0)