Efficient Approximate Entity Matching Using Jaro-Winkler Distance

  • Conference paper
Web Information Systems Engineering – WISE 2017 (WISE 2017)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10569))

Abstract

Jaro-Winkler distance is a measure of the similarity between two strings. Because it performs well in matching personal and entity names, it is widely used in record linkage, entity linking, and information extraction. Given a query string q, Jaro-Winkler distance similarity search finds all strings in a dataset D whose Jaro-Winkler distance to q is no more than a given threshold \(\tau\). As datasets grow, performing this search efficiently becomes a challenging problem. In this paper, we propose an index-based method that relies on a filter-and-verify framework to support efficient Jaro-Winkler distance similarity search on a large dataset. We leverage e-variants methods to build the index structure and the pigeonhole principle to perform the search. The experimental results clearly demonstrate the efficiency of our methods.
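
To make the problem statement concrete, the following is a minimal sketch of the standard Jaro-Winkler similarity together with a brute-force linear scan over the dataset. It is not the index-based method proposed in the paper; it is only the naive baseline that such an index aims to beat. Several details are assumptions that the abstract does not state: the distance is taken to be 1 minus the similarity, the prefix scaling factor is Winkler's usual 0.1 with a prefix cap of 4 characters, and the function names are hypothetical.

```python
def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Two equal characters match if they are at most `window` positions apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as
    # half a transposition each.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3.0


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix.

    p = 0.1 and max_prefix = 4 are the conventional defaults (assumed here).
    """
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


def linear_scan_search(query: str, dataset: list[str], tau: float) -> list[str]:
    """Baseline: verify every string, with no filtering step.

    Assumes distance = 1 - similarity and keeps strings with distance <= tau.
    """
    return [s for s in dataset if 1.0 - jaro_winkler(query, s) <= tau]
```

For example, `jaro_winkler("MARTHA", "MARHTA")` evaluates to about 0.961, so `linear_scan_search("MARTHA", ["MARHTA", "DIXON", "MARTHA"], 0.1)` keeps `"MARHTA"` and `"MARTHA"` and discards `"DIXON"`. The scan costs one full verification per string in D, which is the cost a filter-and-verify framework tries to avoid by pruning most candidates before verification.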

Notes

  1. https://github.com/AKSW/LIMES-dev.

Acknowledgement

This work was supported by ARC Discovery Projects DP130103401 and DP170103710.

Author information

Corresponding author

Correspondence to Yaoshu Wang.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wang, Y., Qin, J., Wang, W. (2017). Efficient Approximate Entity Matching Using Jaro-Winkler Distance. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. WISE 2017. Lecture Notes in Computer Science, vol 10569. Springer, Cham. https://doi.org/10.1007/978-3-319-68783-4_16

  • DOI: https://doi.org/10.1007/978-3-319-68783-4_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68782-7

  • Online ISBN: 978-3-319-68783-4

  • eBook Packages: Computer Science, Computer Science (R0)
