Trie-join: a trie-based method for efficient string similarity joins

Feng, Jianhua; Wang, Jiannan; Li, Guoliang

doi:10.1007/s00778-011-0252-8

Regular Paper
Published: 04 October 2011

Volume 21, pages 437–461, (2012)

The VLDB Journal

Jianhua Feng¹, Jiannan Wang¹ & Guoliang Li¹

Abstract

A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can benefit significantly from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient on data sets with short strings (average string length of at most 30); (2) they require large indexes; (3) they are expensive to maintain under dynamic updates of the data sets. To address these problems, we propose a novel method called trie-join, which generates results efficiently with small indexes. We index the strings with a trie and use its structure to find similar string pairs efficiently through subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of the data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
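
To make the idea of subtrie pruning concrete, the sketch below runs a self-join under an edit-distance threshold over a trie. It is not the trie-join algorithm of the paper, whose details are not given in the abstract. It is the simpler, well-known approach of computing one dynamic-programming row per trie node and abandoning a subtrie once every entry in the row exceeds the threshold. All names (`TrieNode`, `build_trie`, `search`, `self_join`, `tau`) are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []  # ids of the strings that end at this node


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a character trie; shared prefixes share nodes."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[int]:
    """Return ids of indexed strings within edit distance tau of query."""
    matches: List[int] = []
    # Row for the empty prefix: distance from "" to each prefix of query.
    first_row = list(range(len(query) + 1))
    if first_row[-1] <= tau:
        matches.extend(root.ids)
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the trie prefix by ch and compute the next DP row.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= tau:
            matches.extend(node.ids)
        # Subtrie pruning: min(row) is a lower bound on the distance to any
        # string below this node, so descend only if it is within tau.
        if min(row) <= tau:
            stack.extend((c, k, row) for k, c in node.children.items())
    return matches


def self_join(strings: List[str], tau: int) -> Set[Tuple[int, int]]:
    """All index pairs (i, j), i < j, whose strings are within distance tau."""
    root = build_trie(strings)
    pairs: Set[Tuple[int, int]] = set()
    for i, s in enumerate(strings):
        for j in search(root, s, tau):
            if i < j:
                pairs.add((i, j))
    return pairs
```

For example, `self_join(["trie", "tree", "tries", "join", "joins", "coin"], 1)` returns the index pairs for (trie, tree), (trie, tries), (join, joins) and (join, coin). This sketch probes the trie once per string. The abstract indicates that the paper's method instead finds the similar pairs using the trie structure itself, which this example does not attempt to reproduce.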

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Jianhua Feng, Jiannan Wang & Guoliang Li

Corresponding author

Correspondence to Jiannan Wang.

About this article

Cite this article

Feng, J., Wang, J. & Li, G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal 21, 437–461 (2012). https://doi.org/10.1007/s00778-011-0252-8

Received: 24 January 2011
Revised: 20 June 2011
Accepted: 25 August 2011
Published: 04 October 2011
Issue Date: August 2012
DOI: https://doi.org/10.1007/s00778-011-0252-8
