Abstract
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can benefit significantly from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (an average string length of at most 30); (2) they involve large indexes; (3) they are expensive to maintain under dynamic updates of the data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We index the strings with a trie and exploit its structure to find similar string pairs efficiently through subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
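To make the idea of a trie index with subtrie pruning concrete, the following is a minimal sketch of an edit-distance self-join. It is not the trie-join algorithm of the paper: it is a simpler baseline that probes the trie once per string, carries one dynamic-programming row per trie node, and skips a subtrie as soon as every entry of that row exceeds the threshold. The names `TrieNode`, `build_trie`, `search`, `similarity_self_join`, and the threshold parameter `tau` are assumptions introduced for illustration.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []  # ids of the strings that end at this node


def build_trie(strings: List[str]) -> TrieNode:
    """Index the strings in a trie; shared prefixes share nodes."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[int]:
    """Return ids of indexed strings within edit distance tau of query."""
    hits: List[int] = []
    n = len(query)
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(n + 1))
    if first_row[-1] <= tau:
        hits.extend(root.ids)
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # row[j] = edit distance between this node's prefix and query[:j].
        row = [prev[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= tau:
            hits.extend(node.ids)
        # Subtrie pruning: the row minimum never decreases along a path,
        # so if it already exceeds tau no descendant can match.
        if min(row) <= tau:
            stack.extend((c, k, row) for k, c in node.children.items())
    return hits


def similarity_self_join(strings: List[str], tau: int) -> Set[Tuple[int, int]]:
    """Return all id pairs (i, j), i < j, with edit distance at most tau."""
    root = build_trie(strings)
    pairs: Set[Tuple[int, int]] = set()
    for i, s in enumerate(strings):
        for j in search(root, s, tau):
            if i < j:
                pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    words = ["ebay", "bay", "bag", "beagy", "kobe"]
    print(sorted(similarity_self_join(words, 1)))
```

In this sketch the trie is the only index, so its size is bounded by the total number of characters, and adding a string only touches the nodes on its own path. Probing once per string still repeats work for strings that share prefixes, which is the kind of redundancy a dedicated join algorithm over the trie would aim to avoid.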
Cite this article
Feng, J., Wang, J. & Li, G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal 21, 437–461 (2012). https://doi.org/10.1007/s00778-011-0252-8
DOI: https://doi.org/10.1007/s00778-011-0252-8