
Estimating the selectivity of approximate string queries

Authors:
Arturas Mazeika, Free University of Bozen-Bolzano, Bozen-Bolzano BZ, Italy
Michael H. Böhlen, Free University of Bozen-Bolzano, Bozen-Bolzano BZ, Italy
Nick Koudas, University of Toronto, Toronto, Ontario
Divesh Srivastava, AT&T Labs--Research, Florham Park, NJ

ACM Transactions on Database Systems, Volume 32, Issue 2, pp 12–es, https://doi.org/10.1145/1242524.1242529

Published: 01 June 2007


Abstract

Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.
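
The decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of unpadded q-grams, the use of list positions as string IDs, and the SHA-256-based hash family are all assumptions made for illustration. The sketch builds the inverse string of each q-gram and compresses it into a small min-wise hash signature; the clustering step is omitted.

```python
from collections import defaultdict
import hashlib


def qgrams(s, q=3):
    """Return the overlapping substrings of length q of s (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def inverse_strings(strings, q=3):
    """Map each q-gram to the set of IDs of the strings that contain it.

    IDs are simply positions in the input list (an assumption).
    """
    inv = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            inv[gram].add(sid)
    return dict(inv)


def _hash(seed, value):
    """Hypothetical seeded hash: first 8 bytes of SHA-256 over 'seed:value'."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(ids, num_hashes=4):
    """Min-wise hash signature: for each seeded hash, keep the minimum over the set."""
    return [min(_hash(seed, i) for i in ids) for seed in range(num_hashes)]


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe"]
    inv = inverse_strings(names, q=2)
    for gram in sorted(inv):
        # Each line shows a q-gram, its inverse string, and its signature.
        print(gram, sorted(inv[gram]), minhash_signature(inv[gram]))
```

A signature has a fixed length regardless of how many IDs the inverse string holds, and two q-grams with identical inverse strings receive identical signatures, which is what makes signatures suitable for grouping similar inverse strings.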

We study our technique analytically and experimentally. The space complexity of our estimator depends only on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear in the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.



Published in

ACM Transactions on Database Systems Volume 32, Issue 2
June 2007
267 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1242524

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Inverse strings
min-wise hash signatures
q-grams