Efficient and tunable similar set retrieval

Authors:
Aristides Gionis (Stanford University),
Dimitrios Gunopulos (University of California, Riverside),
Nick Koudas (AT&T Laboratories)

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, May 2001, Pages 247–258. https://doi.org/10.1145/375663.375689

Published: 01 May 2001

ABSTRACT

Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle we create structures that make use of well known and widely used data structuring techniques, as a means to ease integration with existing infrastructure.

We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing binary vectors in Hamming space, where the vectors are encoded in a way that preserves similarity, thus reducing the problem to one of similarity query processing in Hamming space. Then, we introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques on real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
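
The abstract does not say which encoding is used. As a minimal sketch of what a similarity-preserving mapping from sets to binary vectors can look like, the following assumes Jaccard similarity J and keeps one bit from each of `num_bits` min-hash values, so that the expected fraction of differing bits between two encoded sets is (1 - J) / 2. The function names, the SHA-256-based hash, and the default `num_bits = 256` are illustrative assumptions, not the paper's construction.

```python
import hashlib
from typing import Iterable, List


def _hash(seed: int, item: object) -> int:
    """Deterministic 64-bit hash of (seed, item); stands in for a random permutation."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def encode(items: Iterable, num_bits: int = 256) -> List[int]:
    """Encode a non-empty set as a binary vector, one bit per hash function."""
    elements = list(items)
    bits = []
    for seed in range(num_bits):
        # Min-hash of the set under this seed: two sets share it with probability J.
        min_hash = min(_hash(seed, x) for x in elements)
        # Keep only the low bit; differing min-hashes agree on it about half the time.
        bits.append(min_hash & 1)
    return bits


def hamming(u: List[int], v: List[int]) -> int:
    """Number of positions where two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def estimate_jaccard(u: List[int], v: List[int]) -> float:
    """Invert E[hamming / num_bits] = (1 - J) / 2 to estimate J."""
    return 1.0 - 2.0 * hamming(u, v) / len(u)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(10, 110))      # overlaps heavily with a
    c = set(range(1000, 1100))   # disjoint from a
    va, vb, vc = encode(a), encode(b), encode(c)
    print("similar pair :", hamming(va, vb), estimate_jaccard(va, vb))
    print("disjoint pair:", hamming(va, vc), estimate_jaccard(va, vc))
```

Under these assumptions, sets with higher Jaccard similarity land closer together in Hamming space, so a similarity query over sets can be answered by a Hamming-distance query over the encoded vectors.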


Index Terms

Efficient and tunable similar set retrieval
1. Information systems
  1. Data management systems
    1. Database design and models
  2. Information retrieval
    1. Retrieval models and ranking

Published in
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001, 630 pages
ISBN: 1581133324
DOI: 10.1145/375663
Editors: Timos Sellis, Sharad Mehrotra

ACM SIGMOD Record, Volume 30, Issue 2
June 2001, 625 pages
ISSN: 0163-5808
DOI: 10.1145/376284
Editors: Timos Sellis (National Technical Univ. of Athens), Sharad Mehrotra (Univ. of California at Irvine)

Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGMOD '01 Paper Acceptance Rate: 44 of 293 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%