Article DOI: 10.1145/375663.375689

Efficient and tunable similar set retrieval

Published: 01 May 2001

ABSTRACT

Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle we create structures that make use of well known and widely used data structuring techniques, as a means to ease integration with existing infrastructure.

We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing binary vectors in Hamming space, where the vectors are encoded in a way that preserves similarity. This turns the problem into one of similarity query processing in Hamming space. We then introduce and analyze two data structure primitives that we use together to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constrained optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques on real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
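The idea of a similarity-preserving binary encoding can be illustrated with a small Python sketch. The abstract does not specify the encoding, so everything below is an assumption made for illustration: similarity is taken to be Jaccard similarity, and each bit of the code is the low bit of a seeded min-hash of the set (a one-bit variant of min-wise hashing). The function names `jaccard`, `encode`, `hamming`, and `estimated_similarity` are hypothetical and are not taken from the paper.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def encode(items, num_bits=256):
    """Encode a non-empty set as a binary vector of length num_bits.

    Bit i is the low bit of the minimum hash value of the set under seed i.
    This is an assumed encoding, not necessarily the one used in the paper.
    """
    items = set(items)
    if not items:
        raise ValueError("cannot encode an empty set")
    return [min(_hash(seed, x) for x in items) & 1 for seed in range(num_bits)]


def hamming(u, v):
    """Number of positions in which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(u, v))


def estimated_similarity(u, v):
    """Estimate Jaccard similarity from the Hamming distance of two codes.

    Two min-hashes agree with probability J; when they disagree, their low
    bits differ about half the time, so E[hamming / n] is roughly (1 - J) / 2.
    """
    return 1.0 - 2.0 * hamming(u, v) / len(u)


if __name__ == "__main__":
    a = set(range(100))
    b = set(range(5, 105))       # heavy overlap with a
    c = set(range(1000, 1100))   # disjoint from a
    print("exact:", jaccard(a, b), "estimated:", estimated_similarity(encode(a), encode(b)))
    print("exact:", jaccard(a, c), "estimated:", estimated_similarity(encode(a), encode(c)))
```

Under this assumed encoding, sets with high Jaccard similarity map to vectors at small Hamming distance, so a similarity query over sets becomes a nearest-neighbor style query over binary vectors. The estimate is probabilistic and approximate, and its accuracy depends on the number of bits chosen.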


Published in

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN: 1581133324
DOI: 10.1145/375663

Copyright © 2001 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '01 paper acceptance rate: 44 of 293 submissions, 15%
Overall acceptance rate: 785 of 4,003 submissions, 20%
