ABSTRACT
Set value attributes are a concise and natural way to model complex data sets. Modern object-relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle, we build our structures from well-known and widely used data structuring techniques, to ease integration with existing infrastructure.
We show how the problem of indexing a collection of sets based on similarity can be reduced to indexing binary vectors in Hamming space, where the vectors are encoded in a way that preserves similarity. The task thereby becomes one of similarity query processing in Hamming space. We then introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constrained optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques using real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
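The reduction from set similarity to Hamming space can be illustrated with a minimal sketch. It is not the encoding used in the paper. It assumes Jaccard similarity as the notion of set similarity, uses min-hash signatures, and keeps a single bit of each min-hash value as a simple stand-in for a similarity-preserving binary encoding. All function names here (`minhash_signature`, `to_bits`, `estimate_similarity`) are hypothetical and chosen only for illustration.

```python
import hashlib


def _h(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of (seed, item); one seed per hash function."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |a & b| / |a | b| (assumed similarity notion)."""
    return len(a & b) / len(a | b)


def minhash_signature(s: set, k: int) -> list:
    """k min-hash values of a non-empty set. Two sets agree on a given
    coordinate with probability approximately equal to their Jaccard similarity."""
    return [min(_h(seed, x) for x in s) for seed in range(k)]


def to_bits(signature: list) -> list:
    """Illustrative binary encoding: keep the lowest bit of each min-hash value.
    Equal values give equal bits; different values agree about half the time."""
    return [v & 1 for v in signature]


def hamming(u: list, v: list) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def estimate_similarity(u: list, v: list) -> float:
    """Invert P(bit agrees) = s + (1 - s) / 2 to estimate the similarity s."""
    agree = 1.0 - hamming(u, v) / len(u)
    return max(0.0, 2.0 * agree - 1.0)


if __name__ == "__main__":
    base = {f"item{i}" for i in range(100)}
    near = {f"item{i}" for i in range(10, 110)}     # overlaps heavily with base
    far = {f"other{i}" for i in range(100)}         # disjoint from base

    k = 512
    b_base, b_near, b_far = (to_bits(minhash_signature(s, k)) for s in (base, near, far))

    print("exact Jaccard (base, near):", jaccard(base, near))
    print("estimated     (base, near):", estimate_similarity(b_base, b_near))
    print("Hamming (base, near):", hamming(b_base, b_near))
    print("Hamming (base, far): ", hamming(b_base, b_far))
```

Under these assumptions, similar sets map to bit vectors at small Hamming distance and dissimilar sets to vectors at distance near k/2, so a similarity query over sets becomes a range or nearest-neighbor query over binary vectors. Keeping one bit per min-hash value trades accuracy for space, and more signature coordinates tighten the estimate.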