DOI: 10.1145/3318464.3380601
Research Article · Open Access

Continuously Adaptive Similarity Search

Published: 31 May 2020

ABSTRACT

Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on (i) a distance metric learned from input examples and (ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, that instantaneously adapts the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property: an LSH index built on the original distance metric is equally effective at supporting similarity search under an updated distance metric, as long as the transform matrix learned for the new metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs arrive. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving a speedup of up to three orders of magnitude over state-of-the-art techniques.
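To make the problem setup concrete, the following is a minimal, illustrative sketch (not the paper's code, and not its invariance theorem): a learned Mahalanobis metric A = GᵀG reduces to Euclidean distance between transformed points x → Gx, so a naive system would re-hash every indexed item with Gx after each metric update. The sketch also shows the standard algebraic identity a·(Gx) = (Gᵀa)·x for an E2LSH-style hash, which hints at why the transform can sometimes be absorbed into the k hash vectors instead of re-hashing all n items; the exact conditions OASIS proves are in the full paper. All names here are hypothetical.

```python
# Illustrative sketch of the setup behind adaptive similarity search.
# Assumes NumPy; not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
d, w = 8, 4.0

G = rng.standard_normal((d, d))   # a learned linear transform (hypothetical)
A = G.T @ G                       # induced Mahalanobis matrix (positive semidefinite)

def mahalanobis(x, y, A):
    """Distance under the learned metric: sqrt((x-y)^T A (x-y))."""
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))

def lsh_hash(x, a, b, w=w):
    """E2LSH-style hash for Euclidean distance: floor((a.x + b) / w)."""
    return int(np.floor((a @ x + b) / w))

x, y = rng.standard_normal(d), rng.standard_normal(d)
a, b = rng.standard_normal(d), rng.uniform(0, w)

# The learned metric equals Euclidean distance between transformed points,
# which is why naive re-indexing would recompute Gx for every item.
d_metric = mahalanobis(x, y, A)
d_euclid = float(np.linalg.norm(G @ x - G @ y))

# Hashing a transformed point with vector a equals hashing the original
# point with the transformed vector G^T a, so updating the k hash vectors
# can stand in for re-hashing all n indexed items.
h_transformed_point = lsh_hash(G @ x, a, b)
h_transformed_vec = lsh_hash(x, G.T @ a, b)
```

The identity in the last two lines is only the algebraic starting point; deciding when the existing index's collision-probability guarantees carry over to the new metric is the paper's contribution.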


Supplemental Material

3318464.3380601.mp4 (mp4, 134.8 MB)


Published in

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020, 2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

      Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions, 20%
