ABSTRACT
Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, that instantaneously adapts the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property: an LSH index built on the original distance metric is equally effective at supporting similarity search under an updated distance metric, as long as the transform matrix learned for the new metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving up to three orders of magnitude speedup over state-of-the-art techniques.
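The invariance the abstract refers to can be illustrated with a short sketch. This is not the paper's implementation; it only shows the standard identity that a Mahalanobis distance under a learned metric M = GᵀG equals the Euclidean distance after applying the linear transform G, so an LSH family for Euclidean distance (here the p-stable hash of Datar et al., h(v) = ⌊(a·v + b)/w⌋) applied to transformed points supports search under the new metric. The names G, M, w, a, b and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hypothetical learned metric M = G^T G (G invertible, so M is positive definite)
G = rng.normal(size=(d, d))
M = G.T @ G

x, y = rng.normal(size=d), rng.normal(size=d)

# Mahalanobis distance under M ...
dist_M = np.sqrt((x - y) @ M @ (x - y))
# ... equals Euclidean distance between the transformed points
dist_E = np.linalg.norm(G @ x - G @ y)
assert np.allclose(dist_M, dist_E)

# E2LSH-style hash for Euclidean distance: h(v) = floor((a.v + b) / w)
w = 4.0                       # bucket width (illustrative choice)
a = rng.normal(size=d)        # random projection direction
b = rng.uniform(0, w)         # random offset
def h(v):
    return int(np.floor((a @ v + b) / w))

# Hashing G-transformed points indexes the data under the updated metric,
# without changing the hash functions themselves.
print(h(G @ x), h(G @ y))
```

Because only the input to the hash changes (v → Gv), the same hash tables and parameters can serve the updated metric whenever the transform meets the conditions the paper establishes, which is what lets OASIS skip full re-indexing.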
Index Terms
- Continuously Adaptive Similarity Search