ABSTRACT
Similarity search is the basis for many data analytics techniques, including k-nearest neighbor classification and outlier detection. Similarity search over large data sets relies on i) a distance metric learned from input examples and ii) an index to speed up search based on the learned distance metric. In interactive systems, input to guide the learning of the distance metric may be provided over time. As this new input changes the learned distance metric, a naive approach would adopt the costly process of re-indexing all items after each metric change. In this paper, we propose the first solution, called OASIS, that instantaneously adapts the index to conform to a changing distance metric without this prohibitive re-indexing process. To achieve this, we prove that locality-sensitive hashing (LSH) provides an invariance property: an LSH index built on the original distance metric is equally effective at supporting similarity search under an updated distance metric, as long as the transform matrix learned for the new metric satisfies certain properties. This observation allows OASIS to avoid recomputing the index from scratch in most cases. Further, for the rare cases when an adaptation of the LSH index is shown to be necessary, we design an efficient incremental LSH update strategy that re-hashes only a small subset of the items in the index. In addition, we develop an efficient distance metric learning strategy that incrementally learns the new metric as inputs are received. Our experimental study using real-world public datasets confirms the effectiveness of OASIS at improving the accuracy of various similarity search-based data analytics tasks by instantaneously adapting the distance metric and its associated index in tandem, while achieving up to three orders of magnitude speedup over state-of-the-art techniques.
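The invariance the abstract refers to can be illustrated with a short sketch. This is not the paper's implementation; it only shows the standard identity that a Mahalanobis distance under a learned metric M = GᵀG equals the Euclidean distance after applying the linear transform G, so an LSH family for Euclidean distance (here the p-stable hash of Datar et al., h(v) = ⌊(a·v + b)/w⌋) applied to transformed points supports search under the new metric. The names G, M, w, a, b and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hypothetical learned metric M = G^T G (G invertible, so M is positive definite)
G = rng.normal(size=(d, d))
M = G.T @ G

x, y = rng.normal(size=d), rng.normal(size=d)

# Mahalanobis distance under M ...
dist_M = np.sqrt((x - y) @ M @ (x - y))
# ... equals Euclidean distance between the transformed points
dist_E = np.linalg.norm(G @ x - G @ y)
assert np.allclose(dist_M, dist_E)

# E2LSH-style hash for Euclidean distance: h(v) = floor((a.v + b) / w)
w = 4.0                       # bucket width (illustrative choice)
a = rng.normal(size=d)        # random projection direction
b = rng.uniform(0, w)         # random offset
def h(v):
    return int(np.floor((a @ v + b) / w))

# Hashing G-transformed points indexes the data under the updated metric,
# without changing the hash functions themselves.
print(h(G @ x), h(G @ y))
```

Because only the input to the hash changes (v → Gv), the same hash tables and parameters can serve the updated metric whenever the transform meets the conditions the paper establishes, which is what lets OASIS skip full re-indexing.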
Index Terms
- Continuously Adaptive Similarity Search