Exact and approximate flexible aggregate similarity search

Li, Feifei; Yi, Ke; Tao, Yufei; Yao, Bin; Li, Yang; Xie, Dong; Wang, Min

doi:10.1007/s00778-015-0418-x

Regular Paper
Published: 05 January 2016

The VLDB Journal, Volume 25, pages 317–338 (2016)

Feifei Li¹,
Ke Yi²,
Yufei Tao³,
Bin Yao⁴,
Yang Li⁴,
Dong Xie⁴ &
Min Wang⁵

Abstract

Aggregate similarity search, also known as the aggregate nearest-neighbor (Ann) query, has many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the similarity is an aggregation (e.g., \({{\mathrm{sum}}}\), \(\max \)) of the distances between each retrieved object p and all the objects in Q. In this paper, we add flexibility to the query definition: the similarity is an aggregation over the distances between p and any subset of \(\phi M\) objects in Q, for some support \(0< \phi \le 1\). We call this new definition flexible aggregate similarity search and accordingly refer to a query as a flexible aggregate nearest-neighbor (Fann) query. We present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing: they are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions demonstrate their superior efficiency and high quality.
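To make the query definition concrete, the following is a minimal brute-force sketch in Python, not one of the algorithms proposed in the paper. It assumes points are coordinate tuples under Euclidean distance and that \(\phi M\) is rounded up to an integer; the function name `fann_brute_force` and its parameters are hypothetical. It relies on the fact that, for both \({{\mathrm{sum}}}\) and \(\max \), the best subset for a fixed p consists of the \(\phi M\) objects of Q closest to p.

```python
import math

def fann_brute_force(P, Q, phi, agg="sum"):
    """Exhaustive Fann baseline (illustrative only, hypothetical helper).

    P, Q: lists of coordinate tuples; phi: support in (0, 1].
    Returns (cost, p, subset) minimizing the aggregate distance from p
    to its best subset of ceil(phi * M) query objects.
    """
    if agg not in ("sum", "max"):
        raise ValueError("agg must be 'sum' or 'max'")
    # Assumption: phi * M is rounded up; the small slack guards against
    # floating-point noise such as 4.000000001.
    k = max(1, math.ceil(phi * len(Q) - 1e-9))
    best = None
    for p in P:
        # For sum and max, the optimal subset is the k query objects nearest to p.
        subset = sorted(Q, key=lambda q: math.dist(p, q))[:k]
        dists = [math.dist(p, q) for q in subset]
        cost = sum(dists) if agg == "sum" else max(dists)
        if best is None or cost < best[0]:
            best = (cost, p, subset)
    return best
```

This baseline scans every object in P and sorts Q each time, so it is useful only as a reference for checking the answers of faster exact or approximate methods on small inputs.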

Acknowledgments

Feifei Li was supported in part by NSF Grants 1053979 and 1251019, and a Google Faculty Award. Ke Yi was supported by HKRGC Grants GRF-621413, GRF-16211614, and GRF-16200415, and by a Microsoft Grant MRA14EG05. Yufei Tao was supported in part by GRF Grants 142072/14 and 142012/15 from HKRGC. Bin Yao was supported by the National Basic Research Program (973 Program, No. 2015CB352403), the NSFC (No. 61202025), and the EU FP7 CLIMBER Project (No. PIRSES-GA-2012-318939). Feifei Li and Bin Yao were also supported by NSFC Grant 61428204.

Author information

Authors and Affiliations

University of Utah, Salt Lake City, UT, USA
Feifei Li
Hong Kong University of Science and Technology, Hong Kong, China
Ke Yi
Chinese University of Hong Kong, Hong Kong, China
Yufei Tao
Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, Shanghai, China
Bin Yao, Yang Li & Dong Xie
Visa Research, Visa Inc., Foster City, CA, USA
Min Wang

Corresponding author

Correspondence to Bin Yao.

Appendix: Tightness of Amax

We show that the \((1+2\sqrt{2})\) approximation ratio of Amax is tight by giving a concrete example. Consider the configuration in Fig. 25, where \(\epsilon \) is an arbitrarily small positive value.

In this case, \(M=8\) and \(\phi =0.5\), hence \(\phi M=4\) and \(\phi M-1=3\). Consider \(q_1\): its 3 nearest neighbors in Q are \(\{q_2, q_3, q_4\}\), hence \(Q_{\phi }^{q_1}=\{q_1, q_2, q_3, q_4\}\). Note that \({{\mathrm{MEB}}}(\{q_1, q_2, q_3, q_4\})=\mathcal {B}(c_1, \sqrt{2} r^*)\), and \({{\mathrm{nn}}}(c_1, P)=p_2\). Now, \(p_2\)’s 4 nearest neighbors in Q are \(\{q_4, q_3, q_2, q_1\}\). Hence, \(Q_{\phi }^{p_2}=\{q_4, q_3, q_2, q_1\}\) and \(r_{p_2}=\max (p_2, \{q_4, q_3, q_2, q_1\})=(1+2\sqrt{2})r^*-\epsilon \).

It is easy to verify that \(q_2\), \(q_3\) and \(q_4\) yield the same result as \(q_1\), since \(Q_{\phi }^{q_2}\), \(Q_{\phi }^{q_3}\) and \(Q_{\phi }^{q_4}\) are the same as \(Q_{\phi }^{q_1}=\{q_1, q_2, q_3, q_4\}\). Furthermore, \(q_5\), \(q_6\), \(q_7\) and \(q_8\) are symmetric to \(q_1\), \(q_2\), \(q_3\) and \(q_4\), and \(p_3\) is symmetric to \(p_2\). Thus, \(q_5\) through \(q_8\) yield \((p_3, Q_\phi ^{p_3})\) as the answer, with \(r_{p_3}=\max (p_3, \{q_5, q_6, q_7, q_8\})=(1+2\sqrt{2})r^*-\epsilon \).

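As a result, Amax will return either \((p_2, Q_\phi ^{p_2})\) or \((p_3, Q_\phi ^{p_3})\) as the answer, with \(r_{p_2}=r_{p_3}=(1+2\sqrt{2})r^*-\epsilon \). But in this case \(p^*=p_1\), \(Q_\phi ^*=\{q_1, q_2, q_3, q_4\}\), and \(\max (p^*, Q_\phi ^*)=r^*\). The ratio between the returned and the optimal aggregate distance is therefore \((1+2\sqrt{2})-\epsilon /r^*\), which approaches \(1+2\sqrt{2}\) as \(\epsilon \rightarrow 0\).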
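The steps traced in this example (take each \(q\in Q\) with its \(\phi M-1\) nearest neighbors in Q, compute the minimum enclosing ball of that group, find the nearest neighbor in P of the ball center, then evaluate that candidate against its own \(\phi M\) nearest query objects) can be sketched as follows. This is a reconstruction from the worked example only, not the authors' implementation. The names `amax_sketch`, `approx_meb_center` and `k_nearest` are hypothetical, nearest-neighbor searches are done by linear scan, and the ball center is approximated with the Bădoiu–Clarkson iteration, a swapped-in routine that may not match the MEB method used in the paper and may affect the stated guarantee.

```python
import math

def approx_meb_center(points, iters=1000):
    """Approximate minimum-enclosing-ball center (Badoiu-Clarkson iteration).

    Assumption: swapped in for illustration; not necessarily the paper's MEB routine.
    """
    c = list(points[0])
    for i in range(1, iters + 1):
        # Move the center a shrinking step toward the current farthest point.
        far = max(points, key=lambda q: math.dist(c, q))
        c = [ci + (fi - ci) / (i + 1) for ci, fi in zip(c, far)]
    return tuple(c)

def k_nearest(x, pts, k):
    """The k points of pts closest to x (linear scan)."""
    return sorted(pts, key=lambda q: math.dist(x, q))[:k]

def amax_sketch(P, Q, phi):
    """Sketch of the Amax steps as traced in the appendix example.

    Returns (r_p, p, Q_phi_p) for the best candidate found.
    """
    # Assumption: phi * M is rounded up to an integer.
    k = max(1, math.ceil(phi * len(Q) - 1e-9))
    best = None
    for q in Q:
        # q is at distance 0 from itself, so this is q plus its k-1 nearest neighbors in Q.
        group = k_nearest(q, Q, k)
        c = approx_meb_center(group)
        p = min(P, key=lambda x: math.dist(c, x))   # nn(c, P)
        support = k_nearest(p, Q, k)                # Q_phi^p
        r_p = max(math.dist(p, s) for s in support)
        if best is None or r_p < best[0]:
            best = (r_p, p, support)
    return best
```

On well-separated inputs the sketch typically returns the same object as an exhaustive search; the example in this appendix shows that on adversarial configurations the returned \(r_p\) can be close to \((1+2\sqrt{2})r^*\).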

About this article

Cite this article

Li, F., Yi, K., Tao, Y. et al. Exact and approximate flexible aggregate similarity search. The VLDB Journal 25, 317–338 (2016). https://doi.org/10.1007/s00778-015-0418-x

Received: 01 May 2015
Revised: 15 October 2015
Accepted: 11 December 2015
Published: 05 January 2016
Issue Date: June 2016
DOI: https://doi.org/10.1007/s00778-015-0418-x
