ABSTRACT
Cloud computing systems must process and analyze massive high-dimensional data in real time. Approximate queries in these systems can return timely results with acceptable accuracy, thereby avoiding the consumption of large amounts of resources. Locality Sensitive Hashing (LSH) preserves data locality and supports approximate queries. However, because its hash functions are chosen randomly, LSH needs many functions to guarantee query accuracy, and the resulting computation and storage overheads degrade its practical performance. To reduce these overheads and deliver high performance, we propose a distribution-aware scheme, called DLSH, that offers a cost-effective approximate nearest neighbor query service for cloud computing. The idea of DLSH is to use the principal components of the data distribution as the projection vectors of the hash functions in LSH, to quantify the weight of each hash function, and to adjust the interval value in each hash table. We then refine the queried result set based on hit frequency, which significantly decreases the time overhead of distance computation. Extensive experiments on a large-scale cloud computing testbed demonstrate significant improvements in multiple system performance metrics. We have released the source code of DLSH for public use.
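The abstract gives only the high-level idea, so the following is a minimal sketch of how the four ingredients (principal-component projections, per-function weights, per-table intervals, and hit-frequency refinement) might fit together. It is not the released DLSH implementation. The class name `PcaLshSketch`, the one-component-per-table layout, the variance-share weights, and the rule that scales each interval by the spread along its direction are all assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict


class PcaLshSketch:
    """Toy index with one hash table per principal component (an assumption)."""

    def __init__(self, data, num_tables=4, base_width=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=float)
        self.mean = self.data.mean(axis=0)
        centered = self.data - self.mean
        # Rows of vt are the principal directions, ordered by variance.
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        self.directions = vt[:num_tables]
        variances = s[:num_tables] ** 2 / (len(self.data) - 1)
        # Assumed weighting: share of variance captured by each direction.
        self.weights = variances / variances.sum()
        # Assumed interval rule: bucket width proportional to the spread
        # along each direction (guarded against zero variance).
        self.widths = np.maximum(base_width * np.sqrt(variances), 1e-12)
        self.offsets = rng.uniform(0.0, self.widths)
        self.tables = [defaultdict(list) for _ in range(len(self.directions))]
        for idx, keys in enumerate(self._keys(self.data)):
            for table, key in zip(self.tables, keys):
                table[int(key)].append(idx)

    def _keys(self, points):
        # h(x) = floor((a . (x - mean) + b) / w), one key per table.
        proj = (np.atleast_2d(points) - self.mean) @ self.directions.T
        return np.floor((proj + self.offsets) / self.widths).astype(int)

    def query(self, q, k=1, max_candidates=20):
        q = np.asarray(q, dtype=float)
        # Accumulate a weighted hit count for every colliding point.
        scores = defaultdict(float)
        for table, key, w in zip(self.tables, self._keys(q)[0], self.weights):
            for idx in table.get(int(key), ()):
                scores[idx] += w
        # Refinement: compute exact distances only for the candidates
        # with the highest weighted hit frequency.
        ranked = sorted(scores, key=scores.get, reverse=True)[:max_candidates]
        if not ranked:
            return []
        cand = np.array(ranked)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]].tolist()
```

Building `PcaLshSketch(data)` on an `(n, d)` array and calling `query(q, k=5)` returns up to five indices into `data`. Only the `max_candidates` points with the highest weighted hit counts reach the exact distance computation, which is the step the abstract credits with reducing distance-computation time.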
Index Terms
- DLSH: a distribution-aware LSH scheme for approximate nearest neighbor query in cloud computing