
On the Hardness and Approximation of Euclidean DBSCAN

Authors:
Junhao Gan, University of Queensland, St Lucia, Brisbane, Australia
Yufei Tao, Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong

ACM Transactions on Database Systems, Volume 42, Issue 3, Article No. 14, pp. 1–45. https://doi.org/10.1145/3083897

Published: 31 July 2017

Abstract

DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and it has been applied extensively. Its computational hardness remains unresolved to this date. The original KDD'96 paper claimed an algorithm with O(n log n) "average runtime complexity" (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness for dimensionality d ≥ 3 has remained open ever since.
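For orientation, the clustering problem discussed here can be stated as a brute-force baseline. The sketch below is not the algorithm of the KDD'96 paper or of this article; it is a minimal quadratic-time illustration, assuming the standard DBSCAN parameters (called `eps` and `min_pts` here, names chosen for this example) and assigning each border point to the first cluster that reaches it.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dbscan_bruteforce(points, eps, min_pts):
    """Label each point with a cluster id, or None for noise.

    Computes all pairwise distances, so it takes O(n^2) time.
    """
    n = len(points)
    # neighbors[i]: indices of points within distance eps of point i (i included)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its eps-ball
    core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        # Grow a new cluster from the unlabeled core point i
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster_id
                    if core[q]:  # only core points expand the cluster
                        stack.append(q)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_bruteforce(pts, eps=1.5, min_pts=3)
print(labels)
```

The quadratic neighbor computation is what faster algorithms try to avoid; the open question stated above is how far below this bound one can go for d ≥ 3.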

This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are "unstable" (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d.
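The notion of "unstable" clusters can be made concrete with a small experiment. The sketch below is one informal reading of the abstract, not the article's O(n)-expected-time algorithm: it runs a brute-force exact DBSCAN twice, once with radius `eps` and once with `eps * (1 + rho)`, and reports whether the clusters differ. The function names, the parameter names `eps` and `min_pts`, and the choice to perturb only the radius are assumptions made for this illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dbscan_clusters(points, eps, min_pts):
    """Brute-force exact DBSCAN; returns clusters as a set of frozensets of indices.

    A border point is included in every cluster whose core points reach it.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core_set = {i for i in range(n) if len(neighbors[i]) >= min_pts}

    seen = set()
    clusters = set()
    for s in sorted(core_set):
        if s in seen:
            continue
        seen.add(s)
        members = set()
        stack = [s]
        while stack:
            p = stack.pop()
            members.update(neighbors[p])  # a core point pulls in its whole eps-ball
            for q in neighbors[p]:
                if q in core_set and q not in seen:
                    seen.add(q)
                    stack.append(q)
        clusters.add(frozenset(members))
    return clusters

def is_stable(points, eps, min_pts, rho):
    """True if enlarging the radius from eps to eps * (1 + rho) leaves the clusters unchanged."""
    exact = dbscan_clusters(points, eps, min_pts)
    perturbed = dbscan_clusters(points, eps * (1 + rho), min_pts)
    return exact == perturbed

# Two tight groups whose closest pair of points is 1.6 apart
pts = [(0, 0), (1, 0), (0, 1), (2.6, 0), (3.6, 0), (2.6, 1)]
print(is_stable(pts, eps=1.5, min_pts=3, rho=0.01))  # gap 1.6 still exceeds 1.515
print(is_stable(pts, eps=1.5, min_pts=3, rho=0.1))   # gap 1.6 is within 1.65, groups merge
```

When the two runs agree, a small perturbation of the radius does not change the output, which is the situation in which the abstract says the approximate and exact results coincide.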

The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.

Published in
ACM Transactions on Database Systems Volume 42, Issue 3
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence
September 2017
220 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3129336
Editor:
Christian S. Jensen
Aalborg University, Denmark
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 31 July 2017
- Revised: 1 April 2017
- Accepted: 1 April 2017
- Received: 1 April 2016

Author Tags
DBSCAN
algorithms
computational geometry
density-based clustering
hopcroft hard
Qualifiers
- research-article
- Research
- Refereed