Abstract
DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has been widely applied. Its computational hardness remains unresolved to this date. The original KDD'96 paper claimed an algorithm of O(n log n) "average runtime complexity" (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥ 3 has remained open ever since.
This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are "unstable" (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d.
The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.
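For orientation, the exact problem discussed above can be stated concretely with a brute-force baseline. The sketch below is not any of the algorithms from the article: it is a quadratic-time reference implementation of the standard DBSCAN definition, included only to show what "computing DBSCAN clusters from scratch" asks for. The names `dbscan_bruteforce`, `eps`, `min_pts`, and `NOISE` are illustrative assumptions, and a border point that could join several clusters is simply given to the first cluster that reaches it.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # hypothetical label for points that belong to no cluster


def dbscan_bruteforce(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Reference sketch only: O(n^2) distance computations, no index.
    """
    n = len(points)
    # eps-neighborhood of every point by exhaustive comparison;
    # a point is counted as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points within distance eps.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != NOISE:
            continue
        # Grow a new cluster from an unlabeled core point.
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            i = queue.popleft()  # only core points are ever enqueued
            for j in neighbors[i]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    if is_core[j]:
                        queue.append(j)  # expand only through core points
        cluster_id += 1
    return labels
```

Calling `dbscan_bruteforce(points, eps, min_pts)` on a list of coordinate tuples returns one label per point. The cost is dominated by the pairwise neighborhood computation, which is the part that faster algorithms aim to avoid.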