ABSTRACT
The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with two-way hyper-threading show that our implementations outperform existing parallel implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms.