Research on Parallel DBSCAN Algorithm Design Based on MapReduce

Article Preview

Abstract:

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

You might also be interested in these eBooks

Info:

Periodical:

Advanced Materials Research (Volumes 301-303)

Pages:

1133-1138

Citation:

Online since:

July 2011

Export:

Price:

[1] J. MacQueen: Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. 5th Berkeley Symp. Math. Statist, Prob., 1: 281-297(1967).

Google Scholar

[2] R. Ng, J. Han: Efficient and Effective Clustering Method for Spatial Data Mining. In: Proc. 1994 Int. Conf. Very large Data Bases, pp.144-155. Santiago, Chile (1994).

Google Scholar

[3] S. Guha, R. Rastogi, K. Shim: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pp.73-84. Seattle, WA (1998).

DOI: 10.1145/276305.276312

Google Scholar

[4] G. Karypis, E. H. Han, V. Kumar: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Computer, 32: 68-75 (1999).

DOI: 10.1109/2.781637

Google Scholar

[5] M. Ester, H. Kriege, J. Sander, X. Xu: A Density-based Algorithm for Discovering Clusters in Large Spatial Databases. In: Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining, pp.226-231. Portland, OR (1996).

Google Scholar

[6] W. Wang, J. Yang, R. Muntz: STING: A Statistical Information Grid Approach to Spatial Data Mining. In: Proc. 1997 Int. Conf. Very Large Data Bases, pp.186-195. Athens, Greece (1997).

Google Scholar

[7] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, pp.94-105. Seattle, WA (1998).

DOI: 10.1145/276305.276314

Google Scholar

[8] E. Rasmussen, P. Willett, P.: Efficiency of Hierarchical Agglomerative Clustering Using the ICL Distributed Array Processor. Journal of Documentation, 45(1): 1-24. (1989).

DOI: 10.1108/eb026836

Google Scholar

[9] X. Li, Z. Fang: Parallel Clustering Algorithms. Parallel Computing, 11: 275-290. (1989).

DOI: 10.1016/0167-8191(89)90036-7

Google Scholar

[10] C. Olson: Parallel Algorithms for Hierarchical Clustering. Parallel Computing, 21(8): 1313-1325. (1995).

DOI: 10.1016/0167-8191(95)00017-i

Google Scholar

[11] J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. In: Proc. of Operating Systems Design and Implementation, pp.137-150. San Francisco, CA (2004).

Google Scholar

[12] Hadoop: Open source implementation of MapReduce. http: /lucene. apache. org/hadoop.

Google Scholar

[13] S. Ghemawat, H. Gobioff, S. Leung: The Google File System. Symposium on Operating Systems Principles, pp.29-43 (2003).

DOI: 10.1145/945445.945450

Google Scholar

[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. (2007).

Google Scholar

[15] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis: Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proc. of 13th Int. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ (2007).

DOI: 10.1109/hpca.2007.346181

Google Scholar

[16] R. Lammel.: Google's MapReduce Programming Model - Revisited. Science of Computer Programming, 70: 1-30 (2008).

Google Scholar

[17] X. Xu, J. Jager, H. Kriegel: A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3: 263-290 (1999).

DOI: 10.1007/0-306-47011-x_3

Google Scholar