Parallel Semi-Supervised Big Data Clustering Based on Mapreduce Technology
Amsaveni M1, Duraisamy S2
1Amsaveni M, Department of Computer Science, AVP College of Arts and Science, Tirupur, India.
2Duraisamy S, Department of Computer Science, Chikkanna Government Arts College, Tirupur, India.

Manuscript received on November 15, 2019. | Revised Manuscript received on November 23, 2019. | Manuscript published on November 30, 2019. | PP: 1657-1664 | Volume-8 Issue-4, November 2019. | Retrieval Number: C5206098319/2019©BEIESP | DOI: 10.35940/ijrte.C5206.118419

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Big data is a rapidly growing area of information technology, and it poses tremendous challenges for extracting valuable hidden knowledge. Data mining techniques can be applied to big data to extract knowledge for decision making. Big data is highly heterogeneous because it consists of various inter-related kinds of objects, such as audio, text, and images, and each kind of object carries different information. In this paper, clustering techniques are therefore introduced to separate the objects into several clusters, which also reduces the computational complexity of classifiers. A Possibilistic c-Means (PCM) algorithm was previously introduced to group the objects in big data. PCM effectively reflects the characteristics of each object with respect to different clusters and is able to avoid corruption by noise in the clustering process. However, PCM is not efficient for big data, and it cannot capture the complex correlation over multiple modalities of heterogeneous data objects. So, a Parallel Semi-supervised Multi-Ant Colonies Clustering (PSMACC) algorithm is introduced for big data clustering. Initially, PSMACC splits the data into a number of partitions, and each partition is processed in a mapper. Each mapper generates a diverse collection of three clustering components using the semi-supervised ant colony clustering algorithm with various moving speeds. Then, a hypergraph model is used to combine the three clustering components, and two constraints, Must-Link (ML) and Cannot-Link (CL), are included to form a consensus clustering. Finally, the intermediate results of each mapper are combined in the reducer. However, the iteration overhead in PSMACC is overwhelming, which degrades its performance. So, a Parallel Semi-supervised Multi-Imperialist Competitive Algorithm (PSMICA) is proposed to cluster big data.
In PSMICA, each mapper runs the Imperialist Competitive Algorithm (ICA), in which the members of the initial population are called countries. Some of the best countries in the population are chosen as the imperialists, and the remaining countries form the colonies of these imperialists. The colonies move towards the imperialists based on the distance between them. The intermediate results of each mapper are combined in the reducer to obtain the final clustering result.
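The mapper and reducer roles described above can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions: data points are one-dimensional, a country is a list of `k` candidate cluster centers, the cost of a country is the sum of squared distances from each point to its nearest center, a colony moves by a random fraction (scaled by a hypothetical `beta` parameter) of its distance to the nearest imperialist, and the reducer merges mapper outputs by averaging sorted centers. All function and parameter names (`ica_mapper`, `assimilate`, `reducer`, `beta`) are hypothetical, and the semi-supervised constraints and imperialistic competition are omitted.

```python
import random

def cost(country, points):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in country) for p in points)

def assimilate(colony, imperialist, beta, rng):
    """Move a colony toward its imperialist by a random share of the gap."""
    return [c + beta * rng.random() * (i - c)
            for c, i in zip(colony, imperialist)]

def ica_mapper(points, k=2, n_countries=10, n_imperialists=2,
               beta=2.0, iters=50, seed=0):
    """Run a simplified ICA on one data partition; return sorted centers."""
    rng = random.Random(seed)
    lo, hi = min(points), max(points)
    # Initial population: each country is a random set of k centers.
    countries = [[rng.uniform(lo, hi) for _ in range(k)]
                 for _ in range(n_countries)]
    for _ in range(iters):
        # The best countries become imperialists; the rest are colonies.
        countries.sort(key=lambda c: cost(c, points))
        imperialists = countries[:n_imperialists]
        colonies = countries[n_imperialists:]
        moved = []
        for colony in colonies:
            # Assumed rule: a colony follows its nearest imperialist.
            target = min(
                imperialists,
                key=lambda imp: sum((a - b) ** 2
                                    for a, b in zip(colony, imp)))
            moved.append(assimilate(colony, target, beta, rng))
        countries = imperialists + moved
    best = min(countries, key=lambda c: cost(c, points))
    return sorted(best)

def reducer(center_sets):
    """Average the sorted centers from all mappers (assumed merge rule)."""
    k = len(center_sets[0])
    n = len(center_sets)
    return [sum(cs[j] for cs in center_sets) / n for j in range(k)]

if __name__ == "__main__":
    # Two partitions of a toy data set with groups near 0 and near 10.
    partitions = [[0.0, 0.4, 0.9, 9.6, 10.1], [0.2, 0.7, 9.8, 10.0, 10.3]]
    mapper_outputs = [ica_mapper(part, seed=i)
                      for i, part in enumerate(partitions)]
    print(reducer(mapper_outputs))
```

Because the imperialists are carried over unchanged in every iteration, the best cost within a mapper never increases. In a real MapReduce deployment, each `ica_mapper` call would run on a separate node over its own partition, and only the small center lists would be sent to the reducer.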
Keywords: Big Data Clustering, Parallel Semi-Supervised Multi-Imperialist Competitive Algorithm, Parallel Semi-Supervised Multi-Ant Colonies Clustering, Possibilistic c-Means.
Scope of the Article: Big Data Analytics.