Abstract
Cancer classification is one of the main steps during patient healing process. This fact enforces modern clinical researchers to use advanced bioinformatics methods for cancer classification. Cancer classification is usually performed using gene expression data gained in microarray experiment and advanced machine learning methods. Microarray experiment generates huge amount of data, and its processing via machine learning methods represents a big challenge. In this study, two-step classification paradigm which merges genetic algorithm feature selection and machine learning classifiers is utilized. Genetic algorithm is built in MapReduce programming spirit which makes this algorithm highly scalable for Hadoop cluster. In order to improve the performance of the proposed algorithm, it is extended into a parallel algorithm which process on microarray data in distributed manner using the Hadoop MapReduce framework. In this paper, the algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor) and its accuracy reached 100% for less than 25 selected features. The proposed cloud computing-based MapReduce parallel genetic algorithm performed well on gene expression data. In addition, the scalability of the suggested algorithm is unlimited because of underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be effectively implemented for real-world microarray data in the cloud environment. In addition, the Hadoop MapReduce framework demonstrates substantial decrease in the computation time.
Similar content being viewed by others
References
Welcome to Apache™ Hadoop®!. http://hadoop.apache.org. Accessed 13 Dec 2015
Mohammed EA, Far BH, Naugler C (2014) Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioData Min 7:22
Coulouris GF, Dollimore J, Kindberg T (2005) Distributed systems: concepts and design. Pearson Education, Upper Saddle River
Apache Spark™—Lightning-Fast Cluster Computing. http://spark.apache.org/. Accessed 24 Apr 2016
Quackenbush J, John Q (2001) Computational genetics: computational analysis of microarray data. Nat Rev Genet 2(6):418–427
Wong T-T, Tzu-Tsung W, Ching-Han H (2008) Two-stage classification methods for microarray data. Expert Syst Appl 34(1):375–383
Lee C-P, Chien-Pang L, Wen-Shin L, Yuh-Min C, Bo-Jein K (2011) Gene selection and sample classification on microarray data based on adaptive genetic algorithm/k-nearest neighbor method. Expert Syst Appl 38(5):4661–4667
Roffo G, Giorgio R, Simone M, Marco C (2015) Infinite Feature Selection, in 2015 IEEE International Conference on Computer Vision (ICCV)
Phuong TM, Lin Z, Altman RB (2005) Choosing SNPs using feature selection. In: Proceedings of IEEE Computational Systems Bioinformatics Conference, pp 301–309
Hong J-H, Jin-Hyuk H, Sung-Bae C (2006) Efficient huge-scale feature selection with speciated genetic algorithm. Pattern Recognit Lett 27(2):143–150
Mohamad MS, Safaai D, Illias RMD (2005) A hybrid of genetic algorithm and support vector machine for features selection and classification of gene expression microarray. Int J Comput Intell Appl 05(01):91–107
Hung C-L, Chen W-P, Hua G-J, Zheng H, Tsai S-JJ, Lin Y-L (2015) Cloud computing-based TagSNP selection algorithm for human genome data. Int J Mol Sci 16(1):1096–1110
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):S1
Schatz MC (2009) CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11):1363–1369
Hung C-L, Lin Y-L (2013) Implementation of a parallel protein structure alignment service on cloud. Int J Genom Proteom 2013:439681
Hung C-L, Hua G-J (2013) Cloud computing for protein-ligand binding site comparison. Biomed Res Int 2013:170356
Gunarathne T (2015) Hadoop MapReduce v2 Cookbook, 2nd edn. Packt Publishing Ltd, Birmingham
Keco D, Subasi A (2012) Parallelization of genetic algorithms using Hadoop Map/Reduce. SouthEast Eur J Soft Comput 1(2):56–59
Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF (2005) GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 74(7–8):491–503
Negnevitsky M (2005) Artificial intelligence: a guide to intelligent systems. Pearson Education, Upper Saddle River
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Lee W-P, Hsiao Y-T, Hwang W-C (2014) Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment. BMC Syst Biol 8:5
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
About this article
Cite this article
Kečo, D., Subasi, A. & Kevric, J. Cloud computing-based parallel genetic algorithm for gene selection in cancer classification. Neural Comput & Applic 30, 1601–1610 (2018). https://doi.org/10.1007/s00521-016-2780-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-016-2780-z