Machine Learning Based Big Data Processing Framework for Cancer Diagnosis Using Hidden Markov Model and GM Clustering

Wireless Personal Communications

Abstract

A change in DNA copy number is a form of genetic variation in the human genome, and it is linked with the progression of many emerging diseases. DNA copy number change is also an essential measure for diagnosing cancer. Array-based Comparative Genomic Hybridization (CGH) is a major technique for measuring DNA copy number change across the genome, and it is widely used for continuous monitoring of the copy number of thousands of genes throughout the genome. Next-generation sequencing is an important method for studying the spread of infectious disease qualitatively and quantitatively. In recent years, DNA sequence data sets have become very large, so a scalable machine learning approach is needed to address the issues that arise in detecting DNA copy number change. In this paper, we use a Bayesian hidden Markov model (HMM) with a Gaussian Mixture (GM) clustering approach to model DNA copy number change across the genome. The proposed approach is compared with existing approaches: the Pruned Exact Linear Time (PELT) method, the binary segmentation method, and the segment neighborhood method. Experimental results demonstrate the effectiveness of the proposed change detection algorithm.
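To make the idea of HMM-based copy number change detection concrete, the following is a minimal sketch and not the authors' implementation. It assumes three hidden states (loss, neutral, gain) with Gaussian emissions over log2 ratios, and it uses fixed, hypothetical values for the state means, the shared standard deviation, and the probability of staying in a state. The paper instead infers its model with a Bayesian HMM and GM clustering. The function names `viterbi_states` and `change_points` and the sample data are illustrative only.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_states(obs, means, std, stay_prob):
    """Most likely hidden-state path for a Gaussian-emission HMM.

    Assumptions (hypothetical, not from the paper): at least two states,
    a uniform initial distribution, one shared std, and a transition matrix
    that stays with probability stay_prob and otherwise moves uniformly.
    """
    k = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (k - 1))

    def trans(i, j):
        return log_stay if i == j else log_move

    # Best log score of any path ending in each state at the first probe.
    score = [math.log(1.0 / k) + gaussian_logpdf(obs[0], m, std) for m in means]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j)
                             + gaussian_logpdf(x, means[j], std))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def change_points(path):
    """Indices where the decoded state differs from the previous one."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


# Made-up log2 ratios along a chromosome: neutral, gain, neutral, loss.
log_ratios = [0.02, -0.05, 0.01, 0.03, 0.61, 0.55, 0.58, 0.63,
              -0.02, 0.04, -0.61, -0.57, -0.60]
path = viterbi_states(log_ratios, means=[-0.6, 0.0, 0.6], std=0.1, stay_prob=0.95)
cps = change_points(path)
print(path)
print(cps)
```

A high `stay_prob` penalizes frequent state switches, so isolated noisy probes are less likely to be reported as change points. In practice the emission and transition parameters would be estimated from the data rather than fixed by hand.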

Author information

Corresponding author

Correspondence to Gunasekaran Manogaran.

About this article

Cite this article

Manogaran, G., Vijayakumar, V., Varatharajan, R. et al. Machine Learning Based Big Data Processing Framework for Cancer Diagnosis Using Hidden Markov Model and GM Clustering. Wireless Pers Commun 102, 2099–2116 (2018). https://doi.org/10.1007/s11277-017-5044-z

  • DOI: https://doi.org/10.1007/s11277-017-5044-z
