Applying Data Mining for the Analysis of Breast Cancer Data

  • Protocol
Data Mining in Clinical Medicine

Part of the book series: Methods in Molecular Biology (MIMB, volume 1246)

Abstract

Data mining, also known as knowledge discovery in databases (KDD), is the process of automatically searching large volumes of data for patterns. For instance, a clinical pattern might indicate that a woman who has diabetes or hypertension is more likely to suffer a stroke within the next 5 years. A physician can then draw valuable knowledge from the data mining process. Here, we present a study of the application of artificial intelligence and data mining techniques to prediction models for breast cancer. An artificial neural network, a decision tree, logistic regression, and a genetic algorithm were compared, with the accuracy and positive predictive value of each algorithm used as the evaluation indicators. The data set comprised 699 records acquired from breast cancer patients at the University of Wisconsin, with nine predictor variables and one outcome variable, and the models were evaluated by tenfold cross-validation. The accuracy of the logistic regression model was 0.9434 (sensitivity 0.9716, specificity 0.9482), that of the decision tree model 0.9434 (sensitivity 0.9615, specificity 0.9105), that of the neural network model 0.9502 (sensitivity 0.9628, specificity 0.9273), and that of the genetic algorithm model 0.9878 (sensitivity 1, specificity 0.9802). The accuracy of the genetic algorithm was significantly higher than the average predicted accuracy of 0.9612. The predicted outcome of the logistic regression model was higher than that of the neural network model, but the difference was not significant. The average predicted accuracy of the decision tree model was 0.9435, the lowest of the four predictive models. The standard deviation of the tenfold cross-validation was rather unreliable.
This study indicated that the genetic algorithm model yielded better results than the other data mining models for the analysis of breast cancer patient data, in terms of overall classification accuracy and of the expression and complexity of the classification rule. The genetic algorithm described in the present study produced accurate results in the classification of breast cancer data, and the classification rule it identified was more acceptable and comprehensible.
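The abstract names its evaluation indicators and the tenfold cross-validation but does not give formulas or code. The following is a minimal sketch, not the authors' implementation, of how those indicators are conventionally computed from a binary confusion matrix and how 699 records can be partitioned into ten folds. The function names are hypothetical, and treating label 1 as the positive (malignant) class is an assumption made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn); label 1 is assumed to be the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an empty denominator yields NaN instead of raising."""
    return numerator / denominator if denominator else float("nan")


def evaluation_indicators(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard textbook definitions of the four indicators."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
    }


def kfold_indices(n_records: int, k: int = 10) -> List[List[int]]:
    """Split indices 0..n_records-1 into k interleaved folds (sizes differ by at most 1)."""
    return [list(range(start, n_records, k)) for start in range(k)]


if __name__ == "__main__":
    # Toy labels, invented purely to exercise the functions.
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    print(evaluation_indicators(*confusion_counts(y_true, y_pred)))

    folds = kfold_indices(699, k=10)
    print([len(fold) for fold in folds])
```

In each round of cross-validation one fold serves as the test set and the remaining nine as the training set, and the indicators are averaged over the ten rounds. The interleaved split here is deterministic for simplicity; in practice the records would usually be shuffled, and often stratified by outcome, before being assigned to folds.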

Author information

Correspondence to Der-Ming Liou or Wei-Pin Chang.

Copyright information

© 2015 Springer Science+Business Media, New York

About this protocol

Cite this protocol

Liou, DM., Chang, WP. (2015). Applying Data Mining for the Analysis of Breast Cancer Data. In: Fernández-Llatas, C., García-Gómez, J. (eds) Data Mining in Clinical Medicine. Methods in Molecular Biology, vol 1246. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1985-7_12

  • DOI: https://doi.org/10.1007/978-1-4939-1985-7_12

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-1984-0

  • Online ISBN: 978-1-4939-1985-7

  • eBook Packages: Springer Protocols