Applying Data Mining for the Analysis of Breast Cancer Data

Liou, Der-Ming; Chang, Wei-Pin

doi:10.1007/978-1-4939-1985-7_12


Part of the book series: Methods in Molecular Biology (MIMB, volume 1246)


Abstract

Data mining, also known as knowledge discovery in databases (KDD), is the process of automatically searching large volumes of data for patterns. For instance, a clinical pattern might indicate that women who have diabetes or hypertension are more likely to suffer a stroke within the next 5 years. A physician can then learn valuable knowledge from the data mining process. Here, we present a study investigating the application of artificial intelligence and data mining techniques to prediction models of breast cancer. An artificial neural network, a decision tree, logistic regression, and a genetic algorithm were compared, with the accuracy and positive predictive value of each algorithm used as the evaluation indicators. The analysis incorporated 699 records acquired from breast cancer patients at the University of Wisconsin, with nine predictor variables and one outcome variable, and was evaluated by tenfold cross-validation. The results revealed that the accuracy of the logistic regression model was 0.9434 (sensitivity 0.9716, specificity 0.9482), of the decision tree model 0.9434 (sensitivity 0.9615, specificity 0.9105), of the neural network model 0.9502 (sensitivity 0.9628, specificity 0.9273), and of the genetic algorithm model 0.9878 (sensitivity 1, specificity 0.9802). The accuracy of the genetic algorithm was significantly higher than the average predicted accuracy of 0.9612. The predicted outcome of the logistic regression model was higher than that of the neural network model, but no significant difference was observed. The average predicted accuracy of the decision tree model was 0.9435, the lowest of all four predictive models. The standard deviation of the tenfold cross-validation was rather unreliable.
This study indicated that the genetic algorithm model yielded better results than the other data mining models in the analysis of breast cancer patient data, in terms of both the overall accuracy of patient classification and the expression and complexity of the classification rule. The genetic algorithm described in the present study was able to classify breast cancer data accurately, and the classification rule it identified was more acceptable and comprehensible.
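The evaluation indicators named in the abstract (accuracy, sensitivity, specificity, positive predictive value) and the tenfold partitioning of the 699 records can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the random seed, and the confusion-matrix counts are hypothetical and chosen only for illustration, and no classifier is trained here.

```python
import random
from typing import Dict, List, Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (each class occurs at least once).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
    }


def kfold_splits(n_records: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle record indices and return k (train, test) index pairs.

    Each record lands in exactly one test fold; fold sizes differ by at most one.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Hypothetical counts for one held-out fold (NOT taken from the study).
m = classification_metrics(tp=23, fp=1, tn=45, fn=1)

# Tenfold partition of 699 records, the data set size stated in the abstract.
splits = kfold_splits(699)
```

In a full comparison, each model would be fitted on the `train` indices of every split and scored on the matching `test` indices, and the per-fold metrics would then be averaged.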



Author information

Authors and Affiliations

Yang Ming University, No 155, Sec. 2, Li-Nong St., Taipei, 112, Taiwan R.O.C.
Der-Ming Liou & Wei-Pin Chang


Corresponding authors

Correspondence to Der-Ming Liou or Wei-Pin Chang.

Editor information

Editors and Affiliations

Instituto Itaca, Universitat Politècnica de València, Valencia, Spain
Carlos Fernández-Llatas
Instituto Itaca, Universitat Politècnica de València, Valencia, Spain
Juan Miguel García-Gómez


Cite this protocol

Liou, DM., Chang, WP. (2015). Applying Data Mining for the Analysis of Breast Cancer Data. In: Fernández-Llatas, C., García-Gómez, J. (eds) Data Mining in Clinical Medicine. Methods in Molecular Biology, vol 1246. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1985-7_12


DOI: https://doi.org/10.1007/978-1-4939-1985-7_12
Published: 05 November 2014
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-1984-0
Online ISBN: 978-1-4939-1985-7
eBook Packages: Springer Protocols
