Machine Learning Methodology in Bioinformatics

Campbell, Colin

doi:10.1007/978-3-642-30574-0_12

Part of the book series: Springer Handbooks (SHB)

Abstract

Machine learning plays a central role in the interpretation of many datasets generated within the biomedical sciences. In this chapter we focus on two core topics within machine learning, supervised and unsupervised learning, and illustrate their application to interpreting these datasets. For supervised learning, we focus on support vector machines (SVMs), which belong to the broader family of kernel-based learning methods. Kernels can be used to encode many different types of data, from continuous and discrete data through to graph and sequence data. Given the variety of data types encountered within bioinformatics, kernel methods are therefore a method of choice in this context. With unsupervised learning we are interested in the discovery of structure within data. We start by considering hierarchical cluster analysis (HCA), given its common usage in this context. We then point out the advantages of Bayesian approaches to unsupervised learning, which range from a principled approach to model selection (how many clusters are present in the data) to confidence measures for the assignment of datapoints to clusters. We outline five case studies illustrating these methods. For supervised learning we consider prediction of disease progression in cancer and protein fold prediction. For unsupervised learning we apply HCA to a small colon cancer dataset and then illustrate the use of Bayesian unsupervised learning applied to breast and lung cancer datasets. Finally we consider network inference, which can be approached as an unsupervised or supervised learning task depending on the data available.
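
As a rough illustration of how a kernel can encode sequence data, the following minimal sketch computes a k-spectrum string kernel, one simple member of the family of sequence kernels. The function names, the toy sequences, and the choice k = 3 are assumptions made for this example and are not taken from the chapter.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared
    # k-mers contribute to the sum.
    return sum(count * ct[kmer] for kmer, count in cs.items())


def gram_matrix(seqs, k=3):
    """Kernel (Gram) matrix holding the kernel value for every pair of sequences."""
    return [[spectrum_kernel(a, b, k) for b in seqs] for a in seqs]


# Toy sequences invented for illustration only.
sequences = ["GATTACA", "GATTAGA", "CCCGGG"]
K = gram_matrix(sequences, k=3)

for row in K:
    print(row)
```

Because each entry is an explicit inner product of count vectors, the resulting matrix is symmetric and positive semidefinite, so it could in principle be passed to any SVM solver that accepts a precomputed kernel matrix.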

Abbreviations

DAG: directed acyclic graph
DNA: deoxyribonucleic acid
EGF: epidermal growth factor
EGFR: epidermal growth factor receptor
EM: expectation-maximization
ERK: extracellular signal-regulated kinase
GA: genetic algorithm
HCA: hierarchical cluster analysis
KL: Kullback–Leibler
LIBSVM: library for support vector machines
LOO: leave-one-out
LPD: latent process decomposition
MAP: maximum a posteriori
MCMC: Markov chain Monte Carlo
MKL: multiple kernel learning
ML: maximum likelihood
MRI: magnetic resonance imaging
ODE: ordinary differential equation
PSD: positive semidefinite
QP: quadratic programming
RNA: ribonucleic acid
SDP: semidefinite programming
SVM: support vector machine
TG: triacylglyceride
TSA: test set accuracy
cDNA: complementary DNA
log: logistic regression

Author information

Authors and Affiliations

Department of Engineering Mathematics, University of Bristol, BS8 1UB, Bristol, UK
Colin Campbell

Corresponding author

Correspondence to Colin Campbell.

Editor information

Editors and Affiliations

KEDRI – Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, 120 Mayoral Drive, 1010, Auckland, New Zealand
Nikola Kasabov

About this chapter

Cite this chapter

Campbell, C. (2014). Machine Learning Methodology in Bioinformatics. In: Kasabov, N. (ed.) Springer Handbook of Bio-/Neuroinformatics. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30574-0_12

DOI: https://doi.org/10.1007/978-3-642-30574-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30573-3
Online ISBN: 978-3-642-30574-0
eBook Packages: Engineering, Engineering (R0)
