Abstract
Computational drug discovery promises to significantly reduce both experimental effort and monetary cost. More generally, predicting the binding of small organic molecules to biological macromolecules has far-reaching implications for a range of problems, including metabolomics. However, predicting the bound structure of a protein–ligand complex along with its affinity has proven to be an enormous challenge. In recent years, machine learning-based methods have proven to be more accurate than older methods, many of which are based on simple linear regression. Nonetheless, there remains room for improvement: these methods are often trained on a small set of features, with a single functional form for any given physical effect, and often with little mention of the rationale for choosing one functional form over another. Moreover, it is not entirely clear why one machine learning method is favored over another. In this work, we undertake a comprehensive effort to develop high-accuracy, machine-learned scoring functions, systematically investigating the effects of machine learning method and choice of features and, when possible, providing insights into the relevant physics using methods that assess feature importance. We show synergism among disparate features, yielding adjusted R² with experimental binding affinities of up to 0.871 on an independent test set and enrichment for native bound structures of up to 0.913. When purely physical terms that model enthalpic and entropic effects are used in the training, we use feature importance assessments to probe the relevant physics, with the aim of guiding future investigators working on this and other computational chemistry problems.
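For readers unfamiliar with the adjusted R² statistic quoted above, the following is a minimal sketch of how it can be computed from predicted and experimental affinities. It assumes the standard textbook definition, 1 - (1 - R²)(n - 1)/(n - p - 1), with n samples and p features; the abstract does not state which variant was used, and the function names and the affinity values below are hypothetical, invented purely for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Assumes the standard definition; penalizes R^2 for the number of
    features p relative to the number of samples n.
    """
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than n_features + 1")
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical affinities (arbitrary units), not data from the paper.
experimental = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [4.5, 4.5, 6.5, 6.5, 8.5, 8.5]

adj_r2 = adjusted_r_squared(experimental, predicted, n_features=2)
print(f"adjusted R^2 = {adj_r2:.3f}")
```

Because the correction factor grows with the number of features, a model with many descriptors needs a correspondingly larger test set for the adjusted value to stay close to the plain R².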
Data availability
Data will be readily provided to the reader upon contact with the author via the indicated email.
Code availability
An online repository is being established.
Acknowledgements
The author would like to thank Julie C. Mitchell for guidance. This work was funded through the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (LOIS ID: 9207). This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Funding
UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE).
Author information
Contributions
Demerdash conceived and planned the research, carried out all calculations, and wrote the manuscript.
Ethics declarations
Conflict of interest
To his knowledge, the author has no conflicts of interest to disclose.
About this article
Cite this article
Demerdash, O.N.A. Using diverse potentials and scoring functions for the development of improved machine-learned models for protein–ligand affinity and docking pose prediction. J Comput Aided Mol Des 35, 1095–1123 (2021). https://doi.org/10.1007/s10822-021-00423-4
DOI: https://doi.org/10.1007/s10822-021-00423-4