Using diverse potentials and scoring functions for the development of improved machine-learned models for protein–ligand affinity and docking pose prediction

Journal of Computer-Aided Molecular Design

Abstract

Computational drug discovery promises to reduce significantly both the effort of experimentalists and the monetary cost of their work. More generally, predicting the binding of small organic molecules to biological macromolecules has far-reaching implications for a range of problems, including metabolomics. However, predicting the bound structure of a protein–ligand complex along with its affinity has proven to be an enormous challenge. In recent years, machine learning-based methods have proven more accurate than older methods, many of which are based on simple linear regression. Nonetheless, there remains room for improvement: these methods are often trained on a small set of features, use a single functional form for any given physical effect, and give little rationale for choosing one functional form over another. Moreover, it is not entirely clear why one machine learning method is favored over another. In this work, we undertake a comprehensive effort to develop high-accuracy, machine-learned scoring functions, systematically investigating the effects of the machine learning method and the choice of features and, when possible, providing insights into the relevant physics using methods that assess feature importance. We show synergism among disparate features, yielding an adjusted R² with experimental binding affinities of up to 0.871 on an independent test set and enrichment for native bound structures of up to 0.913. When only purely physical terms that model enthalpic and entropic effects are used in training, we use feature importance assessments to probe the relevant physics, in the hope of guiding future investigators working on this and other computational chemistry problems.
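For readers unfamiliar with the adjusted R² quoted above, the following is a minimal sketch of the textbook definition, which penalizes the ordinary coefficient of determination for the number of features in the model. The function names, the argument `n_features`, and the use of this particular formula are assumptions made for illustration; the abstract does not state exactly how the reported value was computed.

```python
def r_squared(y_true, y_pred):
    """Ordinary coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """Textbook adjusted R^2 (assumed form, not confirmed by the source):
    1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of samples and p the number of features.
    """
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
```

Because the correction factor (n - 1) / (n - p - 1) grows with the number of features p, a model with many features must fit the test data better to reach the same adjusted value. This matters when comparing feature sets of different sizes, as the study does.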

Data availability

Data will be readily provided to the reader upon contact with the author via the indicated email.

Code availability

An online repository is being established.

Acknowledgements

The author would like to thank Julie C. Mitchell for guidance. This work was funded through the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (LOIS ID: 9207). This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funding

UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE).

Author information

Contributions

Demerdash conceived and planned the research, carried out all calculations, and wrote the manuscript.

Corresponding author

Correspondence to Omar N. A. Demerdash.

Ethics declarations

Conflict of interest

To his knowledge, the author has no conflicts of interest to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 35 KB)

About this article

Cite this article

Demerdash, O.N.A. Using diverse potentials and scoring functions for the development of improved machine-learned models for protein–ligand affinity and docking pose prediction. J Comput Aided Mol Des 35, 1095–1123 (2021). https://doi.org/10.1007/s10822-021-00423-4

  • DOI: https://doi.org/10.1007/s10822-021-00423-4
