Abstract
We present some new ideas for characterizing and comparing large chemical databases. The comparison of the contents of large databases is not trivial, since it implies pairwise comparison of hundreds of thousands of compounds. We have developed methods for categorizing compounds into groups or series based on their ring-system content, using precalculated structure-based hashcodes. Two large databases can then be compared by simply comparing their hashcode tables. Furthermore, the number of distinct ring-system combinations can be used as an indicator of database diversity. We also present an independent technique for diversity assessment called the 'saturation diversity' approach. This method is based on picking as many mutually dissimilar compounds as possible from a database or a subset thereof. We show that both methods yield similar results. Since the two methods measure very different properties, this probably says more about the properties of the databases studied than about the methods.
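The two ideas summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hashcode scheme or the dissimilarity measure, so the sketch assumes that each compound is already reduced to a list of ring-system labels (a `frozenset` of labels stands in for the precalculated hashcode) and that dissimilarity is judged by a Tanimoto coefficient on sets of integer features with a user-chosen threshold. All function names here are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Sequence, Set


def hash_table(database: Iterable[Iterable[str]]) -> Counter:
    """Count compounds per ring-system combination.

    Assumption: a frozenset of ring-system labels stands in for the
    precalculated structure-based hashcode mentioned in the abstract.
    """
    return Counter(frozenset(rings) for rings in database)


def compare_tables(a: Counter, b: Counter) -> Dict[str, int]:
    """Compare two databases by their ring-combination keys only."""
    keys_a, keys_b = set(a), set(b)
    return {
        "shared": len(keys_a & keys_b),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }


def tanimoto(x: Set[int], y: Set[int]) -> float:
    """Tanimoto coefficient on feature sets (an assumed similarity measure)."""
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def saturation_pick(features: Sequence[Set[int]], max_similarity: float) -> List[int]:
    """Greedily pick compounds that are mutually dissimilar.

    A compound is kept only if its similarity to every compound already
    picked is at most max_similarity. The number of picks is a rough
    diversity indicator in the spirit of the 'saturation diversity' idea.
    """
    picked: List[int] = []
    for i, candidate in enumerate(features):
        if all(tanimoto(candidate, features[j]) <= max_similarity for j in picked):
            picked.append(i)
    return picked


# Toy data (made up for illustration only).
db_a = [["benzene"], ["benzene", "pyridine"], ["pyridine", "benzene"], ["indole"]]
db_b = [["benzene"], ["furan"]]
table_a, table_b = hash_table(db_a), hash_table(db_b)
overlap = compare_tables(table_a, table_b)
distinct_a = len(table_a)  # number of distinct ring-system combinations

toy_features = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}, {20}]
picks = saturation_pick(toy_features, max_similarity=0.5)
```

Comparing the two tables touches only the distinct ring-system combinations, not every pair of compounds, which is the point of the hashcode approach. The greedy pick depends on the order of the input and on the threshold, so the count it returns is an indicator rather than an exact maximum.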
Cite this article
Nilakantan, R., Bauman, N. & Haraki, K.S. Database diversity assessment: New ideas, concepts, and tools. J Comput Aided Mol Des 11, 447–452 (1997). https://doi.org/10.1023/A:1007937308615
DOI: https://doi.org/10.1023/A:1007937308615