research-article

Author name disambiguation in MEDLINE

Published: 28 July 2009
Abstract

Background: We recently described “Author-ity,” a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE.
Methods: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals.
Results: Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ∼98.8%. Lumping (putting two different individuals into the same cluster) affects ∼0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ∼2% of articles.
Impact: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior.
Availability: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
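
The Methods steps (d) and (e) can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the violation test (p_ij + p_jk - 1 > p_ik), the greedy log-odds merge rule, and all names and toy probabilities below are assumptions chosen for illustration. The sketch only detects transitivity violations; it does not perform the weighted least squares correction.

```python
import math
from itertools import combinations

def pair(p, a, b):
    """Look up the pairwise same-author probability for articles a and b."""
    return p[(min(a, b), max(a, b))]

def transitivity_violations(p, n, tol=1e-9):
    """List triplets (a, b, c) where a~b and b~c are likely but a~c is not.

    Assumed criterion: p_ab + p_bc - 1 > p_ac.
    """
    found = []
    for i, j, k in combinations(range(n), 3):
        # Try each article of the triplet as the shared "middle" article b.
        for a, b, c in ((i, j, k), (j, i, k), (i, k, j)):
            if pair(p, a, b) + pair(p, b, c) - 1 > pair(p, a, c) + tol:
                found.append((a, b, c))
    return found

def log_odds(prob, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    prob = min(max(prob, eps), 1 - eps)
    return math.log(prob / (1 - prob))

def agglomerate(p, n):
    """Greedy likelihood-based agglomerative clustering (simplified).

    A partition is scored by the product of p over same-cluster pairs and
    (1 - p) over different-cluster pairs. Merging two clusters changes the
    log-likelihood by the sum of log-odds over their cross pairs, so the
    best positive-gain merge is applied until none remains.
    """
    clusters = [{i} for i in range(n)]
    while len(clusters) > 1:
        best_gain, best = 0.0, None
        for x, y in combinations(range(len(clusters)), 2):
            gain = sum(log_odds(pair(p, i, j))
                       for i in clusters[x] for j in clusters[y])
            if gain > best_gain:
                best_gain, best = gain, (x, y)
        if best is None:
            break
        x, y = best  # x < y, so deleting y leaves index x valid
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters

# Hypothetical pairwise probabilities for four articles sharing one name.
p = {(0, 1): 0.95, (0, 2): 0.90, (1, 2): 0.20,
     (0, 3): 0.05, (1, 3): 0.10, (2, 3): 0.02}
violations = transitivity_violations(p, 4)
clusters = agglomerate(p, 4)
```

On this toy input, article 0 matches both 1 and 2 strongly while 1 and 2 match each other weakly, which is flagged as a violation; the merge rule still groups articles 0, 1, and 2 together and leaves article 3 on its own. If article 3 had in fact been written by the same person, that outcome would be a splitting error in the abstract's terms.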

            Reviews

            Quinsulon Israel

            Solving the difficult problem of author name disambiguation will help greatly with social network analysis and with determining an individual author's "image." Partitioning articles along multiple dimensions (such as name derivation, email address, coauthor identification, self-citations, and research areas) has shown that articles can be accurately identified as belonging to the correct person, even when the name of the individual on those separate articles is derived or incomplete. As a byproduct of this research area, the links between these partitions (individual authors) and the links between the works within these partitions can be studied for further conjectures about an author's collaborative research and relationships. Torvik and Smalheiser provide readers with a thorough understanding of the problem, the solutions, and the issues in between, including a clear presentation of their methodology. They use many methods that are becoming fundamental, such as similarity analysis of multidimensional vectors. However, although the paper is very well organized, it suffers greatly from conceptual overload, so that the many statistics and discoveries are lost in the dense text. The paper needs more listings, tables, and diagrams to help readers better conceptualize the reported findings. The authors' ingenuity is apparent in their robust and valid dataset. The freely available dataset performs well against other author disambiguation resources. Unfortunately, the authors heavily cite their own work, with few references to other approaches and research. Despite the paper's weaknesses, I still recommend it to those who are interested in author name disambiguation; those who are looking for an excellent dataset; and those who are interested in similarity measures and experimenting with author metadata.

            Online Computing Reviews Service

            • Published in

              ACM Transactions on Knowledge Discovery from Data  Volume 3, Issue 3
              July 2009
              122 pages
              ISSN:1556-4681
              EISSN:1556-472X
              DOI:10.1145/1552303

              Copyright © 2009 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 28 July 2009
              • Accepted: 1 March 2009
              • Revised: 1 February 2009
              • Received: 1 July 2007

              Qualifiers

              • research-article
              • Research
              • Refereed
