research-article

Author name disambiguation in MEDLINE

Authors:
Vetle I. Torvik, University of Illinois at Chicago, Chicago, IL, USA
Neil R. Smalheiser, University of Illinois at Chicago, Chicago, IL, USA

ACM Transactions on Knowledge Discovery from Data, Volume 3, Issue 3, Article No. 11, pp. 1–29
https://doi.org/10.1145/1552303.1552304

Published: 28 July 2009

Abstract

Background: We recently described “Author-ity,” a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE.

Methods: Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals.

Results: Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ∼98.8%. Lumping (putting two different individuals into the same cluster) affects ∼0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ∼2% of articles.

Impact: The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior.

Availability: The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
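
The pipeline outlined in the Methods and Results can be illustrated with a small sketch. It is not the authors' implementation: the function names (block_key, transitivity_violations, greedy_cluster) and the pairwise probabilities in example_p are hypothetical, the weighted least squares correction is not reproduced, and the clustering step assumes pairwise decisions are independent and merges greedily. The transitivity test relies only on a general probability bound: if articles i and j share an author and j and k share an author, then i and k must as well, so p_ik cannot be smaller than p_ij + p_jk - 1.

```python
import math
from itertools import combinations


def block_key(last_name, first_name):
    """Blocking key: last name plus first initial, lowercased (assumed normalization)."""
    return (last_name.strip().lower(), first_name.strip().lower()[:1])


def _prob(p, a, b):
    """Look up a symmetric pairwise probability stored under the key (min, max)."""
    return p[(a, b)] if a < b else p[(b, a)]


def transitivity_violations(p, n, tol=1e-9):
    """Return triplets (a, m, b) for which p[a,m] + p[m,b] - 1 > p[a,b]."""
    bad = []
    for i, j, k in combinations(range(n), 3):
        # Try each of the three articles as the "middle" one.
        for a, m, b in ((i, j, k), (j, i, k), (i, k, j)):
            if _prob(p, a, m) + _prob(p, m, b) - 1 > _prob(p, a, b) + tol:
                bad.append((a, m, b))
    return bad


def greedy_cluster(p, n):
    """Greedy agglomerative clustering on pairwise match probabilities.

    Under an independence assumption, the log-likelihood of a partition is
    sum(log p) over same-cluster pairs plus sum(log(1 - p)) over the rest,
    so merging two clusters changes it by the sum of log-odds across them.
    Merge the pair with the largest positive gain until no merge helps.
    """
    def log_odds(a, b):
        q = min(max(_prob(p, a, b), 1e-12), 1 - 1e-12)  # avoid log(0)
        return math.log(q / (1 - q))

    clusters = [{i} for i in range(n)]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for x, y in combinations(range(len(clusters)), 2):
            gain = sum(log_odds(a, b) for a in clusters[x] for b in clusters[y])
            if gain > best_gain:
                best_gain, best_pair = gain, (x, y)
        if best_pair is None:
            break
        x, y = best_pair  # x < y, so deleting y leaves index x valid
        clusters[x] |= clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical probabilities for four articles within one name block.
example_p = {
    (0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.2,
    (0, 3): 0.05, (1, 3): 0.05, (2, 3): 0.05,
}

if __name__ == "__main__":
    print(block_key("Torvik", "Vetle I."))
    print(transitivity_violations(example_p, 4))
    print(greedy_cluster(example_p, 4))
```

In this toy input, articles 1 and 2 each match article 0 strongly but match each other weakly, which is a transitivity violation. The greedy merge still places all three in one cluster because the combined log-odds are positive, while article 3 stays separate.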

Reviews

Reviewer: Quinsulon Israel

Solving the difficult problem of author name disambiguation will help greatly with social networking analysis and determining an individual author's "image." Partitioning articles along multiple dimensions (such as name derivation, email address, coauthor identification, self-citations, and research areas) has shown that articles can be accurately identified as belonging to the correct person, even when the name of the individual on those separate articles is derived or incomplete. As a byproduct of this research area, the links between these partitions (individual authors) and the links between the works within these partitions can be studied for further conjectures about an author's collaborative research and relationships.

Torvik and Smalheiser provide readers with a thorough understanding of the problem, the solutions, and the issues in between, including a clear presentation of their methodology. They use many methods that are becoming fundamental, such as similarity analysis of multidimensional vectors. However, although the paper is very well organized, it suffers greatly from conceptual overload, so that the many statistics and discoveries are lost in the dense text. The paper needs more adequate listings, tables, and diagrams to help readers better conceptualize the reported findings.

The authors' ingenuity is apparent in their robust and valid dataset. The freely available dataset performs well against other author disambiguation resources. Unfortunately, the authors heavily cite their own work, with few references to other approaches and research. Despite the paper's weaknesses, I still recommend it to those who are interested in author name disambiguation; those who are looking for an excellent dataset; and those who are interested in similarity measures and experimenting with author metadata.

Online Computing Reviews Service

Published in

ACM Transactions on Knowledge Discovery from Data Volume 3, Issue 3
July 2009
122 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1552303

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 July 2009
- Accepted: 1 March 2009
- Revised: 1 February 2009
- Received: 1 July 2007
Author Tags
Name disambiguation
bibliographic databases
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 186
- Total Downloads: 1,923
- Downloads (Last 12 months): 70
- Downloads (Last 6 weeks): 7