ABSTRACT
Linked Open Data has made available a diversity of scientific collections in which scientists have annotated entities with controlled vocabulary terms (CV terms) from ontologies. These semantic annotations encode scientific knowledge, which is captured in annotation datasets. One can mine these datasets to discover relationships and patterns between entities. Determining the relatedness (or similarity) between entities becomes a building block for graph pattern mining; e.g., identifying drug-drug relationships could depend on the similarity of the diseases (conditions) that are associated with each drug. Diverse similarity metrics have been proposed in the literature, e.g., (i) string-similarity metrics, (ii) path-similarity metrics, and (iii) topological-similarity metrics; all measure relatedness in a given taxonomy or ontology. In this paper, we consider a novel annotation similarity metric, AnnSim, that measures the relatedness between two entities in terms of the similarity of their annotations. We model AnnSim as a 1-to-1 maximal weighted bipartite match, and we exploit properties of existing solvers to provide an efficient solution. We empirically study the effectiveness of AnnSim on real-world datasets of genes and their GO annotations, clinical trials, and a human disease benchmark. Our results suggest that AnnSim can provide a deeper understanding of the relatedness of concepts and can provide an explanation of potential novel patterns.
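To make the bipartite-matching idea concrete, the following is a minimal sketch and not the paper's implementation. It rests on several assumptions that the abstract does not state: the matching score is normalized as twice the matched weight divided by the total number of annotations; the matching is found by exhaustive search, which is only practical for very small annotation sets and stands in for the solver the abstract mentions; and the term-to-term similarity `toy_sim`, along with all its values and the drug annotation lists, is hypothetical.

```python
from itertools import permutations


def best_one_to_one_matching(left, right, sim):
    """Return (score, pairs) of the highest-weight 1-to-1 matching.

    Exhaustive search over assignments: a stand-in for a real
    weighted bipartite matching solver, usable only on tiny inputs.
    """
    swapped = len(left) > len(right)
    small, large = (right, left) if swapped else (left, right)
    best_score, best_pairs = 0.0, []
    for chosen in permutations(large, len(small)):
        # Always report pairs as (left item, right item).
        if swapped:
            pairs = [(l, s) for s, l in zip(small, chosen)]
        else:
            pairs = list(zip(small, chosen))
        score = sum(sim(l, r) for l, r in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


def annsim(left, right, sim):
    """Annotation similarity of two entities from their annotation lists.

    Assumed normalization: 2 * matched weight / (|left| + |right|),
    so two identical annotation sets score 1.0.
    """
    if not left or not right:
        return 0.0
    score, _ = best_one_to_one_matching(left, right, sim)
    return 2.0 * score / (len(left) + len(right))


# Hypothetical term-to-term similarities; a real setting would use a
# taxonomy-based measure over the ontology instead.
TOY_SIM = {frozenset(("angina", "heart failure")): 0.6}


def toy_sim(x, y):
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.1)


# Hypothetical disease annotations for two drugs.
drug_a = ["hypertension", "angina"]
drug_b = ["hypertension", "heart failure", "migraine"]

print(best_one_to_one_matching(drug_a, drug_b, toy_sim))
print(annsim(drug_a, drug_b, toy_sim))
```

Because each annotation is matched at most once, unmatched annotations (here `migraine`) lower the score through the denominator rather than being ignored, which is the main difference from simply averaging the best pairwise similarities.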
Index Terms
- Measuring Relatedness Between Scientific Entities in Annotation Datasets