Ontology refinement for improved information retrieval

doi:10.1016/j.ipm.2009.05.008

Information Processing & Management

Volume 46, Issue 4, July 2010, Pages 426-435

Abstract

Ontologies are frequently used in information retrieval; their main applications are query expansion, semantic indexing of documents and the organization of search results. Ontologies provide lexical items, allow conceptual normalization and provide different types of relations. However, how to optimize an ontology for information retrieval tasks is still unclear. In this paper, we use an ontology query model to analyze how useful ontologies are for performing document searches effectively. Moreover, we propose an algorithm to refine ontologies for information retrieval tasks, with preliminary positive results.

Introduction

Ontologies and terminological resources have been used in information retrieval (IR) to provide query expansion terms, to perform semantic indexing of documents or to produce a better organization of retrieved documents. However, these ontologies are usually not optimized for IR tasks.

In this paper we rely on language modeling (Ponte & Croft, 1998), as it provides a formal probabilistic background and interesting retrieval performance. In language modeling, documents are ranked by the probability that the query is generated by the language model of the document. In our work, we rank documents by combining their models with a query model that is based on the topology of the ontology and on a selection of concepts from the ontology (i.e. the query). We study mechanisms for improving ontologies to make them more effective in IR.
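The ranking principle described above (score each document by the probability that its language model generates the query) can be illustrated with a minimal sketch. The unigram model, the Jelinek-Mercer smoothing and the weight `lam` are assumptions chosen for illustration; they are not taken from the paper, whose own method additionally combines the document models with an ontology-based query model.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | document model) for a unigram model.

    Jelinek-Mercer smoothing with the hypothetical weight `lam` is an
    assumption made for this sketch, not the paper's configuration.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_len if coll_len else 0.0
        # Interpolate the document model with the collection model.
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        score += math.log(p)
    return score


def rank(query, docs):
    """Return document ids sorted by decreasing query likelihood."""
    collection = [w for words in docs.values() for w in words]
    scores = {
        doc_id: query_log_likelihood(query, words, collection)
        for doc_id, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection (invented for illustration only).
docs = {
    "d1": "breast cancer treatment outcomes".split(),
    "d2": "lung cancer screening".split(),
    "d3": "protein folding dynamics".split(),
}
ranking = rank(["breast", "cancer"], docs)
```

In this toy collection the document that contains both query words is ranked first, and smoothing keeps documents that lack a query word from receiving a zero probability.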

In the next section, we present related work. Section 3 introduces the ontology query model. Section 4 shows results of this query model. Section 5 presents lexicon cleansing proposals to enhance the quality of our ontology. Section 6 introduces our ontology refinement algorithm and shows the results using the algorithm. Finally, Section 7 presents conclusions and future work.

Related work

The contribution of this paper is related to ontology refinement, to IR heuristics that might be useful for ontology refinement, and to the use of ontologies in information retrieval. The following sections review the main approaches to these topics.

Ontology query model

The main aim of the ontology query model (OQM) (Jimeno-Yepes, Berlanga-Llavori, & Rebholz-Schuhmann, 2009) is to produce an IR query from a set of concepts $C$ selected by a user browsing the ontology.

We define the following sets and functions. $W$ is the set of words in the lexicon and $T$ is the set of terms in the ontology. $LexW(t)$ returns the set of words in $W$ for a given term $t \in T$. For example, the term breast cancer in $T$ is represented by the words breast and cancer in $W$. Terms are grouped
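As a rough illustration of the function $LexW$ defined above, the following sketch maps a term to its set of words. The function name `lex_w`, the lowercasing and the whitespace tokenization are assumptions made for this example; the paper's actual tokenization is not given in this excerpt.

```python
def lex_w(term):
    """Return the set of lexicon words that make up an ontology term.

    Lowercasing and whitespace splitting are assumptions of this sketch.
    """
    return set(term.lower().split())


# Toy term set T (invented for illustration only).
terms = {"breast cancer", "lung cancer"}

# The word set W is then the union of the words of all terms.
lexicon = set().union(*(lex_w(t) for t in terms))
```

Here `lex_w("breast cancer")` yields the two words breast and cancer, matching the example in the text.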

Performance of OQM

In this section we describe the experiments carried out to show the effectiveness of queries generated from a domain ontology.

Lexicon cleansing

This section presents several heuristics that we evaluated extensively in Jimeno-Yepes et al. (2009); see Table 5. The first heuristic (Corpus) consists of removing from the lexicon the terms that are not found in the document collection, i.e. Medline. This heuristic is query independent and removes redundant terms, thereby reducing the size of the lexicon and the noise in it. The second strategy is aimed at finding the specific contexts in which a concept is labeled with a term.
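The Corpus heuristic described above can be sketched as a simple filter over the lexicon. The function names, the toy documents and the matching rule (case-insensitive match of the term as a contiguous word sequence) are assumptions for illustration; the excerpt does not specify how terms are matched against Medline.

```python
def contains_phrase(doc_tokens, term_tokens):
    """True if term_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(term_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == term_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def corpus_filter(lexicon_terms, documents):
    """Keep only the terms that occur in at least one document.

    The filter is query independent: it only looks at the collection.
    """
    tokenized = [doc.lower().split() for doc in documents]
    kept = set()
    for term in lexicon_terms:
        term_tokens = term.lower().split()
        if any(contains_phrase(tokens, term_tokens) for tokens in tokenized):
            kept.add(term)
    return kept


# Toy data (invented for illustration only).
documents = [
    "Breast cancer risk increases with age",
    "Screening for lung cancer",
]
lexicon_terms = {"breast cancer", "lung cancer", "mammary carcinoma"}
kept = corpus_filter(lexicon_terms, documents)
```

In this toy run the term mammary carcinoma is dropped because it never appears in the documents, while the other two terms are kept.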

Ontology refinement

In this section, we explain our refinement algorithm. The section is split into three parts. In the first, we present our refinement approach. Then we present the information extraction implementation used in our work. Finally, we show the results of the algorithm applied to the data sets.

Conclusions

The ontology query model shows interesting performance, and we plan to investigate different parameters and document models to further improve retrieval effectiveness.

The results show that our method has identified missing knowledge in the ontology that is relevant to IR tasks. Thus, the represented knowledge linked to a concept has to be selected carefully to avoid query drift. This outcome was identified earlier in the literature (Voorhees, 1994). The main contribution of

Acknowledgement

This work was funded by the EC within the BOOTStrep (FP6-028099) Project and by the Spanish National Research Program Project TIN2008-01825/TIN.

References (32)

Kiryakov, A., et al. (2004). Semantic annotation indexing and retrieval. Web Semantics: Science Services and Agents on the World Wide Web.
Bai, J., Song, D., Bruza, P., Nie, J., & Cao, G. (2005). Query expansion using term relationships in language models...
Berger, A., & Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of the 22nd annual...
Buckley, C., & Salton, G. (1995). Optimization of relevance feedback weights. In Proceedings of the 18th annual...
Buckley, C., Salton, G., Allan, J., & Singhal, A. (2004). Automatic query expansion using SMART: TREC 3. In Text...
Cao, G., Nie, J., & Bai, J. (2005). Integrating word relationships into language models. In Proceedings of the 28th...
Castells, P., et al. (2007). An adaptation of the vector-space model for ontology-based information retrieval. IEEE Transactions on Knowledge and Data Engineering.
Cimiano, P., Handschuh, S., & Staab, S. (2004). Towards the self-annotating web. In Proceedings of the 13th...
Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
Divoli, A., et al. (2005). BioIE: Extracting informative sentences from the biomedical literature. Bioinformatics.

Efthimiadis, E. (1996). Query expansion. In Martha E. Williams (Ed.), Annual review of information systems and...

Faatz, A., & Steinmetz, R. (2002). Ontology enrichment with texts from the www. In Semantic web mining,...

Hahn, U., & Schnattinger, K. (1998). Towards text knowledge engineering. In Proceedings of the fifteenth national/tenth...

Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. Technical Report...

Hearst, M. (1998). Automated discovery of wordnet relations. In M. Press (Ed.), WordNet: An electronic lexical database...

Jimeno-Yepes, A., & Berlanga-Llavori, R. (2008). Study of named entity recognition in biomedicine: Towards the...
