Computing inter-document similarity with Context Semantic Analysis
Introduction
Recent years have seen a growing number of knowledge bases employed in several domains and applications. Besides DBpedia [1], which is the heart of the Linked Open Data (LOD) cloud [2], other important knowledge bases are: Wikidata [3], a collaborative knowledge base; YAGO [4], a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames; SNOMED CT [5], the best-known ontology in the medical domain; and AGROVOC [6], a multilingual agricultural thesaurus that we recently used for annotating agricultural resources [7].
In the literature, knowledge-based approaches have been employed to improve existing techniques in the Natural Language Processing (NLP) [8] and Information Retrieval (IR) domains [9]. Yet, there is much room for improvement in order to effectively exploit these rich models in these fields [10]. For instance, in the context of inter-document similarity, which plays an important role in many NLP and IR applications, classic techniques rely solely on syntactic information and are usually based on Vector Space Models [11], where documents are represented in a vector space having document words as dimensions. Nevertheless, such techniques fail to detect relationships among concepts even in simple scenarios like the following sentences: “The Rolling Stones with the participation of Roger Daltrey opened the concerts’ season in Trafalgar Square” and “The bands headed by Mick Jagger with the leader of The Who played in London last week”. These two sentences contain highly related concepts (e.g., Roger Daltrey is the leader of The Who) which can be found by exploiting the knowledge network encoded within knowledge bases such as DBpedia.
To overcome the limitations of a purely syntactic approach, in [12] we proposed Context Semantic Analysis (CSA), a novel semantic technique for estimating inter-document similarity that leverages the information contained in a knowledge base. One of the main novelties of CSA w.r.t. other knowledge-based approaches is its applicability to any RDF knowledge base, so that all datasets belonging to the LOD cloud [2] (more than one thousand) can be used. CSA is based on the notion of contextual graph of a document, i.e., a subgraph of the knowledge base that contains the contextual information of the document; the notion of contextual graph is very similar to that of the semantic graph defined in [10]. The contextual graph is then suitably weighted to capture the degree of associativity between its concepts, i.e., the degree of relevance of a property for the entities it connects. The vertices of this weighted contextual graph are then ranked using PageRank methods, thus obtaining a Semantic Context Vector, a novel model able to represent the context of the document. The similarity of two documents is then computed by comparing their Semantic Context Vectors with general vector comparison methods, such as the cosine similarity. By evaluating our method on a standard benchmark for document similarity (which considers correlation with human judges), we showed that CSA outperforms almost all other methods and that it can exploit any RDF knowledge base. Moreover, we analyzed its scalability in a clustering task over a large corpus of documents and showed that our approach outperforms the considered baselines.
This paper extends our previous work presented at the SISAP 2016 Conference. The main novel contribution of this extended paper is to test the applicability and effectiveness of Context Semantic Analysis (CSA) in a real-world application domain, namely Information Retrieval (IR). To this purpose, we analyzed the semantics-based approaches recently proposed by the Information Retrieval research community. We found that the most effective and general IR framework adopting semantic enrichment of documents is KE4IR [13]. We studied its layered architecture and improved its performance by including CSA as a new semantic layer. The outcome was positive: we were able to show that KE4IR + CSA outperforms the original KE4IR framework (see Section 5.2).
The paper is structured as follows. Section 2 describes the related work, while Section 3 is devoted to some preliminaries useful for the rest of the paper. Then, CSA is described in Section 4 and Section 5 shows its evaluation. Finally, Section 6 outlines conclusions and future work.
Related work
Text similarity has been one of the main research areas of recent years, due to the wide range of its applications in tasks such as information retrieval, text classification, document clustering, topic detection, etc. [14]. In this field many techniques have been proposed, but we can group them into two main categories, content-based and knowledge-enriched approaches; the main difference is that the first group uses only the textual information contained in documents, while the second one enriches the document representation with information drawn from external knowledge sources.
Inter-document similarity
The state-of-the-art techniques for estimating inter-document similarity are primarily based on Vector Space Models: a document is represented through a bag-of-words feature vector, which contains information about the presence and absence of words in the document, and the similarity between two documents is calculated as the cosine of the angle between the two respective vectors (i.e., their cosine similarity).
Vector Space Models are generally based on a co-occurrence matrix, a way of representing how often terms occur together within a given context.
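As a minimal illustration of this purely syntactic baseline, the following sketch computes cosine similarity over raw term-frequency bag-of-words vectors (the two sentences are simplified variants of the motivating example; tokenization is deliberately naive):

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between term-frequency bag-of-words vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The two sentences overlap only on "the", "in", and "london",
# so a purely syntactic model scores them as weakly similar,
# despite the strong semantic relatedness of their entities.
s1 = "the rolling stones played in london"
s2 = "mick jagger led the band in london"
print(round(cosine_similarity(s1, s2), 3))  # 0.463
```

This is exactly the failure mode discussed in the introduction: lexical overlap alone cannot surface the relation between Mick Jagger and The Rolling Stones.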
Context semantic analysis
In this section we introduce our novel technique for estimating inter-document similarity, called Context Semantic Analysis (CSA), which leverages the information contained in a generic RDF knowledge base. Given a corpus of documents and an RDF knowledge graph, CSA is composed of the following three steps:
- 1.
Contextual Graph Extraction: the Contextual Graph containing the contextual information of a document is extracted from the knowledge graph.
- 2.
Semantic Context Vectors Generation: the Contextual Graph is weighted to capture the degree of associativity between its concepts, and its vertices are ranked with PageRank methods, obtaining the Semantic Context Vector of the document.
- 3.
Inter-Document Similarity Computation: the similarity of two documents is computed by comparing their Semantic Context Vectors, e.g., with the cosine similarity.
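The pipeline can be sketched as follows. This is an illustration only: the graph, entity names, and edge weights are made up, and a plain personalized-PageRank power iteration stands in for the weighting and ranking schemes detailed in this section:

```python
import math

def add_edge(kg, a, b, w):
    """Undirected weighted edge in a dict-of-dicts graph."""
    kg.setdefault(a, {})[b] = w
    kg.setdefault(b, {})[a] = w

def semantic_context_vector(kg, seeds, alpha=0.85, iters=100):
    """Rank all vertices of the weighted contextual graph with
    personalized PageRank, restarting from the document's entities."""
    nodes = sorted(kg)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    r = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = sum(kg[n].values())  # total outgoing weight
            for m, w in kg[n].items():
                nxt[m] += alpha * r[n] * w / out
        r = nxt
    return [r[n] for n in nodes]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Toy contextual graph (a tiny stand-in for a knowledge-base subgraph)
kg = {}
add_edge(kg, "Roger_Daltrey", "The_Who", 0.9)
add_edge(kg, "Mick_Jagger", "The_Rolling_Stones", 0.9)
add_edge(kg, "The_Who", "London", 0.4)
add_edge(kg, "The_Rolling_Stones", "London", 0.4)

v1 = semantic_context_vector(kg, {"The_Rolling_Stones", "Roger_Daltrey"})
v2 = semantic_context_vector(kg, {"Mick_Jagger", "The_Who"})
print(round(cosine(v1, v2), 3))  # similarity emerges through the shared graph context
```

Even though the two documents mention disjoint entity sets, their Semantic Context Vectors overlap through the connecting structure of the graph, which is the effect a bag-of-words model cannot capture.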
Evaluation
In this section we evaluate CSA: firstly, we assess CSA efficacy by considering the correlation with human judges; secondly, we evaluate how CSA performs in a real-world application, employing it in an Information Retrieval framework; thirdly, we analyze CSA scalability in a clustering task on a large dataset.
All experiments have been performed on a server running Ubuntu 14.04, with 80 GB of RAM and an Intel Xeon E5-2670 v2 @ 2.50 GHz CPU. CSA has been implemented in Python 2.7.
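The first evaluation axis, agreement with human judges, is typically quantified with a rank correlation coefficient. A minimal pure-Python sketch (the similarity scores below are invented purely for illustration):

```python
import math

def rankdata(xs):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

# Hypothetical data: system similarity scores vs. human gold-standard judgments
system = [0.91, 0.35, 0.70, 0.12, 0.55]
humans = [4.8, 1.9, 3.7, 0.6, 3.1]
print(round(spearman(system, humans), 3))  # 1.0: the two rankings coincide
```

A coefficient close to 1 indicates that the system orders document pairs the same way human annotators do, which is the property the benchmark measures.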
Conclusion and future work
In this paper, we proposed Context Semantic Analysis (CSA), a novel knowledge-based technique for estimating inter-document similarity. The technique is based on a Semantic Context Vector, which can be extracted from a knowledge base, stored as metadata of a document, and employed to compute inter-document similarity. We showed the consistency of CSA with respect to human judges and how it outperforms standard (i.e., syntactic) inter-document similarity methods. Moreover, we obtained positive results by integrating CSA as an additional semantic layer into the KE4IR Information Retrieval framework, where KE4IR + CSA outperforms the original framework.
References
- et al., "Semantic annotation of the Cerealab database by the AGROVOC linked dataset", Ecological Informatics (2015)
- et al., "Semantically enhanced information retrieval: an ontology-based approach", Web Semant. Sci. Serv. Agents World Wide Web (2011)
- et al., "Combining user and database perspective for solving keyword queries over relational databases", Inf. Syst. (2016)
- et al., "DBpedia: a nucleus for a web of open data" (2007)
- et al., "Linked data - the story so far", Int. J. Semantic Web Inf. Syst. (2009)
- et al., "Wikidata: a free collaborative knowledgebase", Commun. ACM (2014)
- et al., "YAGO: a core of semantic knowledge"
- et al., "SNOMED CT: the advanced terminology and coding system for eHealth", Stud. Health Technol. Inform. (2006)
- et al., "The AGROVOC linked dataset", Semantic Web (2013)
- et al., "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles", Bioinformatics (2001)
- "Knowledge-based graph document modeling"
- "From frequency to meaning: vector space models of semantics", J. Artif. Int. Res.
- "A survey of text similarity approaches", Int. J. Comput. Appl.
- "Latent semantic analysis", Annu. Rev. Inf. Sci. Technol.
- "The probabilistic relevance framework: BM25 and beyond", Found. Trends Inf. Retr.
- "Statistical language models for information retrieval: a critical review", Found. Trends Inf. Retr.
- "Computing semantic relatedness using Wikipedia-based explicit semantic analysis"
- "WikiWalk: random walks on Wikipedia for semantic relatedness"