Information Systems

Volume 80, February 2019, Pages 136-147

Computing inter-document similarity with Context Semantic Analysis

https://doi.org/10.1016/j.is.2018.02.009

Abstract

We propose a novel knowledge-based technique for inter-document similarity computation, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g., Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. In fact, our technique relies on a generic RDF knowledge base (e.g., DBpedia or Wikidata) to extract a Semantic Context Vector, a novel model for representing the context of a document, which CSA exploits to compute inter-document similarity effectively. Moreover, we show how CSA can be effectively applied in the Information Retrieval domain. Experimental results show that: (i) for the general task of inter-document similarity, CSA outperforms baselines built on top of traditional methods and achieves performance similar to approaches built on top of specific knowledge bases; (ii) for Information Retrieval tasks, enriching documents with context (i.e., employing the Semantic Context Vector model) improves the result quality of the state-of-the-art technique that employs a similar semantic enrichment.

Introduction

Recent years have seen a growing number of knowledge bases employed in several domains and applications. Besides DBpedia [1], which is the heart of the Linked Open Data (LOD) cloud [2], other important knowledge bases are: Wikidata [3], a collaborative knowledge base; YAGO [4], a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames; SNOMED CT [5], the best-known ontology in the medical domain; and AGROVOC [6], a multilingual agricultural thesaurus we recently used for annotating agricultural resources [7].

In the literature, knowledge-based approaches have been employed to improve existing techniques in the Natural Language Processing (NLP) [8] and Information Retrieval (IR) [9] domains. Yet, there is much room for improvement to effectively exploit these rich models in these fields [10]. For instance, in the context of inter-document similarity, which plays an important role in many NLP and IR applications, classic techniques rely solely on syntactic information and are usually based on Vector Space Models [11], where documents are represented in a vector space having document words as dimensions. Nevertheless, such techniques fail to detect relationships among concepts even in simple scenarios like the following pair of sentences: “The Rolling Stones with the participation of Roger Daltrey opened the concerts’ season in Trafalgar Square” and “The bands headed by Mick Jagger with the leader of The Who played in London last week”. These two sentences contain highly related concepts (e.g., Roger Daltrey is the leader of The Who) which can be found by exploiting the knowledge network encoded within knowledge bases such as DBpedia.
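This failure mode can be reproduced in a few lines: with a plain bag-of-words model, the two example sentences look moderately similar only because of function words, and once stopwords are removed their cosine similarity drops to zero. The tokenizer and stopword list below are deliberate simplifications for illustration.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "in", "with", "by", "a", "an", "and"}

def bow(text, drop_stopwords=False):
    """Bag-of-words term-frequency vector for a sentence."""
    tokens = [t.strip("'.,\u2019") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count/weight vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

s1 = "The Rolling Stones with the participation of Roger Daltrey opened the concerts' season in Trafalgar Square"
s2 = "The bands headed by Mick Jagger with the leader of The Who played in London last week"

print(cosine(bow(s1), bow(s2)))              # similarity driven entirely by function words
print(cosine(bow(s1, True), bow(s2, True)))  # 0.0: no content words in common
```

The content-word vectors share no dimension at all, so a purely syntactic model sees the two sentences as unrelated despite the strong semantic connections between their entities.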

To overcome the limitations of a purely syntactic approach, in [12] we proposed Context Semantic Analysis (CSA), a novel semantic technique for estimating inter-document similarity that leverages the information contained in a knowledge base. One of the main novelties of CSA with respect to other knowledge-based approaches is its applicability to any RDF knowledge base, so that all datasets belonging to the LOD cloud [2] (more than one thousand) can be used. CSA is based on the notion of contextual graph of a document, i.e., a subgraph of the knowledge base that contains the contextual information of the document; the notion of contextual graph is very similar to that of semantic graph defined in [10]. The contextual graph is then suitably weighted to capture the degree of associativity between its concepts, i.e., the degree of relevance of a property for the entities it connects. The vertices of this weighted contextual graph are then ranked with PageRank methods, thus obtaining a Semantic Context Vector, a novel model able to represent the context of the document. The similarity of two documents is then computed by comparing their Semantic Context Vectors with general vector comparison methods, such as cosine similarity. By evaluating our method on a standard benchmark for document similarity (which considers correlation with human judgments), we showed how CSA outperforms almost all other methods and how it can exploit any RDF knowledge base. Moreover, we analyzed its scalability in a clustering task with a large corpus of documents, and showed that our approach outperforms the considered baselines.
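The pipeline just described (weighted contextual graph, PageRank-based vertex ranking, cosine comparison of the resulting Semantic Context Vectors) can be sketched on a toy example. The entity names and edge weights below are invented for illustration; in CSA they come from the RDF knowledge base and its associativity weighting, and a simple weighted power iteration stands in for the PageRank methods used in the paper.

```python
import math

def pagerank(edges, damping=0.85, iters=50):
    """Weighted PageRank over a directed graph given as (src, dst, weight) triples."""
    nodes = {n for s, d, _ in edges for n in (s, d)}
    out_w = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_w[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for s, d, w in edges:
            new[d] += damping * rank[s] * w / out_w[s]
        # redistribute the rank of dangling nodes uniformly
        dangling = sum(rank[n] for n in nodes if out_w[n] == 0.0)
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

def cosine(u, v):
    """Cosine similarity between two sparse vectors keyed by entity."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy weighted contextual graphs of two documents (invented entities/weights).
g1 = [("Mick_Jagger", "The_Rolling_Stones", 0.9),
      ("Roger_Daltrey", "The_Who", 0.8),
      ("The_Rolling_Stones", "London", 0.5)]
g2 = [("The_Who", "London", 0.7),
      ("Roger_Daltrey", "The_Who", 0.8)]

scv1 = pagerank(g1)  # Semantic Context Vector of document 1
scv2 = pagerank(g2)  # Semantic Context Vector of document 2
print(round(cosine(scv1, scv2), 3))
```

Because the two contextual graphs share entities (Roger Daltrey, The Who, London), the Semantic Context Vectors overlap and yield a nonzero similarity, even where the surface vocabularies of the documents would not.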

This paper extends our previous work presented at the SISAP 2016 Conference. The main novel contribution of this extended paper is to test the applicability and effectiveness of Context Semantic Analysis (CSA) in a real-world application domain, namely Information Retrieval (IR). To this purpose, we analyzed the semantic-based approaches recently proposed in the Information Retrieval research community. We found that the most effective and general IR framework adopting semantic enrichment of documents is KE4IR [13]. We studied its layered architecture and improved its performance by including CSA as a new semantic layer. The outcome was positive: we were able to show that KE4IR + CSA outperforms the original KE4IR framework (see Section 5.2).
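As a rough sketch of the idea, a layered framework of this kind scores a query-document pair per layer and combines the layer scores; adding CSA then amounts to contributing one more layer score to the combination. The layer names, scores, and weights below are hypothetical, chosen only to show the mechanics of a weighted combination.

```python
def combined_score(layer_scores, layer_weights):
    """Weighted combination of per-layer similarity scores (each in [0, 1])."""
    total = sum(layer_weights.values())
    return sum(w * layer_scores.get(layer, 0.0)
               for layer, w in layer_weights.items()) / total

# Hypothetical per-layer similarities for one query-document pair;
# "csa" is the extra layer contributed by Context Semantic Analysis.
scores = {"textual": 0.42, "uri": 0.10, "type": 0.25, "csa": 0.80}
weights = {"textual": 0.5, "uri": 0.1, "type": 0.1, "csa": 0.3}
print(combined_score(scores, weights))
```

A document that is semantically close to the query (high "csa" score) can thus be promoted in the ranking even when its textual-layer score is modest.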

The paper is structured as follows. Section 2 describes the related work, while Section 3 is devoted to some preliminaries useful for the rest of the paper. Then, CSA is described in Section 4 and Section 5 shows its evaluation. Finally, Section 6 outlines conclusions and future work.

Section snippets

Related work

Text similarity has been one of the main research areas of recent years due to the wide range of its applications in tasks such as information retrieval, text classification, document clustering, topic detection, etc. [14]. Many techniques have been proposed in this field, but they can be grouped into two main categories, content-based and knowledge-enriched approaches, where the main difference is that the first group uses only the textual information contained in documents while the second one enriches

Inter-document similarity

The state-of-the-art techniques for estimating inter-document similarity are primarily based on Vector Space Models: a document is represented through a bag-of-words feature vector, which contains information about the presence and absence of words in the document, and the similarity between two documents is calculated as the cosine of the angle between the two respective vectors (i.e., their cosine similarity).

Vector Space Models are generally based on a co-occurrence matrix, a way of

Context semantic analysis

In this section we introduce our novel technique for estimating inter-document similarity, called Context Semantic Analysis (CSA), that is based on leveraging the information contained in a generic RDF knowledge base. Given a corpus C of documents and an RDF knowledge graph KB, CSA is composed of the following three steps:

  • 1.

    Contextual Graph Extraction: the Contextual Graph CG(d) containing the contextual information of a document d is extracted from the KB.

  • 2.

    Semantic Context Vectors Generation: the

Evaluation

In this section we evaluate CSA: first, we assess its efficacy by considering the correlation with human judgments; second, we evaluate how CSA performs in a real-world application, employing it in an Information Retrieval framework; third, we analyze its scalability in a clustering task on a large dataset.

All experiments have been performed on a server running Ubuntu 14.04, with 80 GB RAM, and an Intel Xeon E5-2670 v2 @ 2.50 GHz CPU. CSA has been implemented in Python 2.7, and for

Conclusion and future work

In this paper, we proposed Context Semantic Analysis (CSA), a novel knowledge-based technique for estimating inter-document similarity. The technique is based on a Semantic Context Vector, which can be extracted from a knowledge base, stored as metadata of a document, and employed to compute inter-document similarity. We showed the consistency of CSA with respect to human judgments and how it outperforms standard (i.e., syntactic) inter-document similarity methods. Moreover, we obtained

References (40)

  • M. Schuhmacher, et al., Knowledge-based graph document modeling.
  • P.D. Turney, et al., From frequency to meaning: Vector space models of semantics, J. Artif. Int. Res. (2010).
  • F. Benedetti, D. Beneventano, S. Bergamaschi, Context semantic analysis: A knowledge-based technique for computing...
  • F. Corcoglioniti, M. Dragoni, M. Rospocher, A.P. Aprosio, Knowledge extraction for information retrieval, in: The...
  • W.H. Gomaa, et al., A survey of text similarity approaches, Int. J. Comput. Appl. (2013).
  • S.T. Dumais, Latent semantic analysis, Annu. Rev. Inf. Sci. Technol. (2004).
  • S.E. Robertson, et al., The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. (2009).
  • C. Zhai, Statistical language models for information retrieval. A critical review, Found. Trends Inf. Retr. (2008).
  • E. Gabrilovich, et al., Computing semantic relatedness using Wikipedia-based explicit semantic analysis.
  • E. Yeh, et al., WikiWalk: random walks on Wikipedia for semantic relatedness.
