Computing inter-document similarity with Context Semantic Analysis
Introduction
Recent years have seen a growing number of knowledge bases employed in several domains and applications. Besides DBpedia [1], which is the heart of the Linked Open Data (LOD) cloud [2], other important knowledge bases are: Wikidata [3], a collaborative knowledge base; YAGO [4], a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames; SNOMED CT [5], the best-known ontology in the medical domain; and AGROVOC [6], a multilingual agricultural thesaurus that we recently used for annotating agricultural resources [7].
In the literature, knowledge-based approaches have been employed to improve existing techniques in the Natural Language Processing (NLP) [8] and Information Retrieval (IR) domains [9]. Yet, there is much room for improvement in order to effectively exploit these rich models in these fields [10]. For instance, in the context of inter-document similarity, which plays an important role in many NLP and IR applications, classic techniques rely solely on syntactic information and are usually based on Vector Space Models [11], where documents are represented in a vector space having document words as dimensions. Nevertheless, such techniques fail to detect relationships among concepts even in simple scenarios like the following sentences: “The Rolling Stones with the participation of Roger Daltrey opened the concerts’ season in Trafalgar Square” and “The bands headed by Mick Jagger with the leader of The Who played in London last week”. These two sentences contain highly related concepts (e.g., Roger Daltrey is the leader of The Who) which can be found by exploiting the knowledge network encoded within knowledge bases such as DBpedia.
To overcome the limitations of a purely syntactic approach, in [12] we proposed Context Semantic Analysis (CSA), a novel semantic technique for estimating inter-document similarity that leverages the information contained in a knowledge base. One of the main novelties of CSA w.r.t. other knowledge-based approaches is its applicability to any RDF knowledge base, so that all datasets belonging to the LOD cloud [2] (more than one thousand) can be used. CSA is based on the notion of contextual graph of a document, i.e., a subgraph of the knowledge base that contains the contextual information of the document; the notion of contextual graph is very similar to that of the semantic graph defined in [10]. The contextual graph is then suitably weighted to capture the degree of associativity between its concepts, i.e., the degree of relevance of a property for the entities it connects. The vertices of this weighted contextual graph are then ranked using PageRank methods, thus obtaining a Semantic Context Vector, a novel model able to represent the context of the document. The similarity of two documents is then computed by comparing their Semantic Context Vectors with general vector comparison methods, such as the cosine similarity. By evaluating our method on a standard benchmark for document similarity (which considers correlation with human judges), we showed that CSA outperforms almost all other methods and that it can exploit any RDF knowledge base. Moreover, we analyzed its scalability in a clustering task over a large corpus of documents and showed that our approach outperforms the considered baselines.
This paper extends our previous work presented at the SISAP 2016 Conference. The main novel contribution of this extended paper is to test the applicability and effectiveness of Context Semantic Analysis (CSA) in a real-world application domain, namely Information Retrieval (IR). To this purpose, we analyzed the semantics-based approaches recently proposed by the Information Retrieval research community. We found that the most effective and general IR framework adopting semantic enrichment of documents is KE4IR [13]. We studied its layered architecture and improved its performance by including CSA as a new semantic layer. The outcome was positive: we were able to show that KE4IR + CSA outperforms the original KE4IR framework (see Section 5.2).
The paper is structured as follows. Section 2 describes the related work, while Section 3 is devoted to some preliminaries useful for the rest of the paper. Then, CSA is described in Section 4 and Section 5 shows its evaluation. Finally, Section 6 outlines conclusions and future work.
Related work
Text similarity has been one of the main research areas of recent years, due to the wide range of its applications in tasks such as information retrieval, text classification, document clustering, topic detection, etc. [14]. In this field many techniques have been proposed, but we can group them into two main categories, content-based and knowledge-enriched approaches; the main difference is that the first group uses only the textual information contained in documents, while the second one enriches the document representation with information drawn from external knowledge sources.
Inter-document similarity
The state-of-the-art techniques for estimating inter-document similarity are primarily based on Vector Space Models: a document is represented through a bag-of-words feature vector, which contains information about the presence and absence of words in the document, and the similarity between two documents is calculated as the cosine of the angle between the two respective vectors (i.e., their cosine similarity).
Vector Space Models are generally based on a co-occurrence matrix, a way of representing how often terms occur together within a given context.
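As a minimal illustration of this purely syntactic baseline, the following sketch computes cosine similarity over raw term-frequency bag-of-words vectors (the two sentences are simplified variants of the motivating example; tokenization is deliberately naive):

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between term-frequency bag-of-words vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The two sentences overlap only on "the", "in", and "london",
# so a purely syntactic model scores them as weakly similar,
# despite the strong semantic relatedness of their entities.
s1 = "the rolling stones played in london"
s2 = "mick jagger led the band in london"
print(round(cosine_similarity(s1, s2), 3))  # 0.463
```

This is exactly the failure mode discussed in the introduction: lexical overlap alone cannot surface the relation between Mick Jagger and The Rolling Stones.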
Context semantic analysis
In this section we introduce our novel technique for estimating inter-document similarity, called Context Semantic Analysis (CSA), which leverages the information contained in a generic RDF knowledge base. Given a corpus of documents and an RDF knowledge graph, CSA is composed of the following three steps:
- 1.
Contextual Graph Extraction: the Contextual Graph containing the contextual information of a document is extracted from the knowledge graph.
- 2.
Semantic Context Vectors Generation: the Contextual Graph is weighted to capture the degree of associativity between its concepts, and its vertices are ranked with PageRank methods, obtaining the Semantic Context Vector of the document.
- 3.
Inter-Document Similarity Computation: the similarity of two documents is computed by comparing their Semantic Context Vectors, e.g., with the cosine similarity.
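The pipeline can be sketched as follows. This is an illustration only: the graph, entity names, and edge weights are made up, and a plain personalized-PageRank power iteration stands in for the weighting and ranking schemes detailed in this section:

```python
import math

def add_edge(kg, a, b, w):
    """Undirected weighted edge in a dict-of-dicts graph."""
    kg.setdefault(a, {})[b] = w
    kg.setdefault(b, {})[a] = w

def semantic_context_vector(kg, seeds, alpha=0.85, iters=100):
    """Rank all vertices of the weighted contextual graph with
    personalized PageRank, restarting from the document's entities."""
    nodes = sorted(kg)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    r = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = sum(kg[n].values())  # total outgoing weight
            for m, w in kg[n].items():
                nxt[m] += alpha * r[n] * w / out
        r = nxt
    return [r[n] for n in nodes]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Toy contextual graph (a tiny stand-in for a knowledge-base subgraph)
kg = {}
add_edge(kg, "Roger_Daltrey", "The_Who", 0.9)
add_edge(kg, "Mick_Jagger", "The_Rolling_Stones", 0.9)
add_edge(kg, "The_Who", "London", 0.4)
add_edge(kg, "The_Rolling_Stones", "London", 0.4)

v1 = semantic_context_vector(kg, {"The_Rolling_Stones", "Roger_Daltrey"})
v2 = semantic_context_vector(kg, {"Mick_Jagger", "The_Who"})
print(round(cosine(v1, v2), 3))  # similarity emerges through the shared graph context
```

Even though the two documents mention disjoint entity sets, their Semantic Context Vectors overlap through the connecting structure of the graph, which is the effect a bag-of-words model cannot capture.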
Evaluation
In this section we evaluate CSA: firstly, we assess CSA efficacy by considering the correlation with human judges; secondly, we evaluate how CSA performs in a real-world application, employing it in an Information Retrieval framework; thirdly, we analyze CSA scalability in a clustering task on a large dataset.
All experiments have been performed on a server running Ubuntu 14.04, with 80 GB of RAM and an Intel Xeon E5-2670 v2 @ 2.50 GHz CPU. CSA has been implemented in Python 2.7.
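The first evaluation axis, agreement with human judges, is typically quantified with a rank correlation coefficient. A minimal pure-Python sketch (the similarity scores below are invented purely for illustration):

```python
import math

def rankdata(xs):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

# Hypothetical data: system similarity scores vs. human gold-standard judgments
system = [0.91, 0.35, 0.70, 0.12, 0.55]
humans = [4.8, 1.9, 3.7, 0.6, 3.1]
print(round(spearman(system, humans), 3))  # 1.0: the two rankings coincide
```

A coefficient close to 1 indicates that the system orders document pairs the same way human annotators do, which is the property the benchmark measures.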
Conclusion and future work
In this paper, we proposed Context Semantic Analysis (CSA), a novel knowledge-based technique for estimating inter-document similarity. The technique is based on a Semantic Context Vector, which can be extracted from a knowledge base, stored as metadata of a document, and employed to compute inter-document similarity. We showed the consistency of CSA with respect to human judges and how it outperforms standard (i.e., syntactic) inter-document similarity methods. Moreover, we obtained positive results by integrating CSA as an additional semantic layer into the KE4IR Information Retrieval framework, where KE4IR + CSA outperforms the original framework.
References
- et al., "Semantic annotation of the Cerealab database by the AGROVOC linked dataset", Ecological Informatics (2015)
- et al., "Semantically enhanced information retrieval: an ontology-based approach", Web Semant. Sci. Serv. Agents World Wide Web (2011)
- et al., "Combining user and database perspective for solving keyword queries over relational databases", Inf. Syst. (2016)
- et al., "DBpedia: a nucleus for a web of open data" (2007)
- et al., "Linked data - the story so far", Int. J. Semantic Web Inf. Syst. (2009)
- et al., "Wikidata: a free collaborative knowledgebase", Commun. ACM (2014)
- et al., "YAGO: a core of semantic knowledge"
- et al., "SNOMED CT: the advanced terminology and coding system for eHealth", Stud. Health Technol. Inform. (2006)
- et al., "The AGROVOC linked dataset", Semantic Web (2013)
- et al., "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles", Bioinformatics (2001)
- "Knowledge-based graph document modeling"
- "From frequency to meaning: vector space models of semantics", J. Artif. Int. Res.
- "A survey of text similarity approaches", Int. J. Comput. Appl.
- "Latent semantic analysis", Annu. Rev. Inf. Sci. Technol.
- "The probabilistic relevance framework: BM25 and beyond", Found. Trends Inf. Retr.
- "Statistical language models for information retrieval: a critical review", Found. Trends Inf. Retr.
- "Computing semantic relatedness using Wikipedia-based explicit semantic analysis"
- "WikiWalk: random walks on Wikipedia for semantic relatedness"