Elsevier

Information Sciences

Volume 453, July 2018, Pages 154-167

Fast and effective cluster-based information retrieval using frequent closed itemsets

https://doi.org/10.1016/j.ins.2018.04.008

Abstract

Document information retrieval consists of finding the documents in a collection that are the most relevant to a user query. Information retrieval techniques are widely used by organizations to facilitate the search for information. However, applying traditional information retrieval techniques is time-consuming for large document collections. Recently, cluster-based information retrieval approaches have been developed. Although these approaches are often much faster than traditional approaches for processing large document collections, the quality of the documents they retrieve is often lower than that of traditional approaches. To address this drawback of cluster-based approaches, and improve the performance of information retrieval both in terms of runtime and quality of retrieved documents, this paper proposes a new cluster-based information retrieval approach named ICIR (Intelligent Cluster-based Information Retrieval). The proposed approach combines k-means clustering with frequent closed itemset mining to extract clusters of documents and find frequent terms in each cluster. Patterns discovered in each cluster are then used to select the most relevant document clusters to answer each user query. Four alternative heuristics are proposed to select the most relevant clusters, and two alternative heuristics are proposed for choosing documents in the selected clusters. Thus, eight versions of the proposed approach are obtained. To validate the proposed approach, extensive experiments have been carried out on well-known document collections. Results show that the designed approach outperforms traditional and cluster-based information retrieval approaches both in terms of execution time and quality of the returned documents.

Introduction

DIR (Document Information Retrieval) is the task of retrieving the documents from a collection that are the most relevant to a user query [32]. The traditional approach for DIR consists of first scanning all documents in a collection to compute a score for each document that indicates its relevance to the user's query. A ranking function is then applied to select the most relevant documents (those with the highest scores) and show them to the user [32]. Although this process has a polynomial time complexity, applying it to answer queries on large collections of documents can result in a long runtime. To improve the performance of document information retrieval, cluster-based approaches have been proposed. The key idea of these approaches is to perform a preprocessing step where the documents in a collection are grouped into clusters of similar documents. Then, to answer a query, cluster-based approaches first select the clusters that are the most relevant to the query, and then only search for documents in these clusters. Because cluster-based approaches do not scan the whole collection of documents to answer a query, they can be considerably faster than traditional DIR approaches.
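As a rough illustration of the traditional approach described above, the following Python sketch scores every document against the query and keeps the highest-scoring ones. It assumes the vector model with raw term frequencies and cosine similarity. The toy collection, the function names `cosine` and `rank_full_scan`, and the `top_n` parameter are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_full_scan(docs, query, top_n=3):
    """Score every document in the collection, then keep the top_n best."""
    scored = [(doc_id, cosine(tf, query)) for doc_id, tf in docs.items()]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical toy collection (not the collection used in the paper).
docs = {
    "d1": Counter({"heuristic": 2, "search": 1}),
    "d2": Counter({"network": 3, "wireless": 1}),
    "d3": Counter({"search": 2, "graph": 1}),
}
query = Counter({"search": 1})

ranking = rank_full_scan(docs, query, top_n=2)
print(ranking)  # d3 ranks first, then d1
```

Every query touches all documents in this sketch, which is the cost that cluster-based approaches try to avoid by searching only a few clusters.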

In recent decades, several data mining-based approaches have been proposed to improve the performance of cluster-based information retrieval. These approaches extract knowledge from a collection of documents by applying a data mining algorithm. Then, this knowledge is used to answer user queries. Two main approaches have been proposed. On the one hand, several studies [6], [24], [25], [30], [40], [41] have applied partitioning algorithms (e.g. k-means [23] and CLUBS+ [25]) to assign documents to k disjoint clusters, where each cluster contains similar documents. On the other hand, algorithms such as HFTC (Hierarchical Frequent Term-based Clustering) [5], FIHC (Frequent Itemset-based Hierarchical Clustering) [13], TDC (Topic Document Clustering) [37] and LATRE (Lazy Associative Tag REcommender) [27] apply FIM (Frequent Itemset Mining) [12], [38] to discover frequent terms in a document collection. Then, the k most frequent patterns are used to group documents that share similar terms. These approaches can be viewed as a way of decomposing a problem into several sub-problems that can be solved independently.
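The partitioning idea can be sketched with a minimal version of k-means (Lloyd's algorithm) over term-frequency vectors. This is an illustrative sketch only: the initial centroids are taken to be the first k points, an assumption made here for reproducibility, and the two-dimensional vectors (hypothetically, the frequencies of "search" and "network" in four documents) are invented for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, max_iter=100):
    """Assign each point to one of k disjoint clusters (Lloyd's algorithm)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical vectors: (frequency of "search", frequency of "network").
points = [[2, 0], [0, 3], [3, 1], [0, 2]]
labels, centroids = kmeans(points, k=2)
print(labels)     # [0, 1, 0, 1]
print(centroids)  # [[2.5, 0.5], [0.0, 2.5]]
```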

Although cluster-based approaches can answer queries much faster than traditional approaches on large document collections, cluster-based approaches tend to retrieve documents that are less relevant. To address this drawback of cluster-based approaches, and improve the performance of information retrieval both in terms of runtime and quality of retrieved documents, this paper proposes a new cluster-based information retrieval approach named ICIR (Intelligent Cluster-based Information Retrieval), which combines both clustering and frequent itemset mining. To the best of our knowledge, this is the first study that combines several data mining techniques for solving the well-known document information retrieval problem.

The major contributions of this paper are threefold:

  • The proposed ICIR approach improves upon the preprocessing step of existing cluster-based information retrieval approaches by applying both clustering and closed frequent itemset mining to extract rich knowledge from a collection of documents that can be used to answer queries. The preprocessing step of ICIR is executed once and consists of two steps. ICIR first runs the k-means algorithm to partition documents into several clusters. Then, ICIR applies the DCI_Closed algorithm to each document cluster to extract sets of terms (closed itemsets) that frequently occur in that cluster.

  • The proposed ICIR approach also introduces an improved query answering process. This process utilizes the knowledge extracted by the preprocessing step to answer each user query. Unlike the traditional DIR approach that scans a whole collection of documents to answer a query, the proposed approach relies on the sets of closed frequent terms to find the clusters of documents that are the most relevant to the user query. This is performed in three steps, called the matching step, the selection step and the returning step. In the matching step, a new measure is calculated to score the relevance of each cluster of documents to the user query by considering the closed frequent terms found in each cluster. In the selection step, one of four alternative heuristics is applied to select the most relevant clusters of documents for the user query. In the returning step, one of two alternative strategies (called full and partial) is applied to extract relevant documents from the selected clusters. In the literature, several information retrieval models have been developed, such as the vector model [32], the LDA (latent Dirichlet allocation) model [36], and the logic model [33]. The proposed approach applies the vector model to select documents as it is simple and easy to use.

  • To evaluate the performance of the proposed approach, extensive experiments have been carried out on well-known medium, large and big document collections. Results show that the proposed approach outperforms state-of-the-art data mining-based, cluster-based and other DIR approaches in terms of runtime and quality of retrieved documents.
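The query answering process outlined in the second contribution (matching, selection, returning) can be sketched as follows. This sketch does not reproduce the paper's measure or its four selection heuristics, which are not given in this excerpt. The score in `cluster_score` (support-weighted overlap between the query and each closed pattern), the "keep the m best clusters" rule in `select_clusters`, the overlap-based document ranking in `answer`, and all toy data are assumptions made purely for illustration.

```python
def cluster_score(query_terms, closed_patterns):
    """Hypothetical matching score: support-weighted overlap with the query."""
    return sum(
        len(pattern & query_terms) * count
        for pattern, count in closed_patterns.items()
    )


def select_clusters(query_terms, patterns_by_cluster, m=1):
    """Stand-in selection heuristic: keep the m highest-scoring clusters."""
    ranked = sorted(
        patterns_by_cluster,
        key=lambda cid: (-cluster_score(query_terms, patterns_by_cluster[cid]), cid),
    )
    return ranked[:m]


def answer(query_terms, patterns_by_cluster, docs_by_cluster, m=1):
    """Return matching documents from the selected clusters only."""
    hits = []
    for cid in select_clusters(query_terms, patterns_by_cluster, m):
        for doc_id, terms in docs_by_cluster[cid].items():
            overlap = len(terms & query_terms)  # simple stand-in for a document score
            if overlap:
                hits.append((doc_id, overlap))
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits


# Hypothetical output of a preprocessing step: closed patterns with supports.
patterns_by_cluster = {
    0: {frozenset({"search"}): 3, frozenset({"heuristic", "search"}): 2},
    1: {frozenset({"network"}): 2, frozenset({"network", "wireless"}): 2},
}
docs_by_cluster = {
    0: {"d1": {"heuristic", "search"}, "d3": {"search", "graph"}},
    1: {"d2": {"network", "wireless"}},
}
query_terms = {"heuristic", "search"}

selected = select_clusters(query_terms, patterns_by_cluster, m=1)
results = answer(query_terms, patterns_by_cluster, docs_by_cluster, m=1)
print(selected)  # [0]
print(results)   # [('d1', 2), ('d3', 1)]
```

Only the documents of cluster 0 are inspected for this query; the documents of cluster 1 are never scored, which is where the runtime saving of a cluster-based approach would come from.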

The rest of the paper is organized as follows. Section 2 reviews the main cluster-based and frequent itemset mining DIR approaches. Section 3 gives an overview of the proposed approach. Section 4 and Section 5 describe the proposed approach in detail. Section 6 provides an example of how the approach is applied. Section 7 presents the experimental evaluation. Finally, Section 8 draws a conclusion and discusses opportunities for future work.

Section snippets

Related work

Several approaches have been proposed for the DIR problem [4], [18], [36]. This section first presents an overview of cluster-based information retrieval methods. Then, it surveys document information retrieval approaches that utilize term mining.

ICIR: Intelligent cluster-based information retrieval approach

This section presents the proposed ICIR (Intelligent Cluster-based Information Retrieval) approach, which employs both clustering and frequent itemset mining to improve the quality of documents retrieved using a cluster-based information retrieval approach. The designed approach consists of two main steps. The first one, called the preprocessing step, consists of generating clusters of documents and then extracting frequent terms from the documents in each cluster. The second step, called selection

Preprocessing step

The preprocessing step consists of two phases, which are called document decomposition and closed frequent term discovery. The following paragraphs explain these two phases.
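To make the notion of closed frequent terms concrete, the sketch below mines closed frequent itemsets from the documents of a single cluster, each document being treated as a set of terms. It uses brute-force enumeration rather than DCI_Closed, so it only illustrates the definition (a frequent itemset is closed if no proper superset has the same support) and is not the algorithm used by ICIR. The function names, the `min_support` value and the toy cluster are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions (documents) that contain every term of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return result


def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(
            itemset < other and count == other_count
            for other, other_count in frequent.items()
        )
    }


# Hypothetical cluster of three documents, each reduced to its set of terms.
cluster_docs = [
    {"search", "heuristic"},
    {"search", "heuristic", "graph"},
    {"search", "graph"},
]
frequent = frequent_itemsets(cluster_docs, min_support=2)
closed = closed_itemsets(frequent)
for itemset, count in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

In this toy cluster, {graph} and {heuristic} are frequent but not closed, because {graph, search} and {heuristic, search} occur in exactly the same documents. Closed itemsets therefore summarise the frequent terms of a cluster without losing support information.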

Selection step

The selection step utilizes the knowledge extracted during the preprocessing step to efficiently retrieve relevant documents to answer a user request. This step consists of three main phases:

  1. Score matching. The score between the user request Req and the set of closed frequent terms is computed for each cluster of documents. Let $F = \{F_1, F_2, \ldots, F_k\}$ be the set of patterns found in each cluster, where $F_i = \{F_{i1}, F_{i2}, \ldots, F_{ip_i}\}$ is the set of closed frequent terms of the $i$th cluster. $F_{ij}$ represents the $j$th closed

Illustrative example

This section presents a detailed example of how the proposed approach is applied to answer a query. The three steps (preprocessing, matching, and returning documents) are described. Consider the following collection of 10 documents, where each document is represented as a set of pairs of the form (x, y), where x is a term from the set of terms {heuristic, optimization, search, intelligent, network, wireless, node, graph, process, information, system, model} and y represents its frequency: d1: (

Implementation and performance evaluation

A number of experiments have been carried out to evaluate the performance of the proposed ICIR approach. This section is divided into two parts. The first one, called Implementation, describes the collections of documents used in the experiments, and defines the evaluation measures. The second part, called Performance Evaluation, first explains how the parameters of the proposed approach have been set (the number of clusters, the minimum support, the selection heuristic and the return

Conclusion

This paper has proposed a novel cluster-based information retrieval approach for document information retrieval. The designed approach, named ICIR, combines two knowledge discovery techniques to extract useful knowledge from a given document collection. First, the k-means clustering algorithm is applied to partition a document collection into clusters of similar documents. Second, a modified version of the DCI_Closed closed frequent itemset mining algorithm is run to extract frequent terms in

References (42)

  • F. Beil et al. Frequent term-based text clustering. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002)
  • D. Cai et al. Document clustering using locality preserving indexing. IEEE Trans. Knowl. Data Eng. (2005)
  • X. Cai et al. Ranking through clustering: an integrated approach to multi-document summarization. IEEE Trans. Audio Speech Lang. Process. (2013)
  • D.R. Cutting et al. Scatter/gather: a cluster-based approach to browsing large document collections. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1992)
  • Y. Djenouri et al. Bees swarm optimisation using multiple strategies for association rule mining. Int. J. Bio-Inspir. Comput. (2014)
  • T. Ebesu et al. Neural semantic personalized ranking for item cold-start recommendation. Inf. Retr. J. (2017)
  • P. Fournier-Viger et al. A survey of itemset mining. WIREs Data Min. Knowl. Discov. (2017)
  • B.C. Fung et al. Hierarchical document clustering using frequent itemsets. Proceedings of SIAM International Conference on Data Mining (2003)
  • M.A. Hearst et al. Reexamining the cluster hypothesis: scatter/gather on retrieval results. Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996)
  • X. Jin et al. Hybrid indexing for versioned document search with cluster-based retrieval. Proceedings of the Twenty Fifth ACM International on Conference on Information and Knowledge Management (2016)
  • T. Joachims. Optimizing search engines using clickthrough data. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002)