Information Sciences, Volume 177, Issue 22, 15 November 2007, Pages 4893–4905

On principal component analysis, cosine and Euclidean measures in information retrieval

https://doi.org/10.1016/j.ins.2007.05.027

Abstract

Clustering methods group document objects represented as vectors. A very high-dimensional vector space may hinder the application of these methods; therefore, the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves the mean-correction of the data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used a connection between the cosine measure and the Euclidean distance in association with PCA, and based retrieval on the latter. We applied single linkage, complete linkage and Ward clustering to Finnish documents, utilizing their relevance assessments as a new feature. After normalization of the data, PCA was run and the relevant documents were clustered.

Introduction

Text document retrieval tasks typically involve large numbers of variables (words or terms). This was also evident in the Finnish newspaper article collection [15], which we aim to process for information retrieval both with cluster analysis (see, for example, [6], [11], [26]) and with instance-based learning techniques [23]. Variable selection is the traditional approach to this problem in information retrieval. Since very frequent words are poor discriminators, they can be excluded from the index using stop word lists. Similarly, very rare words are good candidates for elimination. However, simple frequency-based selection is applicable only as long as the eliminated variables are irrelevant to the retrieval task: excessive pruning of relevant variables will inevitably impair the retrieval performance.
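
As an aside, the following minimal sketch illustrates such frequency-based pruning on a hypothetical toy count matrix; the thresholds are illustrative assumptions, not the procedure used in this work.

```python
# Hedged sketch of frequency-based term pruning (not the authors' exact
# procedure): drop terms that occur in too many or too few documents.
import numpy as np

# Hypothetical document-term count matrix: rows = documents, columns = terms.
D = np.array([
    [3, 1, 0, 2, 0],
    [2, 0, 1, 2, 0],
    [4, 0, 0, 3, 1],
    [1, 2, 0, 2, 0],
])
n_docs = D.shape[0]

df = (D > 0).sum(axis=0)          # document frequency of each term
too_common = df / n_docs > 0.9    # near-ubiquitous terms discriminate poorly
too_rare = df <= 1                # hapax terms are candidates for elimination
keep = ~(too_common | too_rare)

D_pruned = D[:, keep]
print("kept term indices:", np.flatnonzero(keep))
print("pruned matrix shape:", D_pruned.shape)
```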

More sophisticated methods are necessary when the removal of the irrelevant variables alone will not reduce the search space enough. The retrieval task and the intended analysis methods greatly restricted our choice of dimensionality reduction methods. Some traditional methods, such as term-discrimination values [29], would have been quite slow because of the size of the collection. Class-based dimensionality reduction techniques [31] could not be used since our research involves clustering of the documents as well as their classification in predefined classes. Therefore, we adopted the variable extraction approach [31], where a new set of variables S is created from the original variables T so that |S| ≪ |T|. For this purpose we applied principal component analysis (PCA), which is a standard statistical technique for dimensionality reduction [12], [27], [32]. PCA combines the m original variables into m new variables of which l (l ≪ m) include most of the information on the data. Thus, by removing the m − l irrelevant variables from the new variables, the search space can be further reduced without greatly affecting the retrieval performance.
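
A minimal sketch of this variable-extraction step, using scikit-learn's standard PCA on synthetic data; the matrix size and the number l of retained components are illustrative assumptions, not the paper's values.

```python
# Variable extraction with PCA: the m original term variables are combined
# into new variables, and only the l components carrying most of the
# variance are kept, shrinking the search space from m to l dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))        # 100 documents, m = 500 term variables

l = 20                            # number of new variables kept (l << m)
pca = PCA(n_components=l)
X_reduced = pca.fit_transform(X)  # note: PCA mean-corrects X internally

print(X_reduced.shape)                      # (100, 20)
print(pca.explained_variance_ratio_.sum())  # share of information retained
```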

Although PCA, as applied to the present data, and factor analysis have significant conceptual differences [27], [32], they are often confused with each other because both methods can reduce data. In practice, factor analysis is more suitable for identifying and explaining latent constructs, whereas PCA is better for the straightforward reduction of the data [27]. Both methods have connections with the latent semantic indexing (LSI) model [5], which, along with variable clustering [36], is typically used for variable extraction in information retrieval. PCA and LSI are related through singular value decomposition [3], [4], which can also be used to carry out PCA efficiently.
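
The SVD route to PCA can be sketched as follows on synthetic data (dimensions are arbitrary): the principal component scores are read directly off the singular value decomposition of the mean-corrected matrix, which is also the decomposition underlying LSI.

```python
# PCA via SVD of the mean-corrected data matrix.
import numpy as np

rng = np.random.default_rng(1)
D = rng.random((50, 200))            # n x m document-term matrix

Dc = D - D.mean(axis=0)              # mean-correction (centering)
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)

scores = U * s                       # principal component scores
variances = s**2 / (D.shape[0] - 1)  # component variances (eigenvalues)

# The rows of Vt are the principal axes; keeping the first l columns of
# `scores` gives the reduced representation, much like a truncated SVD in LSI.
l = 10
D_reduced = scores[:, :l]
print(D_reduced.shape)               # (50, 10)
```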

There may be a catch in applying the popular cosine similarity to PCA results: the cosine similarities of the original data and of the PCA results differ, even if none of the new variables has been excluded, because PCA is performed on mean-corrected data [12], [27], [32]. Since many clustering and instance-based learning methods operate on similarities or dissimilarities between objects, using the cosine measure in conjunction with PCA may unnecessarily alter the inter-object similarities and the results based on them. The property is easy to see from the geometric interpretation of the cosine similarity as the cosine of the angle between the vectors OA and OB from the origin O to the points A and B. When the data are centred at the origin P of the new coordinate system, the angle between the vectors changes and the cosine similarities cos(∠(OA, OB)) and cos(∠(PA, PB)) differ.
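
A small numeric illustration of this effect, on our own toy data rather than the paper's collection:

```python
# Mean-correction changes cosine similarities: the angle at the original
# origin O differs from the angle at the centroid P of the centered data.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(2)
X = rng.random((10, 5))     # 10 toy document vectors

A, B = X[0], X[1]
Xc = X - X.mean(axis=0)     # mean-corrected data, as assumed by PCA
Ac, Bc = Xc[0], Xc[1]

print(cosine(A, B))         # cos of the angle between OA and OB
print(cosine(Ac, Bc))       # cos of the angle between PA and PB -- differs
```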

Given the popularity of the cosine measure [24], [28], [35] and the infrequent use of PCA in information retrieval, this side-effect, although noted [20], is perhaps not widely recognized in information retrieval research, and it is therefore of interest. We show the discrepancy in similarities and present a transform that uses the relation between the cosine measure and the Euclidean distance. The transform allows methods that are invariant to monotonic transforms of distances, such as the nearest neighbor classification and the single and complete linkage clustering studied here, to produce the same results from the PCA output as they would produce from the original data with the cosine measure.
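
For unit-length vectors the relation in question is presumably the standard identity ||a − b||² = 2(1 − cos(a, b)), under which the Euclidean distance is a monotone decreasing function of the cosine similarity. A minimal check of this identity on arbitrary normalized vectors:

```python
# Check: for length-normalized vectors, the squared Euclidean distance is
# a monotone transform of the cosine similarity.
import numpy as np

rng = np.random.default_rng(3)
a = rng.random(8)
b = rng.random(8)
a /= np.linalg.norm(a)      # length-normalize to unit vectors
b /= np.linalg.norm(b)

cos_ab = a @ b
d2 = np.sum((a - b) ** 2)

print(d2, 2 * (1 - cos_ab))              # identical up to rounding
assert np.isclose(d2, 2 * (1 - cos_ab))
```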

This paper is organized as follows. In Section 2, the collection and its pre-processing are described. Then the conventional vector space model, normalization and principal component analysis are briefly covered. Section 4 focuses on the relation between the Euclidean distance and the cosine measure. In Section 5, the single and complete linkage clustering techniques are described and their invariance to the transform is proven. Section 6 presents the clustering results: first, the issue of applying the cosine measure to PCA results is demonstrated; second, additional runs show that PCA is appropriate for reducing the vector space; lastly, clustering is applied to a larger sample of documents. Section 7 concludes the work.

The novelty of the present research lies mainly in the relation between the cosine measure and the Euclidean distance in conjunction with PCA, in the monotonicity analysis of the single and complete linkage techniques based on this relation, in testing these with hierarchical clustering, and in testing hierarchical clustering particularly on relevance-assessed documents.

Section snippets

Data

As a Finno-Ugric language, Finnish differs substantially from the majority of European languages, most of which belong to the Indo-European family and thus more or less resemble each other. It is typical of Finnish that words are inflected in numerous ways and may take several different suffixes, e.g. “talossammekinko” (“talo + ssa + mme + kin + ko”), corresponding approximately to “even in our house?” [1], [2]. Thus, there may in theory be as many as a few thousand different…

Vector space models

The computational basis was the ordinary vector space model [25], [30], in which documents are represented as real-valued vectors di = (wi1, wi2, …, wim), where wij is the weight of the jth key (j = 1, 2, …, m) in the ith document (i = 1, 2, …, n). Thus, the collection is represented as an n × m document-term matrix D of n documents and m keys. To weight the keys, we exploited the well-known tf · idf scheme

wij = tfij · log2(n / dfj),

where the inverse of the document frequency dfj (the number of documents which include key j)…
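
A minimal sketch of this weighting scheme on a hypothetical raw term-frequency matrix (the data are invented for illustration):

```python
# tf-idf weighting: w_ij = tf_ij * log2(n / df_j).
import numpy as np

tf = np.array([             # n = 3 documents, m = 4 keys
    [2., 0., 1., 3.],
    [0., 1., 0., 2.],
    [1., 1., 0., 0.],
])

n = tf.shape[0]
df = (tf > 0).sum(axis=0)   # df_j: number of documents containing key j
W = tf * np.log2(n / df)    # weighted document-term matrix D

print(W)
```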

Evaluation of the similarity or dissimilarity of documents

The present section begins with the cosine similarity and the Euclidean distance. The former is converted into a distance whose non-metricity is proven. Then, the cosine and Euclidean distances are shown to be, respectively, sensitive and invariant to mean-correction. Lastly, we derive a connection between the cosine and the Euclidean distance for document vectors that are normalized with respect to their lengths. Although the non-metricity and the relation [20] between the cosine and Euclidean distance are…
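
As a concrete illustration of the non-metricity claim, assuming the common conversion d(a, b) = 1 − cos(a, b) (the section's exact construction may differ), here is a three-point counterexample to the triangle inequality:

```python
# The cosine distance 1 - cos(a, b) violates the triangle inequality,
# shown here with three unit vectors in the plane.
import numpy as np

def cos_dist(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])                 # angle 0
c = np.array([1.0, 1.0]) / np.sqrt(2.0)  # angle 45 degrees
b = np.array([0.0, 1.0])                 # angle 90 degrees

print(cos_dist(a, b))                    # 1.0
print(cos_dist(a, c) + cos_dist(c, b))   # ~0.586 < 1.0
assert cos_dist(a, b) > cos_dist(a, c) + cos_dist(c, b)
```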

On hierarchical clustering of documents

Our research concerned the agglomerative hierarchical clustering techniques that have often been used to facilitate the cluster-based retrieval of documents [7], [26], [29], [36]. These methods cluster documents on the basis of proximities computed between document vectors. Hierarchical clustering techniques are computationally demanding and usually have a time complexity of O(n²) or even O(n³), where n is the number of documents [26].

The agglomerative clustering techniques applied here are…
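
As a sketch of this setup, the linkage methods named in this work (single, complete and Ward) are all available in SciPy; the data below are synthetic stand-ins for the length-normalized document vectors, and the cluster count is an arbitrary illustration:

```python
# Single, complete, and Ward linkage on Euclidean distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.random((30, 10))                 # 30 documents, 10 components
X /= np.linalg.norm(X, axis=1, keepdims=True)

dists = pdist(X, metric="euclidean")     # condensed distance matrix

for method in ("single", "complete", "ward"):
    Z = linkage(dists, method=method)    # agglomerative merge tree
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes
```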

Clustering results

The 500 × 7067 and 5000 × 13,693 document-term matrices were length-normalized and processed further with PCA, which gave 499 and 4999 principal components, respectively. No standardization (variance normalization) was necessary, since the variances were reasonably uniform and very small. Since the document vectors were length-normalized into unit vectors, Euclidean distances could be applied. The proportions of the total variance accounted for by the individual principal components were quite small, because the large number…
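
An end-to-end sketch of this pipeline at reduced scale (all sizes invented): length-normalization, PCA via SVD, and Euclidean distances on the component scores. It also checks two points made in the text: n mean-corrected documents yield at most n − 1 principal components (hence 499 and 4999 above), and Euclidean distances survive the centering and rotation intact.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
D = rng.random((50, 120))                      # n = 50 docs, m = 120 keys
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-length document vectors

Dc = D - D.mean(axis=0)                        # mean-correction
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
scores = U * s                                 # principal component scores

print(np.sum(s > 1e-10))                       # at most n - 1 = 49 components
assert np.allclose(pdist(scores), pdist(D))    # Euclidean distances unchanged
dists = pdist(scores, metric="euclidean")      # input for clustering
print(dists.shape)                             # (50 * 49 / 2,) = (1225,)
```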

Conclusions

The contribution of the current research was the consideration of the Euclidean distance measure in place of the cosine measure commonly used in information retrieval when principal component analysis (PCA) was applied to reduce the search space. The Euclidean distance measure was useful because it is insensitive to the mean-correction traditionally assumed by PCA. Although the cosine distance is sensitive in this sense, this need not mean that it produces different results. We could use…

Acknowledgements

The first author acknowledges the financial support of the Academy of Finland. The authors are grateful to Prof. Kalervo Järvelin and the staff of the Department of Information Studies, University of Tampere, for data and information.

References (37)

  • B.S. Everitt et al., Cluster Analysis (2001)
  • A. El-Hamdouchi et al., Comparison of hierarchic agglomerative clustering methods for document retrieval, The Computer Journal (1989)
  • D.A. Grossman et al., Information Retrieval: Algorithms and Heuristics (2004)
  • D.J. Hand et al., Principles of Data Mining (2001)
  • R.A. Horn et al., Matrix Analysis (1990)
  • A.K. Jain et al., Algorithms for Clustering Data (1988)
  • I.T. Jolliffe, Principal Component Analysis (1986)
  • L. Kaufman et al., Finding Groups in Data (1990)