On principal component analysis, cosine and Euclidean measures in information retrieval
Introduction
Text document retrieval tasks typically involve large numbers of variables (words or terms). This is also evident in the Finnish newspaper article collection [15], which we aim to process for information retrieval both with cluster analysis (see, for example, [6], [11], [26]) and with instance-based learning techniques [23]. Variable selection is the traditional approach to this problem in information retrieval. Since very frequent words are poor discriminators, they can be excluded from the index using stop word lists. Similarly, very rare words are good candidates for elimination. However, simple frequency-based selection is applicable only as long as the eliminated variables are irrelevant to the retrieval task; extreme pruning of relevant variables will inevitably impair retrieval performance.
More sophisticated methods are necessary when the removal of the irrelevant variables alone will not reduce the search space enough. The retrieval task and the intended analysis methods greatly restricted our choice of dimensionality reduction methods. Some traditional methods, such as term-discrimination values [29], would have been quite slow because of the size of the collection. Class-based dimensionality reduction techniques [31] could not be used, since our research involves clustering of the documents as well as their classification into predefined classes. Therefore, we adopted the variable extraction approach [31], in which a new set of variables S is created from the original variables T so that |S| ≪ |T|. For this purpose we applied principal component analysis (PCA), a standard statistical technique for dimensionality reduction [12], [27], [32]. PCA combines the m original variables into m new variables, of which l (l ≪ m) retain most of the information in the data. Thus, by discarding the m − l least informative of the new variables, the search space can be further reduced without greatly affecting retrieval performance.
Although PCA and factor analysis have significant conceptual differences [27], [32], they are often confused with each other because both methods are able to reduce data. In practice, factor analysis is more suitable for identifying and explaining latent constructs, whereas PCA is better suited for the straightforward reduction of the data [27]. Both methods have connections with the latent semantic indexing (LSI) model [5], which, along with variable clustering [36], is typically used for variable extraction in information retrieval. PCA and LSI are related through singular value decomposition [3], [4], which can also be used to perform PCA efficiently.
There may be a catch in applying the popular cosine similarity to PCA results: the cosine similarities of the original data and of the PCA results differ, even if none of the new variables have been excluded, because PCA is performed on mean-corrected data [12], [27], [32]. Since many clustering and instance-based learning methods operate on similarities or dissimilarities between objects, the use of the cosine measure in conjunction with PCA may unnecessarily alter the inter-object similarities and the results based on them. This property is easy to see from the geometric interpretation of the cosine similarity as the cosine of the angle between the vectors from the origin O to the points A and B. When the data are centred at the origin P of the new coordinates, the angle between the vectors changes, and the cosine similarities cos ∠AOB and cos ∠APB differ.
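This effect is easy to demonstrate numerically. The following sketch (a toy illustration, not the paper's data) mean-corrects a small document-term matrix, as PCA does before projection, and shows that the cosine similarity between two rows changes while the Euclidean distance does not:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "document-term" rows
docs = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 2.0, 2.0]]

# Mean-correct each column, as PCA assumes
means = [sum(col) / len(docs) for col in zip(*docs)]
centred = [[x - m for x, m in zip(row, means)] for row in docs]

# Cosine similarity changes under centring ...
assert abs(cosine(docs[0], docs[1]) - cosine(centred[0], centred[1])) > 0.1
# ... but Euclidean distance is translation-invariant
assert abs(euclidean(docs[0], docs[1]) - euclidean(centred[0], centred[1])) < 1e-9
```

Translating all points by the column means moves the origin, which changes the angles between position vectors (hence the cosines) but leaves all pairwise point-to-point distances intact.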
Given the popularity of the cosine measure [24], [28], [35] and the infrequent use of PCA in information retrieval, this side-effect, although noted [20], is perhaps not widely appreciated in information retrieval research, which makes it interesting. We show the discrepancy in similarities and present a transform that exploits the relation between the cosine measure and the Euclidean distance. The transform allows methods that are invariant to monotonic transforms of distances, such as the nearest neighbor classification and the single and complete linkage clustering studied here, to produce the same results from the PCA results as they would produce from the original data with the cosine measure.
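A minimal sketch of the relation the transform rests on: for length-normalized (unit) vectors, the squared Euclidean distance equals 2(1 − cosine), a monotonically decreasing function of the cosine, so any method that only needs distance rankings gives the same answer either way. The data below are invented for illustration:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """For unit vectors the cosine is simply the dot product."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docs = [normalize(v) for v in ([3.0, 1.0, 0.0], [1.0, 2.0, 2.0], [0.0, 1.0, 4.0])]
query = normalize([2.0, 1.0, 1.0])

# For unit vectors: ||q - d||^2 = 2 - 2 cos(q, d)
for d in docs:
    assert abs(euclidean(query, d) - math.sqrt(2.0 - 2.0 * cosine(query, d))) < 1e-9

# Hence the nearest neighbour by smallest Euclidean distance is the
# nearest neighbour by largest cosine similarity
nn_euc = min(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
nn_cos = max(range(len(docs)), key=lambda i: cosine(query, docs[i]))
assert nn_euc == nn_cos
```

Because sqrt(2(1 − c)) strictly decreases in c, sorting by Euclidean distance ascending reproduces sorting by cosine descending, which is exactly the invariance the nearest neighbor and linkage methods need.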
This paper is organized as follows. In Section 2, the collection and its pre-processing are described. Then the conventional vector space model, normalization and principal component analysis are briefly covered. Section 4 focuses on the relation between the Euclidean distance and the cosine measure. In Section 5, the single and complete linkage clustering techniques are described and their invariance to the transform is proven. Section 6 presents clustering results: first, the issue of applying the cosine measure to the PCA results is demonstrated; second, additional runs show that PCA is an appropriate way to reduce the vector space; lastly, clustering is applied to a larger sample of documents. Section 7 concludes the current work.
The novelty of the present research lies mainly in the relation between the cosine measure and the Euclidean distance in conjunction with PCA, in the monotonicity analysis of the single and complete linkage techniques under this relation, in testing these with hierarchical clustering, and in testing hierarchical clustering particularly on documents with relevance assessments.
Data
As a Finno-Ugric language, Finnish differs substantially from the majority of European languages, which mostly belong to the family of the Indo-European languages and thus more or less resemble one another. It is typical of Finnish that words are inflected in numerous ways and may take several different suffixes, e.g. “talossammekinko” (“talo + ssa + mme + kin + ko”), corresponding approximately to “even in our house?” [1], [2]. Thus, there may in theory be as many as a few thousand different
Vector space models
The computational basis was grounded on the ordinary vector space model [25], [30], in which documents are represented as real-valued vectors di = (wi1, wi2, …, wim), where wij corresponds to the jth key (j = 1, 2, …, m) of the ith document (i = 1, 2, …, n). Thus, the collection is represented as an n × m document-term matrix D of n documents and m keys. To weight the keys, we exploited the well-known tf · idf scheme, where the inverse of the document frequency dfj (number of documents which include
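The exact weighting formula is cut off in this snippet, so the following sketch uses one common form of the tf · idf scheme, wij = tfij · log(n/dfj), on an invented toy collection; the paper's variant may differ in detail:

```python
import math

# Toy collection: documents as term-count dicts (pre-processing assumed done)
counts = [
    {"talo": 2, "katu": 1},
    {"talo": 1, "auto": 3},
    {"katu": 2, "auto": 1, "puu": 1},
]
n = len(counts)
terms = sorted({t for doc in counts for t in doc})

# df_j: number of documents that contain term j
df = {t: sum(1 for doc in counts if t in doc) for t in terms}

# w_ij = tf_ij * log(n / df_j): frequent-everywhere terms are down-weighted
D = [[doc.get(t, 0) * math.log(n / df[t]) for t in terms] for doc in counts]

assert len(D) == n and len(D[0]) == len(terms)
```

The resulting n × m matrix D is the document-term matrix on which length normalization and PCA then operate.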
Evaluation of the similarity or dissimilarity of documents
The present section begins with the cosine similarity and the Euclidean distance. The former is converted into a distance whose non-metricity is proven. Then, the cosine and Euclidean distances are shown to be, respectively, sensitive and invariant to mean-correction. Lastly, we derive a connection between the cosine and Euclidean distance for length-normalized document vectors. Although the non-metricity and the relation [20] between the cosine and Euclidean distance are
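The connection for length-normalized vectors can be sketched in a few lines (writing a and b for unit vectors, consistent with the section's setup):

```latex
\|\mathbf{a}-\mathbf{b}\|^{2}
  = \|\mathbf{a}\|^{2} + \|\mathbf{b}\|^{2} - 2\,\mathbf{a}\cdot\mathbf{b}
  = 2 - 2\cos(\mathbf{a},\mathbf{b}),
\qquad\text{so}\qquad
d_{E}(\mathbf{a},\mathbf{b}) = \sqrt{2\,\bigl(1 - \cos(\mathbf{a},\mathbf{b})\bigr)}.
```

That is, for unit vectors the Euclidean distance is a strictly decreasing function of the cosine similarity, so the two measures induce the same distance rankings.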
On hierarchical clustering of documents
Our research concerned the agglomerative hierarchical clustering techniques that have often been used to facilitate the cluster-based retrieval of documents [7], [26], [29], [36]. These methods cluster documents on the basis of proximities computed between document vectors. Hierarchical clustering techniques are computationally demanding and usually possess a time complexity of O(n2) or even O(n3), where n is the number of documents [26].
The agglomerative clustering techniques applied here are
Clustering results
The 500 × 7067 and 5000 × 13,693 document-term matrices were length-normalized and processed further with PCA, which gave 499 and 4999 principal components, respectively. No standardization (variance normalization) was necessary, since the variances were reasonably uniform and very small. Since the document vectors were length-normalized into unit vectors, Euclidean distances could be applied. The shares of the individual principal components in the total variance were very small, because the large number
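The pipeline of this section can be sketched as follows on a small random matrix (the real matrices are far larger): length-normalize the rows, mean-correct, run PCA via SVD, and confirm that when all components are kept the inter-document Euclidean distances in component space equal those in the original space:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((8, 5))                              # toy 8 x 5 document-term matrix
D = D / np.linalg.norm(D, axis=1, keepdims=True)    # length-normalize rows

Dc = D - D.mean(axis=0)                             # mean-correct, as PCA assumes
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)   # PCA via SVD
scores = U * s                                      # principal component scores

def pdist(X):
    """All pairwise Euclidean distances between the rows of X."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Rotation (Vt) preserves distances, and centring is a mere translation,
# so distances among the scores match those among the original rows
assert np.allclose(pdist(Dc), pdist(scores))
assert np.allclose(pdist(D), pdist(scores))
```

Truncating to the first l columns of `scores` then only shrinks distances by the variance carried in the discarded components, which is the sense in which the reduced space approximates the original one.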
Conclusions
The contribution of the current research was in the consideration of the Euclidean distance measure instead of the cosine measure, which is commonly used in information retrieval, when principal component analysis (PCA) was applied to reduce the search space. The Euclidean distance measure was useful because it is insensitive to the mean-correction that PCA traditionally assumes. Although the cosine distance is sensitive in this sense, this need not mean that it produces different results. We could use
Acknowledgements
The first author acknowledges the financial support of the Academy of Finland. The authors are grateful to Prof. Kalervo Järvelin and the staff of the Department of Information Studies, University of Tampere, for data and information.
References (37)
- et al., Exploiting concept clusters for content-based information retrieval, Information Sciences (2005)
- et al., Class normalization in centroid-based text categorization, Information Sciences (2006)
- et al., Modelling highly inflected languages, Information Sciences (2004)
- et al., Measuring the incremental information value of documents, Information Sciences (2006)
- et al., Computing with words for text processing: an approach to the text categorization, Information Sciences (2006)
- R. Alkula, From strings to Finnish words (in Finnish), PhD Thesis, Department of Information Studies, University of...
- From plain character strings to meaningful words: producing better full text databases for inflectional and compounding languages with morphological analysis software, Information Retrieval (2001)
- Finding out About, A Cognitive Perspective on Search Engine Technology and the WWW (2000)
- et al., Using algebra for intelligent information retrieval, SIAM Review (1995)
- et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science (1990)
- Cluster Analysis
- Comparison of hierarchic agglomerative clustering methods for document retrieval, The Computer Journal
- Information Retrieval, Algorithms and Heuristics
- Principles of Data Mining
- Matrix Analysis
- Algorithms for Clustering Data
- Principal Components Analysis
- Finding Groups in Data