Abstract
Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters.
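The thresholding step can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `threshold_graph`, the toy matrix `SIM`, and the convention that two terms are joined when their similarity is at least T (rather than strictly greater) are all assumptions. The abstract also does not say which similarity measure fills the matrix, so the values are made up.

```python
def threshold_graph(sim, T):
    """Return the graph of the binary threshold matrix as {term: set(neighbours)}.

    sim is a symmetric list-of-lists of term-term similarities.
    Terms i and j are joined when sim[i][j] >= T (assumed convention).
    """
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the matrix is symmetric
            if sim[i][j] >= T:
                adj[i].add(j)
                adj[j].add(i)
    return adj


# Toy 4-term similarity matrix (illustrative values, not data from the paper).
SIM = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.3, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

graph = threshold_graph(SIM, 0.6)
```

Raising T removes edges, so a series of thresholds yields a nested series of graphs over the same term set.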
Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar.
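Definition (1) can be illustrated with a breadth-first search over the threshold graph. This is a sketch under stated assumptions: the graph is given as a dictionary of neighbour sets, the names `connected_components` and `ADJ` are hypothetical, and isolated terms are reported as one-term clusters, which the abstract does not specify.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as a list of sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        comp = set()
        while queue:
            v = queue.popleft()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components


# Toy threshold graph: terms 0-1-2 form a chain, term 3 is isolated.
ADJ = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
clusters = connected_components(ADJ)
```

Each term and each edge is visited once, so the cost grows linearly with the size of the threshold graph.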
Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner.
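The abstract does not give the details of the Bierstone or Bonner algorithms, so neither is reproduced here. As a stand-in, the sketch below finds maximal complete subgraphs with the Bron-Kerbosch procedure with pivoting, a different and later published method. The name `maximal_cliques` and the toy graph `ADJ` are assumptions made for illustration.

```python
def maximal_cliques(adj):
    """Enumerate maximal complete subgraphs (Bron-Kerbosch with pivoting).

    adj maps each vertex to the set of its neighbours.
    Returns a list of frozensets. This is NOT the Bierstone or Bonner algorithm.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already-tried vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = maximal_cliques(ADJ)
```

Unlike connected components, maximal complete subgraphs may overlap: term 2 belongs to both subgraphs in the toy graph.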
For threshold levels T ≥ 0.6, essentially the same clusters are obtained regardless of the cluster definition used. In such situations one need only find the connected components of the graph to obtain the clusters.
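One way to see when the definitions coincide: if every connected component of the threshold graph is itself complete, the maximal complete subgraphs are exactly the components, so definitions (1) and (2) give identical clusters. The check below is a sketch of that test, not a procedure taken from the paper; all function names and both toy graphs are assumptions.

```python
def components(adj):
    """Connected components by iterative depth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:  # unvisited neighbours only
                seen.add(w)
                stack.append(w)
        result.append(comp)
    return result


def is_complete(comp, adj):
    """True if every pair of distinct vertices in comp is joined by an edge."""
    return all(comp - {v} <= adj[v] for v in comp)


def definitions_agree(adj):
    """True if every connected component is a complete subgraph."""
    return all(is_complete(c, adj) for c in components(adj))


# A chain 0-1-2 is connected but not complete.
LOOSE = {0: {1}, 1: {0, 2}, 2: {1}}
# A triangle plus a separate linked pair: every component is complete.
TIGHT = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}

loose_agrees = definitions_agree(LOOSE)
tight_agrees = definitions_agree(TIGHT)
```

On real data the agreement at high thresholds is only approximate, as the abstract's wording indicates, so a check like this is a diagnostic rather than a guarantee.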
- 1 AUGUSTSON, J. G., AND MINKER, J. An analysis of some graph theoretical cluster techniques. Computer Science Center Tech. Rep. No. TR70-106, U. of Maryland, College Park, Md., Jan. 1970.
- 2 --. Experiments with graph theoretical clustering techniques. Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of requirements for MS degree, U. of Maryland, College Park, Md., 1969.
- 3 BAKER, F. B. Information retrieval based upon latent class analysis. J. ACM 9, 4 (Oct. 1962), 512-521.
- 4 BALL, G. H. Data analysis in the social sciences: What about the details? Proc. AFIPS 1965 Fall Joint Comput. Conf., Vol. 27, Pt. 1, pp. 533-559.
- 5 BERGE, C., AND GHOUILA-HOURI, A. Programming, Games, and Networks. Wiley, New York, 1965.
- 6 BIERSTONE, E. Cliques and generalized cliques in a finite linear graph. Unpublished rep.
- 7 BONNER, R. E. On some clustering techniques. IBM J. Res. Develop. 8, 1 (Jan. 1964), 22-32.
- 8 BORKO, H. The construction of an empirically based mathematically derived classification system. Rep. No. SP-585, System Development Corp., Santa Monica, Calif., Oct. 26, 1961.
- 9 --. Research in document classification and file organization. Rep. No. SP-1423, System Development Corp., Santa Monica, Calif., 1963.
- 10 -- AND BERNICK, M. D. Automatic document classification. Part II--Additional experiments. Tech. Memo TM-771/001/00, System Development Corp., Santa Monica, Calif., Oct. 18, 1963.
- 11 COSATI subject category list (DoD--modified). AD-624 000, Defense Documentation Center, Defense Supply Agency, Oct. 1965.
- 12 DALE, A. G., AND DALE, N. Some clumping experiments for information retrieval. LRC 64-WPI1, Linguistic Research Center, U. of Texas, Austin, Texas, Feb. 1964.
- 13 DATTOLA, R. T. A fast algorithm for automatic classification. In Information Storage and Retrieval series, Scientific Rep. No. ISR-14, Cornell U., Ithaca, N. Y., Oct. 1968, Ch. V.
- 14 DOYLE, L. B. Breaking the cost barrier in automatic classification. Rep. No. SP-2516, System Development Corp., Santa Monica, Calif., July 1966.
- 15 EURATOM-thesaurus: Keywords used within EURATOM's nuclear energy documentation project. Directorate "Dissemination of Information," Center for Information and Documentation, 1964.
- 16 GOTLIEB, C. C., AND KUMAR, S. Semantic clustering of index terms. J. ACM 15, 4 (Oct. 1968), 493-513.
- 17 GIULIANO, V. E., AND JONES, P. E. Linear associative information retrieval. In Howerton, P. W., and Weeks, D. C. (Eds.), Vistas in Information Handling, Vol. 1, Spartan Books, Washington, D. C., 1963, Ch. 2, pp. 30-46.
- 18 IVIE, E. L. Search procedures based on measures of relatedness between documents. Ph.D. thesis, MIT, Cambridge, Mass., May 1966.
- 19 KNUTH, D. E. The Art of Computer Programming: Vol. 1, Fundamental Algorithms. Addison-Wesley, Reading, Mass., 1968.
- 20 KOCHEN, M., AND WONG, E. Concerning the possibility of a cooperative information exchange. IBM J. Res. Develop. 6, 2 (April 1962), 270-271.
- 21 KUHNS, J. L. The continuum of coefficients of association. In Stevens, M. E., Giuliano, V. E., and Heilprin, L. B. (Eds.), Proc. Symposium in Statistical Association Methods for Mechanized Documentation, US Dep. of Commerce, Washington, D. C., Dec. 1965.
- 22 --. Mathematical analysis of correlation clusters. In Word correlation and automatic indexing, Progress Rep. No. 2, C82-OU1, Ramo-Wooldridge, Canoga Park, Calif., Dec. 1959, Appendix D.
- 23 LESK, M. E. Word-word association in document retrieval systems. In Information storage and retrieval, Scientific Rep. No. ISR-13, Cornell U., Ithaca, New York, Jan. 1968, Section IX.
- 24 MEETHAM, A. R. Graph separability and word grouping. Proc. 21st Nat. Conf. ACM, 1966, ACM Pub P-66, Thompson Book Co., Washington, D. C., pp. 513-514.
- 25 NEEDHAM, R. M. A method for using computers in information classification. In Popplewell, C. M. (Ed.), Information Processing 1962, Proc. IFIP Congr. 62, North-Holland, Amsterdam, 1963, pp. 284-287.
- 26 --. The termination of certain iterative processes. Memorandum RM-5188-PR, Rand Corp., Santa Monica, Calif., Nov. 1966.
- 27 --. The theory of clumps II. Rep. No. ML 139, Cambridge Language Research Unit, Cambridge, England, March 1961.
- 28 OGILVIE, J. C. The distribution of number and size of connected components in random graphs of medium size. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1527-1530.
- 29 ORE, O. Graphs and Their Use. Random House, New York, 1963.
- 30 PARKER-RHODES, A. F. Contributions to the theory of clumps: The usefulness and feasibility of the theory. Rep. No. ML 138, Cambridge Language Research Unit, Cambridge, England, March 1961.
- 31 -- AND NEEDHAM, R. M. The theory of clumps. Rep. No. ML 126, Cambridge Language Research Unit, Cambridge, England, Feb. 1960.
- 32 PRICE, N., AND SCHIMINOVICH, S. A clustering experiment: First step towards a computer-generated classification scheme. Inform. Storage and Retrieval 4, 3 (Aug. 1968), 271-280.
- 33 ROCCHIO, J. J., JR. Document retrieval systems--optimization and evaluation. Scientific Rep. No. ISR-10, Computation Laboratory, Harvard U., Cambridge, Mass., 1966.
- 34 ROGERS, D., AND TANIMOTO, T. A computer program for classifying plants. Science 132 (Oct. 1960), 1115-1118.
- 35 SALTON, G. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.
- 36 SHEPARD, M. J., AND WILLMOTT, A. J. Cluster analysis on the Atlas computer. Computer J. 11, 1 (May 1968), 57-62.
- 37 SPARCK-JONES, K. Automatic term classification and information retrieval. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1290-1295.
- 38 --. Mechanized semantic classification. 1961 International Conf. on Machine Translation of Languages and Applied Language Analysis, National Physical Laboratory Symposium No. 13, Volume II, 1962, Paper 25, pp. 417-435.
- 39 -- AND JACKSON, D. Current approaches to classification and clump-finding at Cambridge Language Research Unit. Computer J. 10, 1 (May 1967), 29-37.
- 40 STEVENS, M. E. Automatic indexing: A state of the art report. NBS Monograph 91, US Dep. of Commerce, Washington, D. C., March 1965.
- 41 STILES, H. E. The association factor in information retrieval. J. ACM 8, 2 (April 1961), 271-279.
- 42 -- AND SALISBURY, B. A. The use of the B-coefficient in information retrieval. Unpublished rep., Sept. 1967.
- 43 TANIMOTO, T. An elementary mathematical theory of classification and prediction. Rep., IBM Corp., 1958.