An Analysis of Some Graph Theoretical Cluster Techniques

Authors:
J. Gary Augustson, University of Maryland, Computer Science Center, College Park, Maryland
Jack Minker, University of Maryland, Computer Science Center, College Park, Maryland

Journal of the ACM, Volume 17, Issue 4, pp. 571–588. https://doi.org/10.1145/321607.321608

Published: 01 October 1970

Abstract

Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters.
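
The thresholding step can be sketched as follows. This is a minimal illustration, not the authors' program: the similarity values are hypothetical, the abstract does not state how similarity is computed, and the choice of `>=` (rather than `>`) for the comparison with T is an assumption.

```python
def threshold_matrix(sim, t):
    """Return binary matrix B with B[i][j] = 1 when sim[i][j] >= t and i != j.

    The diagonal is forced to 0 so that the matrix describes a graph
    without self-loops. Using >= is an assumption made for this sketch.
    """
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= t else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical symmetric term-term similarity values for four terms.
SIM = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]

# A series of thresholds yields a series of binary matrices (graphs).
for t in (0.25, 0.5, 0.75):
    print(t, threshold_matrix(SIM, t))
```

Raising T removes edges, so each graph in the series is a subgraph of the one before it.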

Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar.
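
Definition (1) can be illustrated with a short sketch. It assumes the binary threshold matrix is stored as a symmetric list of lists and uses a plain depth-first search; the function name and the example matrix are hypothetical and are not taken from the paper.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph.

    adj is a symmetric 0/1 adjacency matrix (list of lists).
    Each component is returned as a sorted list of vertex indices.
    """
    n = len(adj)
    seen = [False] * n
    comps = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        comp = []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in range(n):
                if adj[v][w] and not seen[w]:
                    seen[w] = True
                    stack.append(w)
        comps.append(sorted(comp))
    return comps


# Hypothetical threshold matrix with edges 0-1, 1-2 and 3-4.
B = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(connected_components(B))
```

Terms 0 and 2 land in the same cluster here even though they are not directly linked, which is the main difference between definition (1) and the complete-subgraph definitions.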

Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner.
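
The abstract does not give the details of the Bierstone or Bonner procedures, so the sketch below does not attempt to reproduce either. It substitutes the basic Bron-Kerbosch recursion (a later, widely known method) purely to show what "finding the maximal complete subgraphs" of a threshold matrix means; the function name and the example graph are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate the maximal complete subgraphs of an undirected graph.

    adj is a symmetric 0/1 adjacency matrix. This is the basic
    Bron-Kerbosch recursion without pivoting, used here as a stand-in;
    it is NOT the Bierstone or Bonner algorithm.
    """
    n = len(adj)
    nbrs = [{j for j in range(n) if adj[i][j] and i != j} for i in range(n)]
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices.
        if not p and not x:
            found.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return sorted(found)


# Hypothetical graph: triangle 0-1-2, extra edge 2-3, isolated vertex 4.
G = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(maximal_cliques(G))
```

Unlike connected components, maximal complete subgraphs may overlap: vertex 2 belongs to two of them in this example.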

For threshold levels T ≥ 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
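
One way to see why the definitions can coincide: if every connected component of the thresholded graph is itself a complete subgraph, then the components and the maximal complete subgraphs are the same sets. The sketch below checks that condition on hypothetical similarity values chosen only for illustration; it makes no claim about the paper's corpus, and the `>=` comparison is again an assumption.

```python
def components(adj):
    """Connected components of a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in range(n):
                if adj[v][w] and w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps


def definitions_agree(sim, t):
    """True when every component at threshold t is a complete subgraph,
    i.e. connected components and maximal complete subgraphs coincide."""
    n = len(sim)
    adj = [[int(i != j and sim[i][j] >= t) for j in range(n)]
           for i in range(n)]
    return all(adj[u][v]
               for comp in components(adj)
               for u in comp for v in comp if u != v)


# Hypothetical similarity values for five terms (not the paper's data).
SIM = [
    [1.0, 0.9, 0.5, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1, 0.1],
    [0.5, 0.2, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
for t in (0.3, 0.6):
    print(t, definitions_agree(SIM, t))
```

In this toy data the component {0, 1, 2} at T = 0.3 lacks the edge 1-2, so the definitions differ; at T = 0.6 only tightly linked pairs remain and they agree.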

References

1 AUGUSTSON, J. G., AND MINKER, J. An analysis of some graph theoretical cluster techniques. Computer Science Center Tech. Rep. No. TR70-106, U. of Maryland, College Park, Md., Jan. 1970.
2 --. Experiments with graph theoretical clustering techniques. Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of requirements for MS degree, U. of Maryland, College Park, Md., 1969.
3 BAKER, F. B. Information retrieval based upon latent class analysis. J. ACM 9, 4 (Oct. 1962), 512-521.
4 BALL, G. H. Data analysis in the social sciences: What about the details? Proc. AFIPS 1965 Fall Joint Comput. Conf., Vol. 27, Pt. 1, pp. 533-559.
5 BERGE, C., AND GHOUILA-HOURI, A. Programming, Games, and Networks. Wiley, New York, 1965.
6 BIERSTONE, E. Cliques and generalized cliques in a finite linear graph. Unpublished rep.
7 BONNER, R. E. On some clustering techniques. IBM J. Res. Develop. 8, 1 (Jan. 1964), 22-32.
8 BORKO, H. The construction of an empirically based mathematically derived classification system. Rep. No. SP-585, System Development Corp., Santa Monica, Calif., Oct. 26, 1961.
9 --. Research in document classification and file organization. Rep. No. SP-1423, System Development Corp., Santa Monica, Calif., 1963.
10 -- AND BERNICK, M. D. Automatic document classification. Part II--Additional experiments. Tech. Memo TM-771/001/00, System Development Corp., Santa Monica, Calif., Oct. 18, 1963.
11 COSATI subject category list (DoD--modified). AD-624 000, Defense Documentation Center, Defense Supply Agency, Oct. 1965.
12 DALE, A. G., AND DALE, N. Some clumping experiments for information retrieval. LRC 64-WPI1, Linguistic Research Center, U. of Texas, Austin, Texas, Feb. 1964.
13 DATTOLA, R. T. A fast algorithm for automatic classification. In Information Storage and Retrieval series, Scientific Rep. No. ISR-14, Cornell U., Ithaca, N. Y., Oct. 1968, Ch. V.
14 DOYLE, L. B. Breaking the cost barrier in automatic classification. Rep. No. SP-2516, System Development Corp., Santa Monica, Calif., July 1966.
15 EURATOM-thesaurus: Keywords used within EURATOM's nuclear energy documentation project. Directorate "Dissemination of Information," Center for Information and Documentation, 1964.
16 GOTLIEB, C. C., AND KUMAR, S. Semantic clustering of index terms. J. ACM 15, 4 (Oct. 1968), 493-513.
17 GIULIANO, V. E., AND JONES, P. E. Linear associative information retrieval. In Howerton, P. W., and Weeks, D. C. (Eds.), Vistas in Information Handling, Vol. 1, Spartan Books, Washington, D. C., 1963, Ch. 2, pp. 30-46.
18 IVIE, E. L. Search procedures based on measures of relatedness between documents. Ph.D. thesis, MIT, Cambridge, Mass., May 1966.
19 KNUTH, D. E. The Art of Computer Programming: Vol. 1, Fundamental Algorithms. Addison-Wesley, Reading, Mass., 1968.
20 KOCHEN, M., AND WONG, E. Concerning the possibility of a cooperative information exchange. IBM J. Res. Develop. 6, 2 (April 1962), 270-271.
21 KUHNS, J. L. The continuum of coefficients of association. In Stevens, M. E., Giuliano, V. E., and Heilprin, L. B. (Eds.), Proc. Symposium in Statistical Association Methods for Mechanized Documentation, US Dep. of Commerce, Washington, D. C., Dec. 1965.
22 --. Mathematical analysis of correlation clusters. In Word correlation and automatic indexing, Progress Rep. No. 2, C82-OU1, Ramo-Wooldridge, Canoga Park, Calif., Dec. 1959, Appendix D.
23 LESK, M. E. Word-word association in document retrieval systems. In Information storage and retrieval, Scientific Rep. No. ISR-13, Cornell U., Ithaca, New York, Jan. 1968, Section IX.
24 MEETHAM, A. R. Graph separability and word grouping. Proc. 21st Nat. Conf. ACM, 1966, ACM Pub P-66, Thompson Book Co., Washington, D. C., pp. 513-514.
25 NEEDHAM, R. M. A method for using computers in information classification. In Popplewell, C. M. (Ed.), Information Processing 1962, Proc. IFIP Congr. 62, North-Holland, Amsterdam, 1963, pp. 284-287.
26 --. The termination of certain iterative processes. Memorandum RM-5188-PR, Rand Corp., Santa Monica, Calif., Nov. 1966.
27 --. The theory of clumps II. Rep. No. ML 139, Cambridge Language Research Unit, Cambridge, England, March 1961.
28 OGILVIE, J. C. The distribution of number and size of connected components in random graphs of medium size. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1527-1530.
29 ORE, O. Graphs and Their Use. Random House, New York, 1963.
30 PARKER-RHODES, A. F. Contributions to the theory of clumps: The usefulness and feasibility of the theory. Rep. No. ML 138, Cambridge Language Research Unit, Cambridge, England, March 1961.
31 -- AND NEEDHAM, R. M. The theory of clumps. Rep. No. ML 126, Cambridge Language Research Unit, Cambridge, England, Feb. 1960.
32 PRICE, N., AND SCHIMINOVICH, S. A clustering experiment: First step towards a computer-generated classification scheme. Inform. Storage and Retrieval 4, 3 (Aug. 1968), 271-280.
33 ROCCHIO, J. J., JR. Document retrieval systems--optimization and evaluation. Scientific Rep. No. ISR-10, Computation Laboratory, Harvard U., Cambridge, Mass., 1966.
34 ROGERS, D., AND TANIMOTO, T. A computer program for classifying plants. Science 132 (Oct. 1960), 1115-1118.
35 SALTON, G. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.
36 SHEPHERD, M. J., AND WILLMOTT, A. J. Cluster analysis on the Atlas computer. Computer J. 11, 1 (May 1968), 57-62.
37 SPARCK-JONES, K. Automatic term classification and information retrieval. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1290-1295.
38 --. Mechanized semantic classification. 1961 International Conf. on Machine Translation of Languages and Applied Language Analysis, National Physics Laboratory Symposium No. 13, Volume II, 1962, Paper 25, pp. 417-435.
39 -- AND JACKSON, D. Current approaches to classification and clump-finding at Cambridge Language Research Unit. Computer J. 10, 1 (May 1967), 29-37.
40 STEVENS, M. E. Automatic indexing: A state of the art report. NBS Monograph 91, US Dep. of Commerce, Washington, D. C., March 1965.
41 STILES, H. E. The association factor in information retrieval. J. ACM 8, 2 (April 1961), 271-279.
42 -- AND SALISBURY, B. A. The use of the B-coefficient in information retrieval. Unpublished rep., Sept. 1967.
43 TANIMOTO, T. An elementary mathematical theory of classification and prediction. Rep., IBM Corp., 1958.

Index Terms

An Analysis of Some Graph Theoretical Cluster Techniques
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Mathematics of computing
  1. Discrete mathematics

Published in

Journal of the ACM Volume 17, Issue 4
Oct. 1970
169 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/321607

Copyright © 1970 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
