
An Analysis of Some Graph Theoretical Cluster Techniques

Published: 01 October 1970

Abstract

Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters.
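The thresholding step can be illustrated with a minimal Python sketch. The similarity values, the function name `threshold_matrix`, and the choice of `>=` as the comparison are assumptions made for illustration; the abstract does not specify the similarity measure or whether the comparison is strict.

```python
def threshold_matrix(sim, t):
    """Return a binary matrix with 1 where sim[i][j] >= t and i != j, else 0.

    The diagonal is zeroed so that the resulting graph has no self-loops
    (an assumption; the abstract does not say how the diagonal is treated).
    """
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= t else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical symmetric term-term similarities for four terms.
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.65, 0.2],
    [0.3, 0.65, 1.0, 0.05],
    [0.1, 0.2, 0.05, 1.0],
]

# One binary threshold matrix per threshold value T.
binary = threshold_matrix(sim, 0.6)
for row in binary:
    print(row)
```

Each binary matrix is the adjacency matrix of an undirected graph whose vertices are terms; raising T removes edges and tends to break the graph into smaller pieces.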

Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar.
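Definition (1) can be sketched as a breadth-first search over a binary threshold matrix. This is a generic textbook approach, not necessarily the algorithm analyzed in the paper; the function name `connected_components` and the example matrix are hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph.

    adj is a square 0/1 adjacency matrix (list of lists). Each component
    is returned as a sorted list of vertex indices.
    """
    n = len(adj)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        comp = []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in range(n):
                if adj[v][w] and not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(sorted(comp))
    return components


# Hypothetical threshold matrix: terms 0-1-2 form a chain, term 3 is isolated.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
components = connected_components(adj)
print(components)
```

Note that terms 0 and 2 land in the same cluster even though they are not directly linked; definition (2) would split this chain into the two complete subgraphs {0, 1} and {1, 2}.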

Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner.
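The abstract does not give the details of either the Bierstone or the Bonner algorithm, so neither is reproduced here. As a stand-in, the following sketch enumerates maximal complete subgraphs with the basic Bron-Kerbosch recursion (a different, later algorithm); the function name and the example graph are hypothetical.

```python
def maximal_cliques(neighbors):
    """Enumerate maximal complete subgraphs with basic Bron-Kerbosch.

    neighbors maps each vertex to the set of vertices adjacent to it.
    This is NOT the Bierstone or Bonner algorithm from the paper.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates that extend r;
        # x: vertices already explored (used to reject non-maximal cliques).
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & neighbors[v], x & neighbors[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(neighbors), set())
    return sorted(cliques)


# Hypothetical graph: a triangle 0-1-2, a pendant edge 2-3, an isolated vertex 4.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
cliques = maximal_cliques(neighbors)
print(cliques)
```

The number of maximal complete subgraphs can grow exponentially with the number of vertices, which is one reason running time matters for large index collections.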

For threshold levels T ≥ 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
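One way to see why the definitions can agree at high thresholds: when every connected component of the threshold graph is itself a complete subgraph, definitions (1) and (2) yield exactly the same clusters. The sketch below checks that special case; it is only an illustration, since the paper's finding is empirical ("basically the same clusters") rather than an exact equivalence. All names and example graphs are hypothetical.

```python
def components(neighbors):
    """Connected components of a graph given as vertex -> set of neighbors."""
    seen, result = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(neighbors[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def is_complete(comp, neighbors):
    """True if every pair of distinct vertices in comp is adjacent."""
    return all(comp - {v} <= neighbors[v] for v in comp)


def definitions_coincide(neighbors):
    """True if each connected component is a complete subgraph."""
    return all(is_complete(c, neighbors) for c in components(neighbors))


# Hypothetical high-threshold graph: a triangle and a single edge.
tight = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
# Hypothetical low-threshold graph: a chain whose ends are not linked.
loose = {0: {1}, 1: {0, 2}, 2: {1}}

print(definitions_coincide(tight))
print(definitions_coincide(loose))
```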

References

  1. AUGUSTSON, J. G., AND MINKER, J. An analysis of some graph theoretical cluster techniques. Computer Science Center Tech. Rep. No. TR70-106, U. of Maryland, College Park, Md., Jan. 1970.
  2. --. Experiments with graph theoretical clustering techniques. Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of requirements for MS degree, U. of Maryland, College Park, Md., 1969.
  3. BAKER, F.B. Information retrieval based upon latent class analysis. J. ACM 9, 4 (Oct. 1962), 512-521.
  4. BALL, G.H. Data analysis in the social sciences: What about the details? Proc. AFIPS 1965 Fall Joint Comput. Conf., Vol. 27, Pt. 1, pp. 533-559.
  5. BERGE, C., AND GHOUILA-HOURI, A. Programming, Games, and Networks. Wiley, New York, 1965.
  6. BIERSTONE, E. Cliques and generalized cliques in a finite linear graph. Unpublished rep.
  7. BONNER, R. E. On some clustering techniques. IBM J. Res. Develop. 8, 1 (Jan. 1964), 22-32.
  8. BORKO, H. The construction of an empirically based mathematically derived classification system. Rep. No. SP-585, System Development Corp., Santa Monica, Calif., Oct. 26, 1961.
  9. --. Research in document classification and file organization. Rep. No. SP-1423, System Development Corp., Santa Monica, Calif., 1963.
  10. -- AND BERNICK, M. D. Automatic document classification. Part II--Additional experiments. Tech. Memo TM-771/001/00, System Development Corp., Santa Monica, Calif., Oct. 18, 1963.
  11. COSATI subject category list (DoD--modified). AD-624 000, Defense Documentation Center, Defense Supply Agency, Oct. 1965.
  12. DALE, A. G., AND DALE, N. Some clumping experiments for information retrieval. LRC 64-WPI1, Linguistic Research Center, U. of Texas, Austin, Texas, Feb. 1964.
  13. DATTOLA, R.T. A fast algorithm for automatic classification. In Information Storage and Retrieval series, Scientific Rep. No. ISR-14, Cornell U., Ithaca, N. Y., Oct. 1968, Ch. V.
  14. DOYLE, L.B. Breaking the cost barrier in automatic classification. Rep. No. SP-2516, System Development Corp., Santa Monica, Calif., July 1966.
  15. EURATOM-thesaurus: Keywords used within EURATOM's nuclear energy documentation project. Directorate "Dissemination of Information," Center for Information and Documentation, 1964.
  16. GOTLIEB, C. C., AND KUMAR, S. Semantic clustering of index terms. J. ACM 15, 4 (Oct. 1968), 493-513.
  17. GIULIANO, V. E., AND JONES, P.E. Linear associative information retrieval. In Howerton, P. W., and Weeks, D. C. (Eds.), Vistas in Information Handling, Vol. 1, Spartan Books, Washington, D. C., 1963, Ch. 2, pp. 30-46.
  18. IVIE, E. L. Search procedures based on measures of relatedness between documents. Ph.D. thesis, MIT, Cambridge, Mass., May 1966.
  19. KNUTH, D. E. The Art of Computer Programming: Vol. 1, Fundamental Algorithms. Addison-Wesley, Reading, Mass., 1968.
  20. KOCHEN, M., AND WONG, E. Concerning the possibility of a cooperative information exchange. IBM J. Res. Develop. 6, 2 (April 1962), 270-271.
  21. KUHNS, J. L. The continuum of coefficients of association. In Stevens, M. E., Giuliano, V. E., and Heilprin, L. B. (Eds.), Proc. Symposium in Statistical Association Methods for Mechanized Documentation, US Dep. of Commerce, Washington, D. C., Dec. 1965.
  22. --. Mathematical analysis of correlation clusters. In Word correlation and automatic indexing, Progress Rep. No. 2, C82-OU1, Ramo-Wooldridge, Canoga Park, Calif., Dec. 1959, Appendix D.
  23. LESK, M. E. Word-word association in document retrieval systems. In Information storage and retrieval, Scientific Rep. No. ISR-13, Cornell U., Ithaca, New York, Jan. 1968, Section IX.
  24. MEETHAM, A.R. Graph separability and word grouping. Proc. 21st Nat. Conf. ACM, 1966, ACM Pub P-66, Thompson Book Co., Washington, D. C., pp. 513-514.
  25. NEEDHAM, R.M. A method for using computers in information classification. In Popplewell, C. M. (Ed.), Information Processing 1962, Proc. IFIP Congr. 62, North-Holland, Amsterdam, 1963, pp. 284-287.
  26. --. The termination of certain iterative processes. Memorandum RM-5188-PR, Rand Corp., Santa Monica, Calif., Nov. 1966.
  27. --. The theory of clumps II. Rep. No. ML 139, Cambridge Language Research Unit, Cambridge, England, March 1961.
  28. OGILVIE, J.C. The distribution of number and size of connected components in random graphs of medium size. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1527-1530.
  29. ORE, O. Graphs and Their Use. Random House, New York, 1963.
  30. PARKER-RHODES, A. F. Contributions to the theory of clumps: The usefulness and feasibility of the theory. Rep. No. ML 138, Cambridge Language Research Unit, Cambridge, England, March 1961.
  31. -- AND NEEDHAM, R.M. The theory of clumps. Rep. No. ML 126, Cambridge Language Research Unit, Cambridge, England, Feb. 1960.
  32. PRICE, N., AND SCHIMINOVICH, S. A clustering experiment: First step towards a computer-generated classification scheme. Inform. Storage and Retrieval 4, 3 (Aug. 1968), 271-280.
  33. ROCCHIO, J. J., JR. Document retrieval systems--optimization and evaluation. Scientific Rep. No. ISR-10, Computation Laboratory, Harvard U., Cambridge, Mass., 1966.
  34. ROGERS, D., AND TANIMOTO, T. A computer program for classifying plants. Science 132 (Oct. 1960), 1115-1118.
  35. SALTON, G. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.
  36. SHEPHERD, M. J., AND WILLMOTT, A. J. Cluster analysis on the Atlas computer. Computer J. 11, 1 (May 1968), 57-62.
  37. SPARCK-JONES, K. Automatic term classification and information retrieval. In Morrell, A. J. H. (Ed.), Information Processing 68, Proc. IFIP Congress 68, Vol. 2--Hardware, Applications, North-Holland, Amsterdam, 1969, pp. 1290-1295.
  38. --. Mechanized semantic classification. 1961 International Conf. on Machine Translation of Languages and Applied Language Analysis, National Physics Laboratory Symposium No. 13, Volume II, 1962, Paper 25, pp. 417-435.
  39. -- AND JACKSON, D. Current approaches to classification and clump-finding at Cambridge Language Research Unit. Computer J. 10, 1 (May 1967), 29-37.
  40. STEVENS, M.E. Automatic indexing: A state of the art report. NBS Monograph 91, US Dep. of Commerce, Washington, D. C., March 1965.
  41. STILES, H. E. The association factor in information retrieval. J. ACM 8, 2 (April 1961), 271-279.
  42. -- AND SALISBURY, B. A. The use of the B-coefficient in information retrieval. Unpublished rep., Sept. 1967.
  43. TANIMOTO, T. An elementary mathematical theory of classification and prediction. Rep., IBM Corp., 1958.

Published in

Journal of the ACM, Volume 17, Issue 4
Oct. 1970
169 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/321607

Copyright © 1970 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
