BIRCH: A New Data Clustering Algorithm and Its Applications

Published in Data Mining and Knowledge Discovery

Abstract

Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as one of its branches. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). As a result, they do not scale well in terms of memory requirements, running time, and result quality as the dataset size increases.

In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called the CF-tree, which serves as a summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to two real-life problems: building an iterative and interactive pixel classification tool, and generating the initial codebook for image compression.
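
The abstract does not spell out what a CF-tree stores. In the BIRCH literature, each tree entry is a clustering feature: the triple (N, LS, SS) holding the number of points, their linear sum, and the sum of their squared norms. The sketch below illustrates why such a triple can act as a compact summary: two summaries merge by simple addition, and the centroid and radius can be recovered without revisiting the points. It is a minimal illustration under that assumption, not the authors' implementation; the class and function names (`ClusteringFeature`, `summarize`) are hypothetical, and the tree structure itself is omitted.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Summary of a set of d-dimensional points (hypothetical helper class)."""
    n: int                    # N: number of points summarised
    linear_sum: List[float]   # LS: component-wise sum of the points
    square_sum: float         # SS: sum of squared Euclidean norms

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Additivity: the summary of a union is the component-wise sum.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root-mean-square distance of the summarised points to the centroid,
        # computed from the triple alone: sqrt(SS / N - ||centroid||^2).
        c2 = sum(c * c for c in self.centroid())
        return sqrt(max(self.square_sum / self.n - c2, 0.0))


def summarize(points: Sequence[Sequence[float]]) -> ClusteringFeature:
    """Fold a non-empty sequence of points into one clustering feature."""
    cf = ClusteringFeature.from_point(points[0])
    for p in points[1:]:
        cf = cf.merge(ClusteringFeature.from_point(p))
    return cf


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    whole = summarize(pts)
    halves = summarize(pts[:2]).merge(summarize(pts[2:]))
    print(whole.centroid(), whole.radius(), whole == halves)
```

Because merging only adds three quantities, memory use depends on the number of summaries kept rather than on the number of points, which is the property the abstract's resource argument relies on.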

Cite this article

Zhang, T., Ramakrishnan, R. & Livny, M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery 1, 141–182 (1997). https://doi.org/10.1023/A:1009783824328
