BIRCH: A New Data Clustering Algorithm and Its Applications

Zhang, Tian; Ramakrishnan, Raghu; Livny, Miron

doi:10.1023/A:1009783824328

Published: June 1997

Data Mining and Knowledge Discovery, Volume 1, pages 141–182 (1997)

Abstract

Data clustering is an important technique for exploratory data analysis and has been studied for several years. It has been shown to be useful in many practical domains, such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch of it. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). As a result, they do not scale well with dataset size in terms of memory requirements, running time, and result quality.

In this paper, an efficient and scalable data clustering method is proposed, based on a new data structure called the CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability, and scalability; we also compare it with other available methods. Finally, BIRCH is applied to two real-life problems: building an iterative and interactive pixel classification tool, and generating the initial codebook for image compression.
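The abstract does not spell out what the CF-tree stores. In the BIRCH algorithm, each tree entry is a clustering feature: the triple (N, LS, SS) holding the point count, the linear sum of the points, and the sum of their squared norms. The sketch below illustrates why such a triple can act as a compact summary: two summaries merge by simple addition, and the centroid and radius can be recovered without revisiting the raw points. The class and method names are illustrative assumptions, not taken from the authors' implementation, and the tree itself (branching, thresholds, rebuilding) is not modeled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusteringFeature:
    """Illustrative summary of a point set: count, linear sum, sum of squared norms."""
    n: int = 0
    ls: list = field(default_factory=list)  # per-dimension linear sum of the points
    ss: float = 0.0                          # sum of squared Euclidean norms

    def add_point(self, point):
        """Absorb one point into the summary in O(d) time."""
        if not self.ls:
            self.ls = [0.0] * len(point)
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def merge(self, other):
        """Additivity: the summary of a union is the component-wise sum."""
        if self.n == 0:
            return ClusteringFeature(other.n, list(other.ls), other.ss)
        if other.n == 0:
            return ClusteringFeature(self.n, list(self.ls), self.ss)
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarized points from their centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


# Usage: summarize two small groups separately, then merge the summaries.
left = ClusteringFeature()
for p in [(1.0, 2.0), (3.0, 4.0)]:
    left.add_point(p)

right = ClusteringFeature()
right.add_point((5.0, 6.0))

merged = left.merge(right)
print(merged.n, merged.centroid(), merged.radius())
```

Because a merge needs only the three stored quantities, memory use depends on the number of summaries kept rather than on the number of points seen, which is the property that lets a bounded in-memory structure stand in for a large dataset.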

Author information

Authors and Affiliations

Computer Sciences Department, University of Wisconsin, Madison, WI, 53706, U.S.A.
Tian Zhang, Raghu Ramakrishnan & Miron Livny

About this article

Cite this article

Zhang, T., Ramakrishnan, R. & Livny, M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery 1, 141–182 (1997). https://doi.org/10.1023/A:1009783824328

Issue Date: June 1997
DOI: https://doi.org/10.1023/A:1009783824328
