DOI: 10.1145/775047.775117

SECRET: a scalable linear regression tree algorithm

Published: 23 July 2002

ABSTRACT

Developing regression models for large datasets that are both accurate and easy to interpret is a very important data mining problem. Regression trees with linear models in the leaves satisfy both these requirements, but thus far, no truly scalable regression tree algorithm is known. This paper proposes a novel regression tree construction algorithm (SECRET) that produces trees of high quality and scales to very large datasets. At every node, SECRET uses the EM algorithm for Gaussian mixtures to find two clusters in the data and to locally transform the regression problem into a classification problem based on closeness to these clusters. Goodness-of-split measures, like the Gini gain, can then be used to determine the split variable and the split point much like in classification tree construction. Scalability of the algorithm can be achieved by employing scalable versions of the EM and classification tree construction algorithms. An experimental evaluation on real and artificial data shows that SECRET has accuracy comparable to other linear regression tree algorithms but takes orders of magnitude less computation time for large datasets.
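The abstract's core per-node idea — fit a two-component Gaussian mixture in the joint (predictor, response) space, relabel each point by its closer cluster, and then score candidate splits with the Gini gain as in classification tree construction — can be sketched as follows. This is a simplified illustration under stated assumptions (isotropic covariances, a deterministic farthest-point initialization, plain in-memory EM rather than a scalable variant); the function names are my own, not the paper's.

```python
import numpy as np

def em_two_clusters(Z, iters=30):
    """Fit a 2-component spherical Gaussian mixture to Z (n x d) with plain EM
    and return a hard 0/1 label per point (cluster with larger responsibility).
    Illustrative only: the paper uses a scalable EM variant."""
    n, d = Z.shape
    # Deterministic init: the first point and the point farthest from it.
    i1 = int(np.argmax(((Z - Z[0]) ** 2).sum(axis=1)))
    mu = Z[[0, i1]].astype(float)
    var = np.full(2, Z.var() + 1e-9)   # one spherical variance per component
    w = np.array([0.5, 0.5])           # mixing weights
    for _ in range(iters):
        # E-step: responsibilities under isotropic Gaussian densities.
        d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, 2)
        logp = np.log(w) - 0.5 * d2 / var - 0.5 * d * np.log(var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights, means, and variances.
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        mu = (r.T @ Z) / nk[:, None]
        var = np.array([(r[:, k] * ((Z - mu[k]) ** 2).sum(axis=1)).sum()
                        for k in range(2)]) / (d * nk) + 1e-9
    return (r[:, 1] > r[:, 0]).astype(int)

def gini_gain(labels, mask):
    """Gini gain of splitting binary labels into the mask / ~mask children."""
    def gini(y):
        if y.size == 0:
            return 0.0
        p = y.mean()
        return 2.0 * p * (1.0 - p)
    n, nl = labels.size, mask.sum()
    return (gini(labels)
            - nl / n * gini(labels[mask])
            - (n - nl) / n * gini(labels[~mask]))
```

With these two pieces, a node would be split by evaluating candidate (variable, split point) pairs against the EM-derived labels and keeping the pair with the largest Gini gain, then fitting a linear model in each resulting leaf.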


Published in

KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
July 2002, 719 pages
ISBN: 1-58113-567-X
DOI: 10.1145/775047

Copyright © 2002 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '02 paper acceptance rate: 44 of 307 submissions, 14%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
