ABSTRACT
Developing regression models that are both accurate and easy to interpret is an important data mining problem for large datasets. Regression trees with linear models in the leaves satisfy both requirements, but so far no truly scalable algorithm for constructing them has been known. This paper proposes SECRET, a novel regression tree construction algorithm that produces high-quality trees and scales to very large datasets. At every node, SECRET uses the EM algorithm for Gaussian mixtures to find two clusters in the data and to locally transform the regression problem into a classification problem based on closeness to these clusters. Goodness-of-split measures, such as the Gini gain, can then be used to select the split variable and split point, much as in classification tree construction. Scalability is achieved by employing scalable versions of the EM and classification tree construction algorithms. An experimental evaluation on real and artificial data shows that SECRET matches the accuracy of other linear regression tree algorithms while taking orders of magnitude less computation time on large datasets.
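The abstract compresses SECRET's per-node procedure into two sentences; the sketch below makes it concrete. It is a minimal illustration, assuming scikit-learn's GaussianMixture as the EM component. The joint (x, y) clustering space, the posterior-based cluster labeling standing in for "closeness", and the exhaustive threshold scan are simplifying assumptions chosen for exposition, not the paper's scalable implementation.

```python
# Minimal sketch of SECRET's per-node split selection (illustrative, not
# the authors' implementation). Assumes scikit-learn for the EM step.
import numpy as np
from sklearn.mixture import GaussianMixture

def gini(labels):
    """Gini impurity of a binary 0/1 label vector."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def secret_split(X, y):
    """Pick a (variable, point) split for one tree node.

    1. Run EM for a 2-component Gaussian mixture in the joint
       (predictor, response) space.
    2. Label each point by its closer cluster, locally turning the
       regression problem into a binary classification problem.
    3. Choose the (variable, threshold) pair with the largest Gini
       gain, as in classification tree construction.
    """
    Z = np.column_stack([X, y])                  # joint (x, y) space
    gm = GaussianMixture(n_components=2, random_state=0).fit(Z)
    labels = gm.predict(Z)                       # closeness-based class labels

    n, d = X.shape
    parent = gini(labels)
    best = (None, None, 0.0)                     # (variable, point, gain)
    for j in range(d):
        # Naive exhaustive scan over candidate split points; the paper
        # uses scalable classification-tree split selection instead.
        for t in np.unique(X[:, j])[:-1]:
            left = labels[X[:, j] <= t]
            right = labels[X[:, j] > t]
            gain = parent - (len(left) * gini(left) +
                             len(right) * gini(right)) / n
            if gain > best[2]:
                best = (j, t, gain)
    return best
```

Once the tree is grown by recursing on the chosen splits, each leaf would fit an ordinary least-squares linear model to its points (e.g., via numpy.linalg.lstsq), yielding the linear regression tree the abstract describes.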