ABSTRACT

Consider a tree to be constructed from a learning sample for the purpose of regression, classification, or class probability estimation. Using a test sample to estimate the risk of a tree structured procedure requires that one set of sample data be used to construct the procedure and a disjoint set be used to evaluate it. This is a reasonable approach when the combined set of available data contains a thousand or more cases. Test samples or cross-validation can also be used to select a particular procedure d_k = d_{T_k} from among the candidates d_k, 1 ≤ k ≤ K. Suppose, say, that cross-validation is used for this selection. A method based on the bootstrap can then be used to reduce the overoptimism of the resubstitution estimate R(d) of the overall risk R*(d) of a tree structured rule d based on a learning sample.
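
To make the idea of bootstrap correction concrete, the following is a minimal sketch, not the method of the text itself. It assumes a one-split regression tree (a "stump") under squared-error loss and uses a standard optimism correction: the average gap between a bootstrap-fitted rule's risk on the original sample and its risk on its own bootstrap sample is added to the resubstitution estimate. The names fit_stump, predict, risk, and bootstrap_corrected_risk are hypothetical and introduced only for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by least squares; returns (threshold, left_mean, right_mean)."""
    best = None
    # Candidate splits "x <= t" at every distinct value except the largest,
    # so that both children are always non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        # All x values are equal: fall back to a constant predictor.
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr


def risk(stump, xs, ys):
    """Mean squared error of the stump on the sample (xs, ys)."""
    return sum((y - predict(stump, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def bootstrap_corrected_risk(xs, ys, n_boot=200, seed=0):
    """Return (resubstitution estimate, bootstrap-corrected estimate)."""
    rng = random.Random(seed)
    n = len(xs)
    resub = risk(fit_stump(xs, ys), xs, ys)  # risk measured on the learning sample itself
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n cases with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        stump = fit_stump(bx, by)
        # Optimism of this replicate: risk on the original sample
        # minus apparent risk on the bootstrap sample.
        optimism += risk(stump, xs, ys) - risk(stump, bx, by)
    return resub, resub + optimism / n_boot


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): a step function plus Gaussian noise.
    rng = random.Random(1)
    xs = [rng.random() for _ in range(100)]
    ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.3) for x in xs]
    resub, corrected = bootstrap_corrected_risk(xs, ys)
    print("resubstitution:", resub, "bootstrap-corrected:", corrected)
```

The stump stands in for a full tree only to keep the sketch short; the same correction loop applies to any rule that can be refitted on a bootstrap sample.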