Abstract
Prequential model selection and delete-one cross-validation are data-driven methodologies for choosing between rival models on the basis of their predictive abilities. For a given set of observations, the predictive ability of a model is measured by the model's accumulated prediction error (for prequential model selection) and by the model's average out-of-sample prediction error (for delete-one cross-validation). In this paper, given i.i.d. observations, we propose nonparametric regression estimators—based on neural networks—that select the number of "hidden units" (or "neurons") using either prequential model selection or delete-one cross-validation. As our main contributions: (i) we establish rates of convergence for the integrated mean-squared errors in estimating the regression function using "off-line" or "batch" versions of the proposed estimators, and (ii) we establish rates of convergence for the time-averaged expected prediction errors in using "on-line" versions of the proposed estimators. We also present computer simulations (i) empirically validating the proposed estimators and (ii) empirically comparing the proposed estimators with certain novel prequential and cross-validated "mixture" regression estimators.
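The two selection criteria described above can be illustrated with a minimal sketch. Assuming a simple stand-in for the paper's setting (polynomial degree playing the role of the number of hidden units, and least-squares fitting playing the role of neural-network training — both hypothetical simplifications, not the authors' estimators), the prequential score accumulates one-step-ahead squared prediction errors, while the delete-one cross-validation score averages held-out squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical i.i.d. sample from y = f(x) + noise; polynomial degree
# stands in for the number of hidden units as the complexity index.
n = 60
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

def fit_predict(x_tr, y_tr, x_te, degree):
    """Least-squares polynomial fit; a stand-in for neural-net training."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    return np.polyval(coefs, x_te)

def prequential_score(x, y, degree, warmup=5):
    """Accumulated one-step-ahead squared prediction error:
    at each time t, fit on observations 1..t-1 and predict observation t."""
    errs = []
    for t in range(warmup, len(x)):
        pred = fit_predict(x[:t], y[:t], x[t:t + 1], degree)
        errs.append((y[t] - pred[0]) ** 2)
    return float(np.sum(errs))

def loo_cv_score(x, y, degree):
    """Average delete-one out-of-sample squared prediction error:
    for each i, fit on all observations except i and predict observation i."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = fit_predict(x[mask], y[mask], x[i:i + 1], degree)
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

# Each criterion selects the complexity that minimizes its score.
degrees = range(1, 8)
best_preq = min(degrees, key=lambda d: prequential_score(x, y, d))
best_cv = min(degrees, key=lambda d: loo_cv_score(x, y, d))
print("prequential choice:", best_preq, "| delete-one CV choice:", best_cv)
```

Both criteria reuse only the data at hand: the prequential score mimics on-line prediction (each point is predicted before it is "seen"), while delete-one cross-validation mimics batch out-of-sample performance.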
Cite this article
Modha, D.S., Masry, E. Prequential and Cross-Validated Regression Estimation. Machine Learning 33, 5–39 (1998). https://doi.org/10.1023/A:1007577530334