Deep learning: a statistical viewpoint

Published online by Cambridge University Press: 04 August 2021

Peter L. Bartlett
Affiliation:
Departments of Statistics and EECS, University of California, Berkeley, CA 94720, USA. E-mail: peter@berkeley.edu
Andrea Montanari
Affiliation:
Departments of EE and Statistics, Stanford University, Stanford, CA 94304, USA. E-mail: montanar@stanford.edu
Alexander Rakhlin
Affiliation:
Department of Brain & Cognitive Sciences and Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: rakhlin@mit.edu

Abstract

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press
