
Large-Scale Machine Learning with Stochastic Gradient Descent

  • Conference paper
  • In: Proceedings of COMPSTAT'2010

Abstract

During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by computing time rather than by sample size. A more precise analysis uncovers qualitatively different tradeoffs for small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely candidates such as stochastic gradient descent show surprisingly good performance on large-scale problems. In particular, second-order stochastic gradient descent and averaged stochastic gradient descent are asymptotically efficient after a single pass over the training set.
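To make the updates concrete, here is a minimal Python sketch of one-pass SGD with Polyak-Ruppert iterate averaging, applied to L2-regularized least squares. The model, loss, step-size schedule, and all names are illustrative assumptions, not the paper's exact algorithms or experimental setup.

```python
import numpy as np

def sgd_one_pass(X, y, lr0=0.1, lam=1e-4):
    """One pass of SGD with iterate averaging for L2-regularized
    least squares.

    A minimal sketch under assumed choices of model and step-size
    schedule; not the paper's exact procedure.
    """
    n, d = X.shape
    w = np.zeros(d)       # current SGD iterate
    w_bar = np.zeros(d)   # running (Polyak-Ruppert) average of the iterates
    for t in range(n):
        x_t, y_t = X[t], y[t]
        # Stochastic gradient of (1/2)(w.x - y)^2 + (lam/2)||w||^2
        grad = (w @ x_t - y_t) * x_t + lam * w
        gamma = lr0 / (1.0 + lr0 * lam * t)  # decreasing step size
        w = w - gamma * grad
        # Incremental mean: w_bar <- w_bar + (w - w_bar) / (t + 1)
        w_bar += (w - w_bar) / (t + 1)
    return w, w_bar

# Toy usage on synthetic data (entirely illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(100_000)
w_last, w_avg = sgd_one_pass(X, y)
```

The averaged iterate w_bar is the quantity the abstract claims is asymptotically efficient after a single pass; in practice it is typically much less noisy than the last iterate w.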



Author information

Correspondence to Léon Bottou.


Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT'2010. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-2604-3_16
