
Dropout vs. batch normalization: an empirical study of their impact to deep learning

Published in: Multimedia Tools and Applications

Abstract

Overfitting and long training time are two fundamental challenges in multilayered neural network learning, and deep learning in particular. Dropout and batch normalization are two well-recognized approaches to tackle these challenges. While both approaches share overlapping design principles, numerous research results have shown that they have unique strengths to improve deep learning. Many tools simplify these two approaches to a simple function call, allowing flexible stacking to form deep learning architectures. Although usage guidelines are available, there is unfortunately no well-defined set of rules or comprehensive study that investigates them with respect to data input, network configuration, learning efficiency, and accuracy. It is not clear when users should consider using dropout and/or batch normalization, and how they should be combined (or used as alternatives) to achieve optimized deep learning outcomes. In this paper we conduct an empirical study to investigate the effect of dropout and batch normalization on training deep learning models. We use multilayered dense neural networks and convolutional neural networks (CNN) as the deep learning models, and mix dropout and batch normalization to design different architectures, subsequently observing their performance in terms of training and test CPU time, number of parameters in the model (as a proxy for model size), and classification accuracy. The interplay between network structures, dropout, and batch normalization allows us to conclude when and how dropout and batch normalization should be considered in deep learning. The empirical study quantified the increase in training time when dropout and batch normalization are used, as well as the increase in prediction time (important for constrained environments, such as smartphones and low-powered IoT devices). It showed that a non-adaptive optimizer (e.g. SGD) can outperform adaptive optimizers, but only at the cost of a significant amount of training time spent on hyperparameter tuning, while an adaptive optimizer (e.g. RMSProp) performs well without much tuning. Finally, it showed that dropout and batch normalization should be used in CNNs only with caution and experimentation (when in doubt and short on time to experiment, use only batch normalization).
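
To make this concrete, the sketch below is a minimal illustration, not the authors' exact configuration (layer sizes, dropout rates, input shape, and optimizer settings are assumptions for illustration only), of how the Keras API referenced in the notes stacks batch normalization and dropout as simple layer calls, and how the parameter count used as a proxy for model size can be read off:

    # Minimal Keras sketch of a CNN mixing batch normalization and dropout.
    # Layer sizes, dropout rates, and the optimizer are illustrative only.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),            # e.g. an MNIST-sized input
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),               # normalize conv activations
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),                      # drop 25% of the units
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])

    # RMSProp is the adaptive optimizer the abstract singles out as performing
    # well without much tuning; SGD could be substituted at the cost of more
    # hyperparameter search.
    model.compile(optimizer=keras.optimizers.RMSprop(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()  # the parameter count serves as the proxy for model size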



Notes

  1. https://github.com/fau-masters-collected-works-cgarbin/cap6619-deep-learning-term-project

  2. Note that the Keras API uses this parameter to specify the fraction of units to remove (the opposite of its meaning in the dropout paper, where it is the probability of retaining a unit). This paper follows the Keras convention, i.e. the rate is the fraction of units to remove (see the first sketch following these notes).

  3. The source code used in the experiments is available on GitHub at https://github.com/fau-masters-collected-works-cgarbin/cap6619-deep-learning-term-project

  4. Note that the dropout network is listed in the top-10 results as “1,024 hidden units”. The number of units is adjusted by the dropout rate, 0.5 in this case, so a dropout network configured with 1,024 units in a layer effectively runs with 2,048 units in that layer (1,024 / 0.5 = 2,048; see the second sketch following these notes).
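
For footnote 2, a minimal sketch assuming the standard keras.layers.Dropout layer:

    from tensorflow.keras import layers

    # In Keras, `rate` is the fraction of input units to DROP during training:
    # rate=0.2 removes about 20% of the units and keeps about 80%, which is the
    # opposite of the retention probability p used in the dropout paper.
    dropout_layer = layers.Dropout(rate=0.2)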
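
For footnote 4, a hypothetical helper mirroring that adjustment (the function name is an assumption, and for dropout rates other than 0.5 the exact scaling rule should be checked against the source code linked above):

    # Scale the listed layer width by the dropout rate so the dropout network
    # keeps comparable capacity: with rate 0.5, a layer listed as 1,024 units
    # effectively runs with 2,048 units.
    def effective_units(listed_units: int, dropout_rate: float) -> int:
        return int(listed_units / dropout_rate)  # 1024 / 0.5 -> 2048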


Acknowledgements

This research is sponsored by the US National Science Foundation (NSF) through Grants IIS-1763452 and CNS-1828181.

Author information

Corresponding author

Correspondence to Xingquan Zhu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Garbin, C., Zhu, X. & Marques, O. Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed Tools Appl 79, 12777–12815 (2020). https://doi.org/10.1007/s11042-019-08453-9

