In this paper we investigate the error criteria that are optimized during the training of artificial neural networks (ANN). We compare the bounds of the squared error (SE) and the cross-entropy (CE) criteria, which are the most popular choices in state-of-the-art implementations. The evaluation is performed on automatic speech recognition (ASR) and handwriting recognition (HWR) tasks using a hybrid HMM-ANN model. We find that with randomly initialized weights, an SE-trained ANN does not converge to a good local optimum. With a good initialization by pre-training, however, the word error rate of our best CE-trained system could be reduced from 30.9% to 30.5% on the ASR task and from 22.7% to 21.9% on the HWR task by performing a few additional "fine-tuning" iterations with the SE criterion.
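For concreteness, here is a minimal sketch of the two criteria as they are commonly defined for frame-wise posterior training in hybrid HMM-ANN systems; the notation below is ours, not quoted from the paper:

F_{SE} = \sum_{n=1}^{N} \sum_{c=1}^{C} \left( y_c(x_n) - \delta(c, c_n) \right)^2,

F_{CE} = - \sum_{n=1}^{N} \log y_{c_n}(x_n),

where y_c(x_n) is the softmax output for class c given frame x_n, c_n is the ground-truth class label of frame n, and \delta is the Kronecker delta, i.e. the SE criterion measures the squared distance between the posterior estimate and a one-hot target vector.

The training schedule the abstract describes (CE training from a good initialization, followed by a few SE fine-tuning iterations) could look roughly like the following hypothetical sketch, written in modern PyTorch rather than the authors' original 2013 implementation; all dimensions, data, and hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES = 10  # placeholder; the paper predicts HMM state labels
    model = nn.Sequential(nn.Linear(40, 256), nn.Sigmoid(),
                          nn.Linear(256, NUM_CLASSES))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # toy frame-level data: 40-dim feature vectors, one label per frame
    x = torch.randn(512, 40)
    labels = torch.randint(0, NUM_CLASSES, (512,))

    # Phase 1: cross-entropy training (cross_entropy applies the softmax
    # internally, so the model outputs raw logits).
    for epoch in range(20):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), labels)
        loss.backward()
        opt.step()

    # Phase 2: a few squared-error fine-tuning iterations on the softmax
    # outputs against one-hot targets, starting from the CE-trained weights.
    one_hot = F.one_hot(labels, NUM_CLASSES).float()
    for epoch in range(3):
        opt.zero_grad()
        loss = F.mse_loss(torch.softmax(model(x), dim=1), one_hot)
        loss.backward()
        opt.step()

The key design point is that phase 2 does not restart the optimization: the SE criterion is applied only as a brief refinement of weights that the CE phase has already brought near a good optimum.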
Cite as: Golik, P., Doetsch, P., Ney, H. (2013) Cross-entropy vs. squared error training: a theoretical and experimental comparison. Proc. Interspeech 2013, 1756-1760, doi: 10.21437/Interspeech.2013-436
@inproceedings{golik13_interspeech,
  author={Pavel Golik and Patrick Doetsch and Hermann Ney},
  title={{Cross-entropy vs. squared error training: a theoretical and experimental comparison}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={1756--1760},
  doi={10.21437/Interspeech.2013-436}
}