In this paper we investigate the error criteria that are optimized during the training of artificial neural networks (ANN). We compare the bounds of the squared error (SE) and the cross-entropy (CE) criteria, which are the most popular choices in state-of-the-art implementations. The evaluation is performed on automatic speech recognition (ASR) and handwriting recognition (HWR) tasks using a hybrid HMM-ANN model. We find that with randomly initialized weights, an SE-trained ANN does not converge to a good local optimum. With a good initialization by pre-training, however, the word error rate of our best CE-trained system could be reduced from 30.9% to 30.5% on the ASR task and from 22.7% to 21.9% on the HWR task by performing a few additional "fine-tuning" iterations with the SE criterion.
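For concreteness, here is a minimal sketch of the two criteria as they are commonly defined for frame-wise posterior training in hybrid HMM-ANN systems; the notation below is ours, not quoted from the paper:

F_{SE} = \sum_{n=1}^{N} \sum_{c=1}^{C} \left( y_c(x_n) - \delta(c, c_n) \right)^2,

F_{CE} = - \sum_{n=1}^{N} \log y_{c_n}(x_n),

where y_c(x_n) is the softmax output for class c given frame x_n, c_n is the ground-truth class label of frame n, and \delta is the Kronecker delta, i.e. the SE criterion measures the squared distance between the posterior estimate and a one-hot target vector.

The training schedule the abstract describes (CE training from a good initialization, followed by a few SE fine-tuning iterations) could look roughly like the following hypothetical sketch, written in modern PyTorch rather than the authors' original 2013 implementation; all dimensions, data, and hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES = 10  # placeholder; the paper predicts HMM state labels
    model = nn.Sequential(nn.Linear(40, 256), nn.Sigmoid(),
                          nn.Linear(256, NUM_CLASSES))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # toy frame-level data: 40-dim feature vectors, one label per frame
    x = torch.randn(512, 40)
    labels = torch.randint(0, NUM_CLASSES, (512,))

    # Phase 1: cross-entropy training (cross_entropy applies the softmax
    # internally, so the model outputs raw logits).
    for epoch in range(20):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), labels)
        loss.backward()
        opt.step()

    # Phase 2: a few squared-error fine-tuning iterations on the softmax
    # outputs against one-hot targets, starting from the CE-trained weights.
    one_hot = F.one_hot(labels, NUM_CLASSES).float()
    for epoch in range(3):
        opt.zero_grad()
        loss = F.mse_loss(torch.softmax(model(x), dim=1), one_hot)
        loss.backward()
        opt.step()

The key design point is that phase 2 does not restart the optimization: the SE criterion is applied only as a brief refinement of weights that the CE phase has already brought near a good optimum.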
Cite as: Golik, P., Doetsch, P., Ney, H. (2013) Cross-entropy vs. squared error training: a theoretical and experimental comparison. Proc. Interspeech 2013, 1756-1760, doi: 10.21437/Interspeech.2013-436
@inproceedings{golik13_interspeech,
  author={Pavel Golik and Patrick Doetsch and Hermann Ney},
  title={{Cross-entropy vs. squared error training: a theoretical and experimental comparison}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={1756--1760},
  doi={10.21437/Interspeech.2013-436}
}