We present a magnitude estimation network that is combined with a modified ResNet x-vector system to generate embeddings whose inner product produces calibrated scores with increased discrimination. A three-step training procedure is used. First, the network is trained on short segments using a multi-class cross-entropy loss with angular margin softmax. In the second step, only a reduced subset of the DNN parameters is refined using full-length recordings. Finally, the magnitude estimation network is trained using a binary cross-entropy loss over pairs of target and non-target trials. The resulting system is evaluated on four widely used benchmarks and provides significant discrimination and calibration gains at multiple operating points.
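The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each embedding is a unit-length x-vector multiplied by a positive scalar magnitude, that a trial score is the inner product of two such embeddings plus a scalar offset (suggested by the paper's title), and that the score is treated as a logit in the binary cross-entropy loss. The function names, the `magnitude` and `offset` parameters, and the example values are all hypothetical.

```python
import math


def unit_norm(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def scaled_embedding(x, magnitude):
    """Unit-length direction times a positive magnitude.

    In the abstract, the magnitude would come from the magnitude
    estimation network; here it is passed in by hand (assumption).
    """
    return [magnitude * v for v in unit_norm(x)]


def trial_score(e1, e2, offset=0.0):
    """Inner product of two scaled embeddings plus a scalar offset (assumed form)."""
    return sum(a * b for a, b in zip(e1, e2)) + offset


def binary_cross_entropy(scores, labels):
    """Mean binary cross-entropy, treating each score as a logit.

    labels: 1 for a target trial, 0 for a non-target trial.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        # Numerically stable softplus(-z), with z = s for targets, -s otherwise.
        z = s if y == 1 else -s
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(scores)


# Usage: two embeddings with cosine similarity 0.96 and magnitudes 2.0 and 1.5.
enroll = scaled_embedding([3.0, 4.0], magnitude=2.0)
test = scaled_embedding([4.0, 3.0], magnitude=1.5)
score = trial_score(enroll, test, offset=-1.0)  # 2.0 * 1.5 * 0.96 - 1.0
loss = binary_cross_entropy([score], [1])
```

Under these assumptions the magnitudes rescale the cosine similarity per trial, so a pair of embeddings with small magnitudes yields a score closer to the offset, which is one way a system could express lower confidence.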
Cite as: Garcia-Romero, D., Sell, G., Mccree, A. (2020) MagNetO: X-vector Magnitude Estimation Network plus Offset for Improved Speaker Recognition. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 1-8, doi: 10.21437/Odyssey.2020-1
@inproceedings{garciaromero20_odyssey,
  author={Daniel Garcia-Romero and Greg Sell and Alan Mccree},
  title={{MagNetO: X-vector Magnitude Estimation Network plus Offset for Improved Speaker Recognition}},
  year=2020,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2020)},
  pages={1--8},
  doi={10.21437/Odyssey.2020-1}
}