Abstract
Besides the well-known classification task, neural networks are nowadays frequently applied to generate or transform data such as images and audio signals. In these tasks, conventional loss functions like the mean squared error (MSE) may not give satisfactory results. One way to improve the perceptual quality of the generated signals is to increase their similarity to real signals, where the similarity is evaluated by a discriminator network. The combination of the generator and discriminator networks is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework on the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of the articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. For the evaluation, we report various objective speech quality metrics, such as the Perceptual Evaluation of Speech Quality (PESQ) and the Mel-Cepstral Distortion (MCD). Our results indicate that the adversarial training loss brings a slight but consistent improvement in all of these metrics.
Acknowledgments
This study was supported by the grant NKFIH-1279-2/2020 of the Ministry for Innovation and Technology, Hungary, and by the Ministry of Innovation and the National Research, Development and Innovation Office through project FK 124584 and within the framework of the Artificial Intelligence National Laboratory Programme. Gábor Gosztolya was supported by the UNKP 20-5 National Excellence Programme of the Ministry of Innovation and Technology, and by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The GPU card used for the computations was donated by the NVIDIA Corporation.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Shandiz, A.H., Tóth, L., Gosztolya, G., Markó, A., Csapó, T.G. (2021). Improving Neural Silent Speech Interface Models by Adversarial Training. In: Hassanien, A.E., et al. Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021. Advances in Intelligent Systems and Computing, vol 1377. Springer, Cham. https://doi.org/10.1007/978-3-030-76346-6_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76345-9
Online ISBN: 978-3-030-76346-6
eBook Packages: Intelligent Technologies and Robotics (R0)