Improving Neural Silent Speech Interface Models by Adversarial Training

Conference paper. In: Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021), AICV 2021.

Abstract

Besides the well-known classification task, neural networks are nowadays frequently applied to generate or transform data such as images and audio signals. In such tasks, conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated by a discriminator network. The combination of the generator and discriminator networks is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework on the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of the articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. For evaluation, we report objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ) and the Mel-Cepstral Distortion (MCD). Our results indicate that the adversarial training loss brings a slight but consistent improvement in all of these metrics.
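The abstract describes training the 3D CNN generator with the conventional MSE loss extended by an adversarial term supplied by a discriminator. Below is a minimal PyTorch sketch of such a combined objective, given purely as an illustration of the idea: the paper does not specify this implementation, and the toy network shapes, optimiser settings and the weight lambda_adv are hypothetical placeholders, not the authors' configuration.

# Minimal, hypothetical sketch of MSE + adversarial training for an
# articulatory-to-acoustic generator. Shapes, layers and weights below are
# illustrative placeholders, not the paper's actual 3D CNN configuration.
import torch
import torch.nn as nn

class Generator(nn.Module):
    # Stand-in for the 3D CNN: maps articulatory features to spectral features.
    def __init__(self, in_dim=128, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Scores a spectral feature vector as real (target) or generated.
    def __init__(self, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))

    def forward(self, y):
        return self.net(y)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()
lambda_adv = 0.01  # hypothetical weight of the adversarial term

def train_step(x, y_real):
    # 1) Discriminator update: separate real targets from generated outputs.
    y_fake = G(x).detach()
    d_loss = (bce(D(y_real), torch.ones(len(y_real), 1)) +
              bce(D(y_fake), torch.zeros(len(y_fake), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator update: conventional MSE loss plus the adversarial component.
    y_hat = G(x)
    g_loss = mse(y_hat, y_real) + lambda_adv * bce(D(y_hat), torch.ones(len(y_hat), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call with random stand-ins for articulatory input and spectral target.
print(train_step(torch.randn(16, 128), torch.randn(16, 80)))

The essential point is the generator objective: the MSE term keeps the generated spectral features close to the target, while the adversarial term, weighted by lambda_adv, rewards outputs that the discriminator accepts as real, which is what extending the MSE loss with an adversarial component means in this setting.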

Acknowledgments

This study was supported by the grant NKFIH-1279-2/2020 of the Ministry for Innovation and Technology, Hungary, and by the Ministry of Innovation and the National Research, Development and Innovation Office through project FK 124584 and within the framework of the Artificial Intelligence National Laboratory Programme. Gábor Gosztolya was supported by the UNKP 20-5 National Excellence Programme of the Ministry of Innovation and Technology, and by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The GPU card used for the computations was donated by the NVIDIA Corporation.

Author information

Corresponding author

Correspondence to Amin Honarmandi Shandiz.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shandiz, A.H., Tóth, L., Gosztolya, G., Markó, A., Csapó, T.G. (2021). Improving Neural Silent Speech Interface Models by Adversarial Training. In: Hassanien, A.E., et al. Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021. Advances in Intelligent Systems and Computing, vol 1377. Springer, Cham. https://doi.org/10.1007/978-3-030-76346-6_39
