Improving Neural Silent Speech Interface Models by Adversarial Training

Conference paper. In: Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021), AICV 2021.

Abstract

Besides the well-known classification task, neural networks are nowadays frequently applied to generate or transform data such as images and audio signals. In such tasks, conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated by a discriminator network. The combination of the generator and discriminator networks is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework on the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of the articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. For evaluation, we report objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ) and the Mel-Cepstral Distortion (MCD). Our results indicate that the adversarial training loss brings a slight but consistent improvement in all of these metrics.
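The abstract describes training the 3D CNN generator with the conventional MSE loss extended by an adversarial term supplied by a discriminator. Below is a minimal PyTorch sketch of such a combined objective, given purely as an illustration of the idea: the paper does not specify this implementation, and the toy network shapes, optimiser settings and the weight lambda_adv are hypothetical placeholders, not the authors' configuration.

# Minimal, hypothetical sketch of MSE + adversarial training for an
# articulatory-to-acoustic generator. Shapes, layers and weights below are
# illustrative placeholders, not the paper's actual 3D CNN configuration.
import torch
import torch.nn as nn

class Generator(nn.Module):
    # Stand-in for the 3D CNN: maps articulatory features to spectral features.
    def __init__(self, in_dim=128, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Scores a spectral feature vector as real (target) or generated.
    def __init__(self, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))

    def forward(self, y):
        return self.net(y)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()
lambda_adv = 0.01  # hypothetical weight of the adversarial term

def train_step(x, y_real):
    # 1) Discriminator update: separate real targets from generated outputs.
    y_fake = G(x).detach()
    d_loss = (bce(D(y_real), torch.ones(len(y_real), 1)) +
              bce(D(y_fake), torch.zeros(len(y_fake), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator update: conventional MSE loss plus the adversarial component.
    y_hat = G(x)
    g_loss = mse(y_hat, y_real) + lambda_adv * bce(D(y_hat), torch.ones(len(y_hat), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call with random stand-ins for articulatory input and spectral target.
print(train_step(torch.randn(16, 128), torch.randn(16, 80)))

The essential point is the generator objective: the MSE term keeps the generated spectral features close to the target, while the adversarial term, weighted by lambda_adv, rewards outputs that the discriminator accepts as real, which is what extending the MSE loss with an adversarial component means in this setting.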

Acknowledgments

This study was supported by the grant NKFIH-1279-2/2020 of the Ministry for Innovation and Technology, Hungary, and by the Ministry of Innovation and the National Research, Development and Innovation Office through project FK 124584 and within the framework of the Artificial Intelligence National Laboratory Programme. Gábor Gosztolya was supported by the UNKP 20-5 National Excellence Programme of the Ministry of Innovation and Technology, and by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The GPU card used for the computations was donated by the NVIDIA Corporation.

Author information

Corresponding author

Correspondence to Amin Honarmandi Shandiz.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shandiz, A.H., Tóth, L., Gosztolya, G., Markó, A., Csapó, T.G. (2021). Improving Neural Silent Speech Interface Models by Adversarial Training. In: Hassanien, A.E., et al. Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021. Advances in Intelligent Systems and Computing, vol 1377. Springer, Cham. https://doi.org/10.1007/978-3-030-76346-6_39
