Abstract
Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. An efficient solution requires methods that do not simply process single images, but can extract the tongue movement information from a sequence of video frames. One option is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which has recently proved very successful in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
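The decomposed spatiotemporal convolution mentioned in the abstract factors a full 3D convolution over (time, height, width) into a 2D spatial convolution applied frame by frame, followed by a 1D temporal convolution applied per pixel. The following is a minimal NumPy sketch of this idea, not the authors' implementation; kernel sizes, the single-channel setup, and 'valid'-style shapes are illustrative assumptions:

```python
import numpy as np

def conv2plus1d(video, spatial_kernel, temporal_kernel):
    """Decomposed spatiotemporal ("(2+1)D") convolution sketch:
    a 2D spatial pass over each frame, then a 1D temporal pass over
    each pixel's time series.  All passes use 'valid' output shapes."""
    T, H, W = video.shape
    kh, kw = spatial_kernel.shape
    kt = temporal_kernel.shape[0]
    # 1) spatial pass: correlate every frame with the 2D kernel
    sh, sw = H - kh + 1, W - kw + 1
    spatial = np.empty((T, sh, sw))
    for t in range(T):
        for i in range(sh):
            for j in range(sw):
                spatial[t, i, j] = np.sum(video[t, i:i + kh, j:j + kw]
                                          * spatial_kernel)
    # 2) temporal pass: correlate each pixel's time series with the 1D kernel
    out = np.empty((T - kt + 1, sh, sw))
    for t in range(T - kt + 1):
        out[t] = np.tensordot(temporal_kernel, spatial[t:t + kt], axes=(0, 0))
    return out

# toy example: 5 frames of 8x8 single-channel "ultrasound" images
video = np.random.rand(5, 8, 8)
out = conv2plus1d(video, np.ones((3, 3)) / 9.0, np.array([0.25, 0.5, 0.25]))
print(out.shape)  # → (3, 6, 6)
```

Compared with a full T×k×k kernel, the decomposition uses fewer parameters and inserts an extra nonlinearity between the spatial and temporal stages in practical network designs, which is part of why it works well for action recognition.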
Acknowledgements
This study was supported by the National Research, Development and Innovation Office of Hungary through project FK 124584, by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008), and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology. László Tóth was supported by the UNKP 19-4 National Excellence Programme of the Ministry of Innovation and Technology, and by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The GPU card used for the computations was donated by the NVIDIA Corporation. We thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tóth, L., Shandiz, A.H. (2020). 3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2020. Lecture Notes in Computer Science, vol. 12415. Springer, Cham. https://doi.org/10.1007/978-3-030-61401-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61400-3
Online ISBN: 978-3-030-61401-0