Mobile Device-Based Speech Enhancement System Using Lip-Reading

  • Conference paper
  • In: Distributed Computing and Artificial Intelligence, 17th International Conference (DCAI 2020)

Abstract

We propose a lip-reading based speech enhancement method for laryngectomees that improves their communication in an inconspicuous way. First, we developed a simple lip-reading mobile phone application for Japanese using YOLOv3-Tiny, which can recognize Japanese vowel sequences. Four laryngectomees tested the application, and we confirmed that the system design concept is in line with user needs. Second, we developed a user-dependent lip-reading algorithm that requires only a very small training data set. A viseme is a group of phonemes with identical appearance on the lips. Each of 36 viseme images was converted into a very compact representation using a VAE (Variational Autoencoder), and the training data for the word recognition model was generated from these representations. This VAE-based viseme sequence representation allows the system to adapt to new users with a very small amount of training data. A word recognition experiment using the VAE encoder and a CNN was performed with 20 Japanese words. The result showed 65% recognition accuracy, and 100% when the first and second candidates are counted. Considering both usability and small-vocabulary recognition accuracy, lip-reading based speech enhancement appears well suited to embedding in mobile devices.
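The abstract outlines a two-stage recognition pipeline: a VAE encoder compresses each viseme image into a compact latent vector, and a CNN classifies the resulting latent-vector sequence as one of 20 Japanese words. The PyTorch sketch below illustrates only that overall structure; the layer sizes, the 64x64 grayscale input, the 8-dimensional latent space, and the 10-frame sequence length are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class VisemeVAEEncoder(nn.Module):
    """Encoder half of a VAE: one viseme image -> small latent vector."""
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(32 * 16 * 16, latent_dim)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)  # mean and log-variance

class WordCNN(nn.Module):
    """1-D CNN over a sequence of viseme latents -> one of n_words."""
    def __init__(self, latent_dim: int = 8, n_words: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed size
            nn.Flatten(),
            nn.Linear(32, n_words),
        )

    def forward(self, z_seq):  # z_seq: (batch, seq_len, latent_dim)
        return self.net(z_seq.transpose(1, 2))

# Toy usage: encode 10 viseme frames, then classify the sequence.
encoder, classifier = VisemeVAEEncoder(), WordCNN()
frames = torch.randn(1, 10, 1, 64, 64)       # (batch, seq, C, H, W)
mu, _ = encoder(frames.flatten(0, 1))        # encode each frame
logits = classifier(mu.view(1, 10, -1))      # (1, 20) word scores
top2 = logits.topk(2).indices                # 1st and 2nd candidates
```

The `topk(2)` at the end mirrors the evaluation described in the abstract, where a word counts as recognized if it appears among the first two candidates.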



Acknowledgment

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (C), Grant Number 19K012905.

Author information


Corresponding author

Correspondence to Kenji Matsui.



Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nakahara, T. et al. (2021). Mobile Device-Based Speech Enhancement System Using Lip-Reading. In: Dong, Y., Herrera-Viedma, E., Matsui, K., Omatsu, S., González Briones, A., Rodríguez González, S. (eds) Distributed Computing and Artificial Intelligence, 17th International Conference. DCAI 2020. Advances in Intelligent Systems and Computing, vol 1237. Springer, Cham. https://doi.org/10.1007/978-3-030-53036-5_17

