Mobile Device-Based Speech Enhancement System Using Lip-Reading

  • Conference paper
  • In: Distributed Computing and Artificial Intelligence, 17th International Conference (DCAI 2020)

Abstract

We propose a lip-reading based speech enhancement method for laryngectomees that improves their communication in an inconspicuous way. First, we developed a simple lip-reading mobile phone application for Japanese using YOLOv3-Tiny, which can recognize Japanese vowel sequences. Four laryngectomees tested the application, and we confirmed that the system design concept is in line with user needs. Second, we developed a user-dependent lip-reading algorithm that requires only a very small training data set. A viseme is a group of phonemes with identical appearance on the lips. Each of 36 viseme images was converted into a very compact representation using a VAE (Variational Autoencoder), and the training data for the word recognition model was generated from these representations. This VAE-based viseme sequence representation allows the system to adapt to new users with a very small amount of training data. A word recognition experiment using the VAE encoder and a CNN was performed with 20 Japanese words. The result showed 65% recognition accuracy, and 100% when the first and second candidates are counted. Considering both usability and small-vocabulary recognition accuracy, lip-reading based speech enhancement appears well suited to embedding in mobile devices.
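The abstract outlines a two-stage recognition pipeline: a VAE encoder compresses each viseme image into a compact latent vector, and a CNN classifies the resulting latent-vector sequence as one of 20 Japanese words. The PyTorch sketch below illustrates only that overall structure; the layer sizes, the 64x64 grayscale input, the 8-dimensional latent space, and the 10-frame sequence length are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class VisemeVAEEncoder(nn.Module):
    """Encoder half of a VAE: one viseme image -> small latent vector."""
    def __init__(self, latent_dim: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(32 * 16 * 16, latent_dim)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)  # mean and log-variance

class WordCNN(nn.Module):
    """1-D CNN over a sequence of viseme latents -> one of n_words."""
    def __init__(self, latent_dim: int = 8, n_words: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed size
            nn.Flatten(),
            nn.Linear(32, n_words),
        )

    def forward(self, z_seq):  # z_seq: (batch, seq_len, latent_dim)
        return self.net(z_seq.transpose(1, 2))

# Toy usage: encode 10 viseme frames, then classify the sequence.
encoder, classifier = VisemeVAEEncoder(), WordCNN()
frames = torch.randn(1, 10, 1, 64, 64)       # (batch, seq, C, H, W)
mu, _ = encoder(frames.flatten(0, 1))        # encode each frame
logits = classifier(mu.view(1, 10, -1))      # (1, 20) word scores
top2 = logits.topk(2).indices                # 1st and 2nd candidates
```

The `topk(2)` at the end mirrors the evaluation described in the abstract, where a word counts as recognized if it appears among the first two candidates.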



Acknowledgment

This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (C), Grant Number 19K012905.

Author information


Corresponding author

Correspondence to Kenji Matsui.



Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nakahara, T. et al. (2021). Mobile Device-Based Speech Enhancement System Using Lip-Reading. In: Dong, Y., Herrera-Viedma, E., Matsui, K., Omatsu, S., González Briones, A., Rodríguez González, S. (eds) Distributed Computing and Artificial Intelligence, 17th International Conference. DCAI 2020. Advances in Intelligent Systems and Computing, vol 1237. Springer, Cham. https://doi.org/10.1007/978-3-030-53036-5_17

