
A multimodel keyword spotting system based on lip movement and speech features

Published in: Multimedia Tools and Applications

Abstract

Spoken keyword recognition and its localization, together known as keyword spotting, are fundamental tasks in speech recognition. In automatic keyword spotting systems, lip-reading (LR) methods play a broader role when audio data is absent or corrupted. Existing works in the literature focus on recognizing a limited number of words or phrases and require a cropped face or lip region, whereas the proposed model requires no cropping of the video frames and is recognition-free. The model employs Convolutional Neural Networks and Long Short-Term Memory networks to improve overall performance. It creates a 128-dimensional subspace to represent the feature vectors for speech signals and the corresponding lip movements (focused viseme sequences), which allows lip reading to be treated as an unconstrained natural speech signal in video sequences. In the experiments, standard datasets such as LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE are used to evaluate the proposed model, and a comparative analysis against current state-of-the-art methods is provided for both the lip-reading and keyword-spotting tasks. The proposed model obtains excellent results on all datasets under consideration.
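The abstract describes mapping speech signals and lip-movement sequences into a shared 128-dimensional subspace, where a spoken keyword can be matched against video segments. The sketch below illustrates only that matching step, not the paper's CNN/LSTM encoders: it assumes the encoders have already produced 128-dimensional embeddings, and uses cosine similarity with a hypothetical threshold to flag matching segments. The function names and threshold value are illustrative assumptions, not the authors' implementation.

```python
import math

EMBED_DIM = 128  # dimensionality of the shared audio-visual subspace


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def spot_keyword(query_embedding, segment_embeddings, threshold=0.8):
    """Return indices of video segments whose lip-movement embedding
    lies close to the spoken-keyword embedding in the shared subspace.

    Assumes all embeddings were produced by pre-trained encoders and
    already live in the same EMBED_DIM-dimensional space."""
    return [
        i
        for i, seg in enumerate(segment_embeddings)
        if cosine_similarity(query_embedding, seg) >= threshold
    ]


if __name__ == "__main__":
    # Toy example: one matching segment, one unrelated segment.
    query = [1.0] + [0.0] * (EMBED_DIM - 1)
    segments = [list(query), [0.0, 1.0] + [0.0] * (EMBED_DIM - 2)]
    print(spot_keyword(query, segments))
```

In practice a learned similarity or a retrieval index would replace the fixed threshold, but the core idea — comparing audio and visual features in one joint embedding space — is what makes the system robust when the audio stream is missing or corrupted.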



Author information


Corresponding author

Correspondence to Anand Handa.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Handa, A., Agarwal, R. & Kohli, N. A multimodel keyword spotting system based on lip movement and speech features. Multimed Tools Appl 79, 20461–20481 (2020). https://doi.org/10.1007/s11042-020-08837-2
