
A multimodel keyword spotting system based on lip movement and speech features

Published in: Multimedia Tools and Applications

Abstract

Spoken keyword recognition and its localization, together known as keyword spotting, are fundamental tasks in speech recognition. In automatic keyword spotting systems, lip-reading (LR) methods play a broader role when audio data is absent or corrupted. Existing works in the literature focus on recognizing a limited number of words or phrases and require a cropped face or lip region, whereas the proposed model requires no cropping of the video frames and is recognition-free. The model employs Convolutional Neural Networks and Long Short-Term Memory networks to improve overall performance. It creates a 128-dimensional subspace to represent the feature vectors for speech signals and the corresponding lip movements (focused viseme sequences), which allows lip reading to be treated as an unconstrained natural speech signal in video sequences. In the experiments, standard datasets such as LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE are used to evaluate the proposed model, and a comparative analysis against current state-of-the-art methods is provided for both the lip-reading and keyword-spotting tasks. The proposed model obtains excellent results on all datasets under consideration.
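The abstract describes mapping speech signals and lip-movement sequences into a shared 128-dimensional subspace, where a spoken keyword can be matched against video segments. The sketch below illustrates only that matching step, not the paper's CNN/LSTM encoders: it assumes the encoders have already produced 128-dimensional embeddings, and uses cosine similarity with a hypothetical threshold to flag matching segments. The function names and threshold value are illustrative assumptions, not the authors' implementation.

```python
import math

EMBED_DIM = 128  # dimensionality of the shared audio-visual subspace


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def spot_keyword(query_embedding, segment_embeddings, threshold=0.8):
    """Return indices of video segments whose lip-movement embedding
    lies close to the spoken-keyword embedding in the shared subspace.

    Assumes all embeddings were produced by pre-trained encoders and
    already live in the same EMBED_DIM-dimensional space."""
    return [
        i
        for i, seg in enumerate(segment_embeddings)
        if cosine_similarity(query_embedding, seg) >= threshold
    ]


if __name__ == "__main__":
    # Toy example: one matching segment, one unrelated segment.
    query = [1.0] + [0.0] * (EMBED_DIM - 1)
    segments = [list(query), [0.0, 1.0] + [0.0] * (EMBED_DIM - 2)]
    print(spot_keyword(query, segments))
```

In practice a learned similarity or a retrieval index would replace the fixed threshold, but the core idea — comparing audio and visual features in one joint embedding space — is what makes the system robust when the audio stream is missing or corrupted.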



Author information


Corresponding author

Correspondence to Anand Handa.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Handa, A., Agarwal, R. & Kohli, N. A multimodel keyword spotting system based on lip movement and speech features. Multimed Tools Appl 79, 20461–20481 (2020). https://doi.org/10.1007/s11042-020-08837-2
