Abstract
Developments in dynamic contour tracking permit sparse representation of the outlines of moving objects. Given the increasing computing power of general-purpose workstations, it is now possible to track human faces and parts of faces in real time without special hardware. This paper describes a real-time lip tracker that uses a Kalman-filter-based dynamic contour to track the outline of the lips. Two alternative lip trackers, one tracking the lips from a profile view and the other from a frontal view, were developed to extract visual speech-recognition features from the lip contour. In both cases, the visual features were incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic-time-warping-based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can supplement acoustic-only speech recognisers, enabling robust recognition of speech in the presence of acoustic noise.
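The Kalman-filter machinery behind such a tracker is standard predict/update filtering. As a rough illustration only (not the authors' implementation, whose state is a full dynamic contour rather than a single point), a constant-velocity Kalman filter tracking one contour control point from noisy position measurements might look like:

```python
import numpy as np

def kalman_step(x, P, z, q=1e-2, r=1e-1):
    """One predict/update cycle of a constant-velocity Kalman filter.

    x : state vector [position, velocity]
    P : 2x2 state covariance
    z : noisy scalar measurement of position (e.g. a lip-edge feature)
    q, r : assumed process / measurement noise levels (illustrative values)
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance

    # Predict: propagate state and covariance through the dynamics.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update: blend prediction with the measurement via the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Track a point moving at ~1 unit per frame.
x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(1, 50):
    x, P = kalman_step(x, P, np.array([float(t)]))
```

After a few frames the filter locks onto the motion; because prediction constrains where the contour is searched for in the next frame, tracking stays cheap enough for real-time use on a general-purpose workstation.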
© 1996 Springer-Verlag Berlin Heidelberg
Kaucic, R., Dalton, B., Blake, A. (1996). Real-time lip tracking for audio-visual speech recognition applications. In: Buxton, B., Cipolla, R. (eds) Computer Vision — ECCV '96. ECCV 1996. Lecture Notes in Computer Science, vol 1065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61123-1_154
DOI: https://doi.org/10.1007/3-540-61123-1_154
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61123-3
Online ISBN: 978-3-540-49950-3