Abstract
Developments in dynamic contour tracking permit sparse representation of the outlines of moving objects. Given the increasing computing power of general-purpose workstations, it is now possible to track human faces and parts of faces in real time without special hardware. This paper describes a real-time lip tracker that uses a Kalman-filter-based dynamic contour to track the outline of the lips. Two alternative lip trackers, one tracking the lips from a profile view and the other from a frontal view, were developed to extract visual speech-recognition features from the lip contour. In both cases, the visual features were incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic-time-warping-based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can supplement acoustic-only speech recognisers, enabling robust recognition of speech in the presence of acoustic noise.
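The Kalman-filter machinery behind such a tracker is standard predict/update filtering. As a rough illustration only (not the authors' implementation, whose state is a full dynamic contour rather than a single point), a constant-velocity Kalman filter tracking one contour control point from noisy position measurements might look like:

```python
import numpy as np

def kalman_step(x, P, z, q=1e-2, r=1e-1):
    """One predict/update cycle of a constant-velocity Kalman filter.

    x : state vector [position, velocity]
    P : 2x2 state covariance
    z : noisy scalar measurement of position (e.g. a lip-edge feature)
    q, r : assumed process / measurement noise levels (illustrative values)
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance

    # Predict: propagate state and covariance through the dynamics.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update: blend prediction with the measurement via the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Track a point moving at ~1 unit per frame.
x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(1, 50):
    x, P = kalman_step(x, P, np.array([float(t)]))
```

After a few frames the filter locks onto the motion; because prediction constrains where the contour is searched for in the next frame, tracking stays cheap enough for real-time use on a general-purpose workstation.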
© 1996 Springer-Verlag Berlin Heidelberg
Kaucic, R., Dalton, B., Blake, A. (1996). Real-time lip tracking for audio-visual speech recognition applications. In: Buxton, B., Cipolla, R. (eds) Computer Vision — ECCV '96. ECCV 1996. Lecture Notes in Computer Science, vol 1065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61123-1_154
DOI: https://doi.org/10.1007/3-540-61123-1_154
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61123-3
Online ISBN: 978-3-540-49950-3