Article

Speech driven facial animation

Published: 15 November 2001

ABSTRACT

The results reported in this article are part of a larger project aimed at achieving perceptually realistic, speech-driven animations of three-dimensional human faces, including their individualized nuances. We describe the audiovisual system developed for learning the spatio-temporal relationship between speech acoustics and facial animation, covering video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm is then used to predict novel articulatory trajectories from these speech dynamics. The results are very promising and suggest a new way to model the synthetic lip motion of a given speaker driven by his or her speech; they also provide clues toward more general cross-speaker realistic animation.
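The nearest-neighbor prediction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors stand in for the paper's perceptual spectral-envelope features, the FAP (facial animation parameter) values are toy numbers, and all names are illustrative. Each frame of a novel utterance simply inherits the facial animation parameters of its closest training frame in feature space.

```python
import numpy as np

def predict_fap_trajectory(train_feats, train_faps, query_feats):
    """Nearest-neighbor trajectory prediction (illustrative sketch).

    For each query speech frame, find the closest training frame in
    feature space (Euclidean distance) and copy its facial animation
    parameters (FAPs). Stacking the per-frame results yields a
    predicted FAP trajectory for the whole utterance.
    """
    preds = []
    for q in query_feats:
        dists = np.linalg.norm(train_feats - q, axis=1)  # distance to every training frame
        preds.append(train_faps[np.argmin(dists)])       # inherit nearest frame's FAPs
    return np.array(preds)

# Toy example: three training frames with 2-D speech features,
# each paired with a single FAP value (e.g., lip opening).
train_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_faps  = np.array([[0.1], [0.8], [0.5]])
query       = np.array([[0.9, 0.1]])
print(predict_fap_trajectory(train_feats, train_faps, query))  # -> [[0.8]]
```

In the paper's setting the lookup runs frame by frame over the perceptually transformed spectral envelope; a real system would index the training frames (e.g., with a k-d tree) rather than scan them linearly.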


Published in

PUI '01: Proceedings of the 2001 workshop on Perceptive user interfaces
November 2001, 241 pages
ISBN: 9781450374736
DOI: 10.1145/971478

Copyright © 2001 ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
