
Audiovisual Speech Synthesis


Abstract

This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of them. An attempt is made to distinguish between facial synthesis itself (i.e., the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and illustrated by brief descriptions of the most representative existing systems. The challenging issues that may drive future models, namely evaluation, data acquisition, and modeling, are also discussed and illustrated by our current work at ICP.
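
To make the phonetic-control side of this distinction concrete, the sketch below blends per-viseme articulatory targets into one smooth control trajectory using negative-exponential dominance functions, in the spirit of the dominance-based coarticulation scheme popularized by Cohen and Massaro. It is a minimal illustration under invented assumptions, not the authors' system: the lip-opening targets, viseme timing, and rate constants are all hypothetical values chosen for the example.

```python
import numpy as np

# Illustrative sketch of viseme-driven control for a model-based talking
# face: each viseme contributes a target value for one articulatory
# parameter, weighted by a dominance function peaking at its temporal
# center. All constants below are hypothetical, not from the paper.

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Dominance of one viseme at time t (peaks at its temporal center)."""
    return magnitude * np.exp(-rate * np.abs(t - center))

def lip_opening(t, visemes):
    """Blend per-viseme lip-opening targets into one smooth trajectory."""
    weights = np.array([dominance(t, center) for (center, _) in visemes])
    targets = np.array([target for (_, target) in visemes])
    return np.sum(weights * targets) / np.sum(weights)

# Hypothetical phonetic input: (temporal center in seconds, lip-opening target).
visemes = [(0.10, 0.2),   # bilabial closure /m/
           (0.25, 1.0),   # open vowel /a/
           (0.45, 0.4)]   # rounded vowel /u/ (rounding channel omitted)

for t in np.linspace(0.0, 0.55, 12):
    print(f"t = {t:4.2f} s  lip opening = {lip_opening(t, visemes):.2f}")
```

Because every viseme's dominance is nonzero everywhere, neighboring segments pull the trajectory away from each target, which is precisely how this family of models reproduces coarticulation; an image-based system would instead select or morph recorded frames rather than compute such parameter curves.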



Cite this article

Bailly, G., Bérar, M., Elisei, F. et al. Audiovisual Speech Synthesis. International Journal of Speech Technology 6, 331–346 (2003). https://doi.org/10.1023/A:1025700715107
