
Audiovisual Speech Synthesis


Abstract

This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of them. An attempt is made to distinguish between facial synthesis itself (i.e., the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and illustrated by brief descriptions of the most representative existing systems. The challenging issues that may drive future models, namely evaluation, data acquisition, and modeling, are also discussed and illustrated by our current work at ICP.
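
To make the phonetic-control side of this distinction concrete, the sketch below blends per-viseme articulatory targets into one smooth control trajectory using negative-exponential dominance functions, in the spirit of the dominance-based coarticulation scheme popularized by Cohen and Massaro. It is a minimal illustration under invented assumptions, not the authors' system: the lip-opening targets, viseme timing, and rate constants are all hypothetical values chosen for the example.

```python
import numpy as np

# Illustrative sketch of viseme-driven control for a model-based talking
# face: each viseme contributes a target value for one articulatory
# parameter, weighted by a dominance function peaking at its temporal
# center. All constants below are hypothetical, not from the paper.

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Dominance of one viseme at time t (peaks at its temporal center)."""
    return magnitude * np.exp(-rate * np.abs(t - center))

def lip_opening(t, visemes):
    """Blend per-viseme lip-opening targets into one smooth trajectory."""
    weights = np.array([dominance(t, center) for (center, _) in visemes])
    targets = np.array([target for (_, target) in visemes])
    return np.sum(weights * targets) / np.sum(weights)

# Hypothetical phonetic input: (temporal center in seconds, lip-opening target).
visemes = [(0.10, 0.2),   # bilabial closure /m/
           (0.25, 1.0),   # open vowel /a/
           (0.45, 0.4)]   # rounded vowel /u/ (rounding channel omitted)

for t in np.linspace(0.0, 0.55, 12):
    print(f"t = {t:4.2f} s  lip opening = {lip_opening(t, visemes):.2f}")
```

Because every viseme's dominance is nonzero everywhere, neighboring segments pull the trajectory away from each target, which is precisely how this family of models reproduces coarticulation; an image-based system would instead select or morph recorded frames rather than compute such parameter curves.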



Cite this article

Bailly, G., Bérar, M., Elisei, F. et al. Audiovisual Speech Synthesis. International Journal of Speech Technology 6, 331–346 (2003). https://doi.org/10.1023/A:1025700715107
