Research Article · Open Access · Best Paper
https://doi.org/10.1145/3429341.3429356

Neural Face Models for Example-Based Visual Speech Synthesis

Published: 08 December 2020

ABSTRACT

Creating realistic animations of human faces with computer graphics models remains a challenging task. It is typically solved either with tedious manual work or with motion-capture-based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. The data is split into short motion samples that can be looped or concatenated to create novel motion sequences. The obvious advantages of this approach are its simplicity of use and its high realism, since the data exhibits only real deformations. Rather than tuning the weights of a complex face rig, animation is performed on a higher level by arranging typical motion samples so that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements and the creation of artefact-free, realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animation with the advantages of neural face models. Our neural face model is capable of synthesising high-quality 3D face geometry and texture from a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. From the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.
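To make the concatenation idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how motion samples could be joined in the latent space of a neural face model: each frame is encoded to a compact latent vector, the tail of one sample is cross-faded into the head of the next, and the blended latent trajectory is decoded back to face geometry and texture. The encoder/decoder stubs, latent dimensionality and blend length below are assumptions for illustration only.

```python
# Minimal sketch: latent-space concatenation of two motion samples with a
# short cross-fade. The encoder/decoder are placeholders for a pretrained
# neural face model (geometry + texture <-> compact latent vector); all
# names and sizes here are illustrative assumptions, not the paper's code.

import numpy as np

LATENT_DIM = 128      # assumed size of the compact latent parameter vector per frame
BLEND_FRAMES = 10     # assumed length of the cross-faded transition

def encode(frame):
    """Placeholder encoder: maps a captured frame to a latent vector."""
    rng = np.random.default_rng(abs(hash(frame)) % (2 ** 32))
    return rng.standard_normal(LATENT_DIM)

def decode(z):
    """Placeholder decoder: a real model would return mesh vertices and a texture map."""
    return {"latent": z}

def concatenate_samples(sample_a, sample_b):
    """Concatenate two motion samples by cross-fading their latent trajectories."""
    za = np.stack([encode(f) for f in sample_a])   # (Ta, LATENT_DIM)
    zb = np.stack([encode(f) for f in sample_b])   # (Tb, LATENT_DIM)

    # Linearly blend the tail of sample A into the head of sample B so the
    # decoded faces transition smoothly instead of popping at the cut.
    w = np.linspace(0.0, 1.0, BLEND_FRAMES)[:, None]
    blend = (1.0 - w) * za[-BLEND_FRAMES:] + w * zb[:BLEND_FRAMES]

    # Total length is Ta + Tb - BLEND_FRAMES, since the overlap is shared.
    z_seq = np.concatenate([za[:-BLEND_FRAMES], blend, zb[BLEND_FRAMES:]], axis=0)
    return [decode(z) for z in z_seq]

if __name__ == "__main__":
    clip_a = [f"a_{i}" for i in range(30)]   # frame identifiers of motion sample A
    clip_b = [f"b_{i}" for i in range(30)]   # frame identifiers of motion sample B
    frames = concatenate_samples(clip_a, clip_b)
    print(f"synthesised {len(frames)} frames")
```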


Published in

CVMP '20: Proceedings of the 17th ACM SIGGRAPH European Conference on Visual Media Production
December 2020, 46 pages
ISBN: 9781450381987
DOI: 10.1145/3429341

Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates: overall acceptance rate 40 of 67 submissions, 60%
