ABSTRACT
Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or with motion-capture-based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are its simplicity of use and its high realism, since the data exhibits only real deformations. Rather than tuning the weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples such that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements and the creation of artefact-free, realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animation with the advantages of neural face models. Our neural face model is capable of synthesising high-quality 3D face geometry and texture from a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. From the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss German Sign Language based on viseme query sequences.
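The core mechanism described above can be sketched in a few lines: a decoder maps a compact latent code to face geometry, and a transition between two concatenated motion samples is produced by interpolating their latent codes rather than raw vertex data. This is a minimal illustrative sketch, not the paper's implementation; the linear `decode` function, the dimensions, and all names are hypothetical stand-ins for the learned neural decoder.

```python
import numpy as np

# Hypothetical stand-in for the neural face model's decoder: it maps a
# compact latent vector to 3D face geometry. A fixed random linear map
# is used here purely for illustration (the real model is a neural net).
rng = np.random.default_rng(0)
LATENT_DIM, N_VERTS = 128, 5000   # assumed sizes, for illustration only

W_geo = rng.standard_normal((N_VERTS * 3, LATENT_DIM)) * 0.01

def decode(z):
    """Map a latent code (LATENT_DIM,) to vertex positions (N_VERTS, 3)."""
    return (W_geo @ z).reshape(N_VERTS, 3)

def blend_transition(z_end, z_start, n_frames):
    """Cross-fade in latent space: interpolate the last code of motion
    sample A towards the first code of motion sample B. Decoding the
    interpolated codes yields plausible in-between faces, which is what
    makes the concatenation seamless."""
    alphas = np.linspace(0.0, 1.0, n_frames)
    return [(1.0 - a) * z_end + a * z_start for a in alphas]

# Concatenate two motion samples with a latent-space transition.
z_a = rng.standard_normal(LATENT_DIM)   # last frame of sample A
z_b = rng.standard_normal(LATENT_DIM)   # first frame of sample B
transition_frames = [decode(z) for z in blend_transition(z_a, z_b, 10)]

# Storing LATENT_DIM floats per frame instead of N_VERTS * 3 also shows
# the memory argument: 15000 / 128 ≈ 117x, i.e. roughly the factor-100
# reduction the abstract mentions.
```

Blending the compact codes, rather than raw vertices or texture pixels, is the key design point: every interpolated code decodes to a face that lies on the model's learned expression manifold, so the transition stays artefact-free.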