Research Article · Open Access · Best Paper
https://doi.org/10.1145/3429341.3429356

Neural Face Models for Example-Based Visual Speech Synthesis

Published: 08 December 2020

ABSTRACT

Creating realistic animations of human faces with computer graphics models remains a challenging task. It is typically solved either with tedious manual work or with motion-capture-based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. The data is split into short motion samples that can be looped or concatenated to create novel motion sequences. The obvious advantages of this approach are its simplicity of use and its high realism, since the data exhibits only real deformations. Rather than tuning the weights of a complex face rig, animation is performed on a higher level by arranging typical motion samples so that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements and the creation of artefact-free, realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animation with the advantages of neural face models. Our neural face model is capable of synthesising high-quality 3D face geometry and texture from a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. From the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.
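To make the concatenation idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how motion samples could be joined in the latent space of a neural face model: each frame is encoded to a compact latent vector, the tail of one sample is cross-faded into the head of the next, and the blended latent trajectory is decoded back to face geometry and texture. The encoder/decoder stubs, latent dimensionality and blend length below are assumptions for illustration only.

```python
# Minimal sketch: latent-space concatenation of two motion samples with a
# short cross-fade. The encoder/decoder are placeholders for a pretrained
# neural face model (geometry + texture <-> compact latent vector); all
# names and sizes here are illustrative assumptions, not the paper's code.

import numpy as np

LATENT_DIM = 128      # assumed size of the compact latent parameter vector per frame
BLEND_FRAMES = 10     # assumed length of the cross-faded transition

def encode(frame):
    """Placeholder encoder: maps a captured frame to a latent vector."""
    rng = np.random.default_rng(abs(hash(frame)) % (2 ** 32))
    return rng.standard_normal(LATENT_DIM)

def decode(z):
    """Placeholder decoder: a real model would return mesh vertices and a texture map."""
    return {"latent": z}

def concatenate_samples(sample_a, sample_b):
    """Concatenate two motion samples by cross-fading their latent trajectories."""
    za = np.stack([encode(f) for f in sample_a])   # (Ta, LATENT_DIM)
    zb = np.stack([encode(f) for f in sample_b])   # (Tb, LATENT_DIM)

    # Linearly blend the tail of sample A into the head of sample B so the
    # decoded faces transition smoothly instead of popping at the cut.
    w = np.linspace(0.0, 1.0, BLEND_FRAMES)[:, None]
    blend = (1.0 - w) * za[-BLEND_FRAMES:] + w * zb[:BLEND_FRAMES]

    # Total length is Ta + Tb - BLEND_FRAMES, since the overlap is shared.
    z_seq = np.concatenate([za[:-BLEND_FRAMES], blend, zb[BLEND_FRAMES:]], axis=0)
    return [decode(z) for z in z_seq]

if __name__ == "__main__":
    clip_a = [f"a_{i}" for i in range(30)]   # frame identifiers of motion sample A
    clip_b = [f"b_{i}" for i in range(30)]   # frame identifiers of motion sample B
    frames = concatenate_samples(clip_a, clip_b)
    print(f"synthesised {len(frames)} frames")
```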


Published in

CVMP '20: Proceedings of the 17th ACM SIGGRAPH European Conference on Visual Media Production
December 2020, 46 pages
ISBN: 9781450381987
DOI: 10.1145/3429341

Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates: overall acceptance rate 40 of 67 submissions, 60%
