
Out of Time: Automated Lip Sync in the Wild

Conference paper in Computer Vision – ACCV 2016 Workshops (ACCV 2016). Part of the book series: Lecture Notes in Computer Science, vol. 10117.

Abstract

The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.

We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
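As a rough illustration of how such a two-stream network might be set up, the sketch below pairs an audio stream over a window of MFCC features with a visual stream over a short stack of grayscale mouth crops, joined by a contrastive loss that pulls genuine audio-video pairs together and pushes temporally shifted (false) pairs apart, so no manual labels are needed. The layer sizes, input shapes, loss margin, and names are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a two-stream audio-visual embedding network (assumed
# layout; layer sizes and input shapes are illustrative, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSyncNet(nn.Module):
    """Embeds an MFCC window and a stack of mouth crops into a joint space."""
    def __init__(self, num_frames=5, embed_dim=256):
        super().__init__()
        # Audio stream: a window of MFCC coefficients as a 1-channel image.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Visual stream: num_frames grayscale mouth crops as input channels.
        self.visual = nn.Sequential(
            nn.Conv2d(num_frames, 64, 5, stride=2, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, mfcc, frames):
        # mfcc: (B, 1, n_coeffs, n_steps); frames: (B, num_frames, H, W)
        return self.audio(mfcc), self.visual(frames)

def contrastive_loss(a, v, y, margin=1.0):
    # y is a float tensor: 1 for genuine (in-sync) pairs, 0 for shifted pairs.
    d = F.pairwise_distance(a, v)
    return torch.mean(y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2))
```

False pairs for training can be generated simply by shifting the audio relative to the video, which is what makes the end-to-end training possible from unlabelled footage.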

We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
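Given per-window embeddings from the two streams, the lip-sync error can be estimated by sweeping the audio features over a range of candidate offsets and taking the offset with the smallest mean embedding distance; a confidently low minimum also signals an active speaker, which is how the same network transfers to the detection task. The helper below is a minimal sketch under those assumptions (estimate_offset and the search range are illustrative, not code from the paper).

```python
import torch.nn.functional as F

def estimate_offset(audio_emb, video_emb, max_shift=15):
    """Return the offset (in video frames) minimising mean embedding distance.

    audio_emb, video_emb: (N, D) tensors of per-window embeddings that are
    time-aligned when the true offset is zero.
    """
    best_offset, best_dist = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        # Slide the audio embeddings past the video embeddings.
        if shift >= 0:
            a, v = audio_emb[shift:], video_emb[:len(video_emb) - shift]
        else:
            a, v = audio_emb[:shift], video_emb[-shift:]
        if len(a) == 0:
            continue
        d = F.pairwise_distance(a, v).mean().item()
        if d < best_dist:
            best_dist, best_offset = d, shift
    return best_offset, best_dist  # a low best_dist suggests an active speaker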

Acknowledgements

We are very grateful to Andrew Senior for suggesting this problem; to Rob Cooper and Matt Haynes at BBC Research for help in obtaining the lip synchronisation dataset; and to Punarjay Chakravarty and Tinne Tuytelaars for supplying the Columbia dataset. Funding for this research is provided by the EPSRC Programme Grant Seebibyte EP/M013774/1.

Author information

Corresponding author

Correspondence to Joon Son Chung.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chung, J.S., Zisserman, A. (2017). Out of Time: Automated Lip Sync in the Wild. In: Chen, C.S., Lu, J., Ma, K.K. (eds.) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science, vol. 10117. Springer, Cham. https://doi.org/10.1007/978-3-319-54427-4_19

  • DOI: https://doi.org/10.1007/978-3-319-54427-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54426-7

  • Online ISBN: 978-3-319-54427-4

  • eBook Packages: Computer Science (R0)
