Abstract
The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
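To make the lip-sync-error idea concrete, here is a toy sketch of the final offset-search step only (a hedged illustration, not the paper's trained network): given per-frame audio and video embeddings (random placeholders below; in the paper these would come from the two ConvNet streams), slide one sequence against the other and pick the temporal shift with the smallest mean distance. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def estimate_av_offset(audio_feats, video_feats, max_offset=10):
    """Estimate the audio-video offset (in frames) by sliding the audio
    embedding sequence against the video embeddings and returning the
    shift with the smallest mean Euclidean distance.

    audio_feats, video_feats: (T, D) arrays of per-frame embeddings.
    """
    best_offset, best_dist = 0, np.inf
    T = min(len(audio_feats), len(video_feats))
    for off in range(-max_offset, max_offset + 1):
        # Overlapping region of the two sequences for this shift.
        a_start, v_start = max(0, -off), max(0, off)
        n = T - abs(off)
        if n <= 0:
            continue
        d = np.linalg.norm(
            audio_feats[a_start:a_start + n] - video_feats[v_start:v_start + n],
            axis=1).mean()
        if d < best_dist:
            best_offset, best_dist = off, d
    return best_offset, best_dist
```

In this sketch the lip-sync error is simply the distance-minimising shift; the paper's contribution is learning embeddings for which this minimum is sharp and reliable.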
Acknowledgements
We are very grateful to Andrew Senior for suggesting this problem; to Rob Cooper and Matt Haynes at BBC Research for help in obtaining the lip synchronisation dataset; and to Punarjay Chakravarty and Tinne Tuytelaars for supplying the Columbia dataset. Funding for this research is provided by the EPSRC Programme Grant Seebibyte EP/M013774/1.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Chung, J.S., Zisserman, A. (2017). Out of Time: Automated Lip Sync in the Wild. In: Chen, C.S., Lu, J., Ma, K.K. (eds) Computer Vision – ACCV 2016 Workshops. Lecture Notes in Computer Science, vol. 10117. Springer, Cham. https://doi.org/10.1007/978-3-319-54427-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54426-7
Online ISBN: 978-3-319-54427-4