SVTS: Scalable Video-to-Speech Synthesis

Mira, Rodrigo; Haliassos, Alexandros; Petridis, Stavros; Schuller, Björn W.; Pantic, Maja

arXiv:2205.02058 (cs)

[Submitted on 4 May 2022 (v1), last revised 15 Aug 2022 (this version, v2)]

Abstract: Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results on GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
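
The two-component design described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_predictor` and `toy_vocoder` are hypothetical stand-ins for the video-to-spectrogram predictor and the pre-trained neural vocoder, and the constants `N_MELS`, `MEL_PER_VIDEO`, and `HOP` are assumed values chosen only to make the shapes concrete. The sketch shows only how the two stages compose.

```python
from typing import Callable, List

# All sizes below are illustrative assumptions, not values from the paper.
N_MELS = 80         # mel bins per spectrogram frame (assumed)
MEL_PER_VIDEO = 4   # spectrogram frames produced per video frame (assumed)
HOP = 160           # waveform samples produced per spectrogram frame (assumed)

Frame = List[float]       # one flattened video frame
Mel = List[List[float]]   # spectrogram: frames x N_MELS
Wave = List[float]        # mono waveform samples


def toy_predictor(video: List[Frame]) -> Mel:
    """Hypothetical stand-in for the video-to-spectrogram predictor."""
    mel: Mel = []
    for frame in video:
        mean = sum(frame) / len(frame)
        # Each video frame is expanded into MEL_PER_VIDEO spectrogram frames.
        mel.extend([[mean] * N_MELS for _ in range(MEL_PER_VIDEO)])
    return mel


def toy_vocoder(mel: Mel) -> Wave:
    """Hypothetical stand-in for the pre-trained neural vocoder."""
    wave: Wave = []
    for row in mel:
        level = sum(row) / len(row)
        # Each spectrogram frame is expanded into HOP waveform samples.
        wave.extend([level] * HOP)
    return wave


def video_to_speech(
    video: List[Frame],
    predictor: Callable[[List[Frame]], Mel] = toy_predictor,
    vocoder: Callable[[Mel], Wave] = toy_vocoder,
) -> Wave:
    """Compose the two stages: silent video -> mel spectrogram -> waveform."""
    return vocoder(predictor(video))


# 25 dummy video frames, each flattened to 16 pixel values.
video = [[0.1] * 16 for _ in range(25)]
audio = video_to_speech(video)
print(len(toy_predictor(video)), len(audio))
```

Because the vocoder is a separate stage behind a plain function interface, either stand-in can be swapped for a real model without changing `video_to_speech`, which mirrors the separation between spectrogram prediction and waveform generation that the abstract describes.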

Comments: Accepted to INTERSPEECH 2022 (Oral Presentation)
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2205.02058 [cs.SD] (or arXiv:2205.02058v2 [cs.SD] for this version)
DOI: https://doi.org/10.48550/arXiv.2205.02058

Submission history

From: Rodrigo Mira
[v1] Wed, 4 May 2022 13:34:07 UTC (325 KB)
[v2] Mon, 15 Aug 2022 18:38:37 UTC (325 KB)