Text does not fully specify the spoken form, so text-to-speech models
must be able to learn from speech data that vary in ways not explained
by the corresponding text. One way to reduce the amount of unexplained
variation in training data is to provide acoustic information as an
additional learning signal. When generating speech, this acoustic
information can be modified to produce multiple distinct renditions
of a text.
Since much of the unexplained variation is in the prosody, we
propose a model that generates speech explicitly conditioned on the
three primary acoustic correlates of prosody: F0, energy
and duration. The model is flexible about how the values of these features
are specified: they can be provided externally, predicted from text,
or predicted and then modified.
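
As a rough illustration of this conditioning interface, the sketch below (our own, in PyTorch; the paper does not publish code, so the class, function, and dimension names here are hypothetical) shows per-phone F0, energy, and duration values concatenated onto text encodings, obtained in the three ways just described.

    # Hypothetical sketch, not the authors' implementation.
    import torch
    import torch.nn as nn

    class ProsodyPredictor(nn.Module):
        """Predicts per-phone F0, energy, and duration from text encodings."""
        def __init__(self, d_text: int):
            super().__init__()
            self.proj = nn.Linear(d_text, 3)  # [F0, energy, duration] per phone

        def forward(self, text_enc: torch.Tensor) -> torch.Tensor:
            # text_enc: (batch, n_phones, d_text) -> (batch, n_phones, 3)
            return self.proj(text_enc)

    def condition(text_enc, prosody):
        # Concatenate explicit acoustic features onto the text encoding,
        # so the decoder receives them as an additional input.
        return torch.cat([text_enc, prosody], dim=-1)

    text_enc = torch.randn(1, 12, 256)    # 12 phones, 256-dim encodings (assumed sizes)
    predictor = ProsodyPredictor(d_text=256)

    # 1) externally provided (e.g. taken from a reference utterance)
    external = torch.randn(1, 12, 3)
    cond_a = condition(text_enc, external)

    # 2) predicted from text
    predicted = predictor(text_enc)
    cond_b = condition(text_enc, predicted)

    # 3) predicted, then modified (e.g. raise F0 on one phone)
    modified = predicted.clone()
    modified[0, 5, 0] += 1.0  # shift F0 of phone 5 in normalised units
    cond_c = condition(text_enc, modified)

Because each phone carries its own three values, edits under this scheme are temporally precise: changing one phone's F0 leaves the rest of the utterance untouched, which is consistent with the per-feature, per-timestep control the model is designed to provide.
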
Compared to a model
that employs a variational auto-encoder to learn unsupervised latent
features, our model provides more interpretable, temporally-precise,
and disentangled control. When the acoustic features are automatically
predicted from text, the model generates speech that is more natural than
that from a Tacotron 2 model with a reference encoder. Subsequent
human-in-the-loop modification of the predicted acoustic features can
significantly increase naturalness further.
Cite as: Mohan, D.S.R., Hu, V., Teh, T.H., Torresquintero, A., Wallis, C.G.R., Staib, M., Foglianti, L., Gao, J., King, S. (2021) Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis. Proc. Interspeech 2021, 3875-3879, doi: 10.21437/Interspeech.2021-1583
@inproceedings{mohan21_interspeech,
  author={Devang S. Ram Mohan and Vivian Hu and Tian Huey Teh and Alexandra Torresquintero and Christopher G.R. Wallis and Marlene Staib and Lorenzo Foglianti and Jiameng Gao and Simon King},
  title={{Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3875--3879},
  doi={10.21437/Interspeech.2021-1583}
}