Human and Transformer-Based Prosodic Phrasing in Two Speech Genres

Volín, Jan; Řezáčková, Markéta; Matoušek, Jindřich

doi:10.1007/978-3-030-87802-3_68

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

The chief objective of the study was to observe the phrasing behaviour of transformer-based neural networks from a linguistic point of view. The transformer-based model learned prosodic phrasing from isolated sentences read out on request, but was then asked to predict prosodic phrases in continuous journalistic texts taken from radio news bulletins. The transfer was largely successful: most of the prosodic phrase boundaries in the actual newsreading (established by expert auditory analysis) were correctly suggested by the machine. This result is not unexpected, as both genres belong to a clearly enunciated, informative speaking style. The outcome partially rehabilitates so-called laboratory speech, which is sometimes branded as ecologically invalid. Follow-up analyses revealed that the differences between human phrasing in news bulletins and the partition suggested by the machine can be classified into meaningful linguistic categories based on syntactic structure or semantic content, and as such they can inform further research design.
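As an illustration of how machine-suggested phrase boundaries might be compared with expert-annotated ones, the following minimal sketch scores a predicted boundary sequence against a reference. It is not the authors' evaluation procedure: the function name `boundary_scores`, the encoding (one 0/1 label per word juncture, 1 meaning a phrase boundary) and the toy data are all assumptions made for this example.

```python
from typing import Dict, Sequence


def boundary_scores(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compare two 0/1 boundary sequences of equal length.

    Hypothetical helper: each position is one word juncture,
    1 = prosodic phrase boundary, 0 = no boundary.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    hits = sum(1 for r, p in pairs if r == 1 and p == 1)          # boundary in both
    misses = sum(1 for r, p in pairs if r == 1 and p == 0)        # human only
    false_alarms = sum(1 for r, p in pairs if r == 0 and p == 1)  # machine only

    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "hits": hits,
        "misses": misses,
        "false_alarms": false_alarms,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy data (invented): eight word junctures in one utterance.
    human = [0, 0, 1, 0, 1, 0, 0, 1]
    machine = [0, 0, 1, 0, 0, 0, 1, 1]
    print(boundary_scores(human, machine))
```

The misses and false alarms are kept as separate counts because they correspond to the two kinds of human-machine disagreement that a follow-up linguistic classification would examine.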

Acknowledgments

This work was funded by the Czech Science Foundation (GA ČR), project GA21-14758S, and by a grant of the University of West Bohemia, project No. SGS-2019-027. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) within Projects of Large Research, Development and Innovations Infrastructures.

Author information

Authors and Affiliations

Institute of Phonetics, Charles University, Prague, Czech Republic
Jan Volín
New Technologies for the Information Society, Pilsen, Czech Republic
Markéta Řezáčková & Jindřich Matoušek
Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Jindřich Matoušek

Corresponding author

Correspondence to Jan Volín.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

Cite this paper

Volín, J., Řezáčková, M., Matoušek, J. (2021). Human and Transformer-Based Prosodic Phrasing in Two Speech Genres. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_68

DOI: https://doi.org/10.1007/978-3-030-87802-3_68
Published: 22 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science (R0)
