Abstract
The chief objective of the study was to observe the phrasing behaviour of transformer-based neural networks from a linguistic point of view. The transformer-based architecture mapped prosodic phrasing in isolated sentences read out on request, but was then asked to predict prosodic phrases in continuous texts of journalistic style taken from radio news bulletins. The transfer was largely successful: most of the prosodic phrase boundaries in the actual newsreading (established by expert auditory analysis) were correctly suggested by the machine. This result is not unexpected, as both genres belong to a clearly enunciated, informative speaking style. The outcome partially rehabilitates so-called laboratory speech, which is sometimes branded as ecologically invalid. Follow-up analyses revealed that the differences between human phrasing in news bulletins and the partition suggested by the machine can be classified into meaningful linguistic categories based on syntactic structure or semantic content, and as such they can inform further research design.
Acknowledgments
This work was funded by the Czech Science Foundation (GA–CR), project GA21-14758S, and by a grant of the University of West Bohemia, project No. SGS-2019-027. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) within Projects of Large Research, Development and Innovations Infrastructures.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Volín, J., Řezáčková, M., Matoušek, J. (2021). Human and Transformer-Based Prosodic Phrasing in Two Speech Genres. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_68
DOI: https://doi.org/10.1007/978-3-030-87802-3_68
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science, Computer Science (R0)