Human and Transformer-Based Prosodic Phrasing in Two Speech Genres

  • Conference paper
Speech and Computer (SPECOM 2021)

Abstract

The chief objective of the study was to observe the phrasing behaviour of transformer-based neural networks from a linguistic point of view. The transformer-based architecture learned prosodic phrasing from isolated sentences read out on request, but was then asked to predict prosodic phrases in continuous journalistic texts taken from radio news bulletins. The transfer was quite successful in that most of the prosodic phrase boundaries in the actual newsreading (established by expert auditory analysis) were correctly suggested by the machine. This result is not unexpected, as both genres belong to a clearly enunciated, informative speaking style. The outcome partially rehabilitates so-called laboratory speech, which is sometimes branded as ecologically invalid. The follow-up analyses revealed that the differences between human phrasing in news bulletins and the partition suggested by the machine can be classified into meaningful linguistic categories based on syntactic structure or semantic content, and as such, they can inform further research design.

Acknowledgments

This work was funded by the Czech Science Foundation (GA–CR), project GA21-14758S, and by a grant of the University of West Bohemia, project No. SGS-2019-027. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA LM2018140) within Projects of Large Research, Development and Innovations Infrastructures.

Author information

Corresponding author

Correspondence to Jan Volín.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Volín, J., Řezáčková, M., Matoušek, J. (2021). Human and Transformer-Based Prosodic Phrasing in Two Speech Genres. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_68

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_68

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science (R0)
