The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching

International Journal of Speech Technology

Abstract

This paper introduces the German text-to-speech synthesis system MARY. The system's main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.

Cite this article

Schröder, M., Trouvain, J. The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching. International Journal of Speech Technology 6, 365–377 (2003). https://doi.org/10.1023/A:1025708916924

  • DOI: https://doi.org/10.1023/A:1025708916924