Abstract
This paper introduces the German text-to-speech synthesis system MARY. The system's main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.
Cite this article
Schröder, M., Trouvain, J. The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching. International Journal of Speech Technology 6, 365–377 (2003). https://doi.org/10.1023/A:1025708916924
DOI: https://doi.org/10.1023/A:1025708916924