PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback

ABSTRACT
Second language (L2) English learners often find it difficult to improve their pronunciation due to the lack of expressive and personalized corrective feedback. In this paper, we present Pronunciation Teacher (PTeacher), a Computer-Aided Pronunciation Training (CAPT) system that provides personalized, exaggerated audio-visual corrective feedback for mispronunciations. Although the effectiveness of exaggerated feedback has been demonstrated, it remains unclear how to determine the appropriate degree of exaggeration when interacting with individual learners. To fill this gap, we interviewed 100 L2 English learners and 22 professional native teachers to understand their needs and experiences. We propose three critical metrics for both learners and teachers to identify the best exaggeration levels in both the audio and visual modalities. In addition, we incorporate a dynamic feedback mechanism personalized to each learner's English proficiency. Based on these insights, we design a comprehensive interactive pronunciation training course that helps L2 learners rectify mispronunciations in a more perceptible, understandable, and discriminative manner. Extensive user studies demonstrate that our system significantly improves learners' learning efficiency.
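The proficiency-adaptive feedback described above can be illustrated with a minimal sketch. Note that this is not the paper's actual mechanism: the function name, the proficiency scale, and the exaggeration bounds are all illustrative assumptions, showing only the general idea that lower-proficiency learners receive stronger exaggeration.

```python
# Hypothetical sketch (not from the paper): one way a CAPT system could map a
# learner's proficiency score to an exaggeration degree for corrective feedback.
# All names, scales, and numeric bounds here are illustrative assumptions.

def exaggeration_degree(proficiency: float) -> float:
    """Map a proficiency score in [0, 1] to an exaggeration factor.

    Lower-proficiency learners receive stronger exaggeration so that the
    mispronounced feature is more perceptible; advanced learners receive
    feedback closer to natural speech.
    """
    if not 0.0 <= proficiency <= 1.0:
        raise ValueError("proficiency must lie in [0, 1]")
    max_factor, min_factor = 2.0, 1.0  # illustrative bounds
    # Linear interpolation: proficiency 0 -> 2.0x, proficiency 1 -> 1.0x (no exaggeration)
    return max_factor - (max_factor - min_factor) * proficiency


# Example: a beginner (0.2) gets stronger exaggeration than an advanced learner (0.9).
print(exaggeration_degree(0.2))  # 1.8
print(exaggeration_degree(0.9))  # ~1.1
```

A real system would likely replace the linear mapping with levels calibrated from the learner and teacher studies, and might set audio and visual exaggeration degrees independently per modality.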