DOI: 10.1145/3338533.3366579
Research article

A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading

Published: 10 January 2020

ABSTRACT

Lip reading aims to decode text from the movements of a speaker's mouth. In recent years, lip reading methods have made great progress for English at both the word level and the sentence level. Unlike English, however, Chinese Mandarin is a tonal language that relies on pitch to distinguish lexical or grammatical meaning, which significantly increases the ambiguity of the lip reading task. In this paper, we propose a Cascade Sequence-to-Sequence Model for Chinese Mandarin (CSSMCM) lip reading, which explicitly models tones when predicting sentences. Tones are modeled from visual information and syntactic structure, and are then used, together with visual information and syntactic structure, to predict the sentence. To evaluate CSSMCM, a dataset called CMLR (Chinese Mandarin Lip Reading) is collected and released, consisting of over 100,000 natural sentences from the China Network Television website. When trained on the CMLR dataset, the proposed CSSMCM surpasses state-of-the-art lip reading frameworks, confirming the effectiveness of explicitly modeling tones for Chinese Mandarin lip reading.
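To make the cascade idea concrete, the sketch below shows one way such a two-stage design could be wired up in PyTorch: a shared visual encoder, a first stage that predicts a tone sequence, and a second stage that decodes characters from the visual encoding fused with the predicted tones. This is a minimal illustration under stated assumptions, not the paper's architecture: all module names and dimensions are invented, the syntactic-structure input mentioned in the abstract is omitted, and outputs are aligned frame-by-frame for brevity, whereas the actual CSSMCM cascades full sequence-to-sequence sub-networks.

```python
# Minimal sketch of a cascaded tone-then-character lip reader.
# Assumptions (not from the paper): precomputed per-frame visual
# features, GRU encoders/decoders, frame-aligned outputs.
import torch
import torch.nn as nn

class CascadeLipReader(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_tones=5, vocab_size=3000):
        super().__init__()
        # Shared encoder over per-frame mouth-region features.
        self.video_encoder = nn.GRU(feat_dim, hidden, batch_first=True,
                                    bidirectional=True)
        # Stage 1: predict a tone per time step (4 Mandarin tones + neutral).
        self.tone_head = nn.Linear(2 * hidden, num_tones)
        # Embed the predicted tones so stage 2 can consume them.
        self.tone_embed = nn.Embedding(num_tones, hidden)
        # Stage 2: decode characters from visual encoding + tone stream.
        self.char_decoder = nn.GRU(2 * hidden + hidden, hidden,
                                   batch_first=True)
        self.char_head = nn.Linear(hidden, vocab_size)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) precomputed visual features
        enc, _ = self.video_encoder(frames)           # (B, T, 2*hidden)
        tone_logits = self.tone_head(enc)             # (B, T, num_tones)
        tones = tone_logits.argmax(dim=-1)            # hard tone predictions
        fused = torch.cat([enc, self.tone_embed(tones)], dim=-1)
        dec, _ = self.char_decoder(fused)             # (B, T, hidden)
        return tone_logits, self.char_head(dec)       # tone and char logits

model = CascadeLipReader()
tone_logits, char_logits = model(torch.randn(2, 75, 512))  # 75-frame clips
```

In the paper's model, decoding is sequence-to-sequence rather than frame-aligned, so the character predictor attends over the visual encoding and the tone predictions instead of concatenating them per frame; the sketch only captures the cascaded information flow from tones to sentence.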


Published in
          MMAsia '19: Proceedings of the 1st ACM International Conference on Multimedia in Asia
          December 2019
          403 pages
ISBN: 9781450368414
DOI: 10.1145/3338533

          Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States




          Acceptance Rates

MMAsia '19 paper acceptance rate: 59 of 204 submissions, 29%. Overall acceptance rate: 59 of 204 submissions, 29%.

