ABSTRACT
Lip reading aims to decode text from the movements of a speaker's mouth. In recent years, lip reading methods have made great progress for English, at both the word level and the sentence level. Unlike English, however, Chinese Mandarin is a tone-based language that relies on pitch to distinguish lexical or grammatical meaning, which significantly increases the ambiguity of the lip reading task. In this paper, we propose a Cascade Sequence-to-Sequence Model for Chinese Mandarin (CSSMCM) lip reading, which explicitly models tones when predicting sentences. Tones are modeled from visual information and syntactic structure, and are then used, together with visual information and syntactic structure, to predict the sentence. To evaluate CSSMCM, a dataset called CMLR (Chinese Mandarin Lip Reading) is collected and released, consisting of over 100,000 natural sentences from the China Network Television website. When trained on the CMLR dataset, the proposed CSSMCM surpasses state-of-the-art lip reading frameworks, which confirms the effectiveness of explicitly modeling tones for Chinese Mandarin lip reading.
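The cascade described above can be sketched as a three-stage pipeline. The sketch below is illustrative only: the function names are hypothetical, the placeholder outputs stand in for real sub-network predictions, and the assumption that the "syntactic structure" signal is a toneless pinyin-like sequence is ours, not stated in the abstract. It shows only the information flow, not the actual neural architecture.

```python
# Hypothetical sketch of the cascade information flow in a CSSMCM-style
# model. Each function stands in for a sequence-to-sequence sub-network;
# the hard-coded return values are placeholders for learned predictions.

def predict_pinyin(video_frames):
    # Stage 1 (assumption): map lip-movement features to a toneless
    # pinyin sequence, serving as the syntactic-structure signal.
    return ["ni", "hao"]  # placeholder output

def predict_tones(video_frames, pinyin):
    # Stage 2: tones are highly ambiguous from lip shape alone, so this
    # stage conditions on BOTH the visual features and the pinyin.
    return [3, 3]  # placeholder: third tone for each syllable

def predict_characters(video_frames, pinyin, tones):
    # Stage 3: the final decoder combines visual features, pinyin, and
    # the predicted tones to emit the Chinese character sequence.
    return "你好"  # placeholder transcription

def cssmcm_cascade(video_frames):
    # The cascade: each stage consumes the outputs of earlier stages
    # alongside the shared visual input.
    pinyin = predict_pinyin(video_frames)
    tones = predict_tones(video_frames, pinyin)
    return predict_characters(video_frames, pinyin, tones)
```

The key design point the sketch captures is that tone prediction is an explicit intermediate step conditioned on more than the video alone, rather than being folded implicitly into a single video-to-characters decoder.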
Index Terms
- A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading