DOI: 10.1145/642611.642692

Article

A low-latency lip-synchronized videoconferencing system

Published: 5 April 2003

ABSTRACT

Audio is presented ahead of video in some videoconferencing systems since audio requires less time to process. Audio could be delayed to synchronize with video to achieve lip synchronization; however, the overall audio latency might then become unacceptable. We built a videoconferencing system to achieve lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 msec and still be perceived as synchronized; (2) audio could lead video by 300 msec and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
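The catch-up idea in the abstract can be sketched in a few lines: rather than delaying audio by a fixed amount, play the first frames of each utterance slightly slower (with a pitch-preserving time-stretch) so the accumulated delay closes the audio lead gradually. The sketch below only plans the per-frame stretch factors; the 20 ms frame size, the 1.25 maximum stretch ratio, and the function name are illustrative assumptions, not parameters reported in the paper.

```python
def schedule_stretch(lead_ms, frame_ms=20.0, stretch=1.25):
    """Plan per-frame time-stretch factors that absorb an initial
    audio lead by playing the first frames of an utterance slower.

    Returns a list of stretch factors, one per audio frame; a factor
    of 1.25 means the frame is played over 125% of its nominal time.
    (Hypothetical sketch; the defaults are illustrative, not the
    paper's values.)
    """
    remaining = lead_ms
    plan = []
    while remaining > 1e-9:
        # Stretching one frame delays all subsequent audio by this much.
        gain = frame_ms * (stretch - 1.0)
        if gain >= remaining:
            # Final frame: stretch just enough to reach sync exactly.
            plan.append(1.0 + remaining / frame_ms)
            remaining = 0.0
        else:
            plan.append(stretch)
            remaining -= gain
    return plan

# A 300 msec lead (the largest lead the studies found tolerable when
# stretched to synchronization quickly) is absorbed over 60 frames,
# i.e. 1.2 s of nominal audio played over 1.5 s.
plan = schedule_stretch(300.0)
```

In a real system the stretch would be applied by a pitch-preserving time-scale modification algorithm (SOLA-style overlap-add, for instance), and the remaining lead would be measured against timestamps on the incoming video frames.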



Published in
CHI '03: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
April 2003, 620 pages
ISBN: 1581136307
DOI: 10.1145/642611
Copyright © 2003 ACM


                Publisher

                Association for Computing Machinery

                New York, NY, United States



                Qualifiers

                • Article

                Acceptance Rates

CHI '03 paper acceptance rate: 75 of 468 submissions, 16%. Overall acceptance rate: 6,199 of 26,314 submissions, 24%.
