ABSTRACT
Audio is presented ahead of video in some videoconferencing systems since audio requires less time to process. Audio could be delayed to synchronize with video to achieve lip synchronization; however, the overall audio latency might then become unacceptable. We built a videoconferencing system to achieve lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 msec and still be perceived as synchronized; (2) audio could lead video by 300 msec and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
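The catch-up idea can be illustrated with simple arithmetic. The sketch below is not the authors' algorithm. It assumes a constant playback rate during the stretch (the hypothetical `playback_rate`, where 0.75 means audio is played at three-quarters speed) and a hypothetical simulation step `step_ms`. Under those assumptions, every millisecond of playback reduces the audio lead by `1 - playback_rate` milliseconds, so a lead of `L` is absorbed after `L / (1 - playback_rate)` milliseconds.

```python
def time_to_sync_ms(initial_lead_ms: float, playback_rate: float) -> float:
    """Wall-clock time needed for stretched audio to fall back in line with video.

    playback_rate is an assumed constant stretch rate (0 < rate < 1);
    it is an illustrative parameter, not a value taken from the paper.
    """
    if not 0.0 < playback_rate < 1.0:
        raise ValueError("playback_rate must be between 0 and 1 (exclusive)")
    if initial_lead_ms < 0:
        raise ValueError("initial_lead_ms must be non-negative")
    return initial_lead_ms / (1.0 - playback_rate)


def catch_up_schedule(initial_lead_ms: float, playback_rate: float,
                      step_ms: float = 20.0) -> list[tuple[float, float]]:
    """Simulate the shrinking audio lead as (elapsed_ms, remaining_lead_ms) pairs."""
    if not 0.0 < playback_rate < 1.0:
        raise ValueError("playback_rate must be between 0 and 1 (exclusive)")
    schedule = []
    elapsed = 0.0
    lead = float(initial_lead_ms)
    while lead > 0.0:
        schedule.append((elapsed, lead))
        # During step_ms of playback, video advances step_ms while the
        # stretched audio advances only playback_rate * step_ms.
        lead = max(0.0, lead - (1.0 - playback_rate) * step_ms)
        elapsed += step_ms
    schedule.append((elapsed, 0.0))
    return schedule


if __name__ == "__main__":
    # A 300 msec lead (the upper figure in the abstract) at an assumed rate of 0.75.
    print(time_to_sync_ms(300, 0.75))
    print(catch_up_schedule(300, 0.75)[:3])
```

A slower playback rate closes the gap sooner but distorts the speech more. The abstract says only that synchronization must be reached "within a short period", so the rate chosen here is purely illustrative.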
Index Terms
- A low-latency lip-synchronized videoconferencing system