DOI: 10.1145/642611.642692

Article

A low-latency lip-synchronized videoconferencing system

Published: 5 April 2003

ABSTRACT

Audio is presented ahead of video in some videoconferencing systems since audio requires less time to process. Audio could be delayed to synchronize with video to achieve lip synchronization; however, the overall audio latency might then become unacceptable. We built a videoconferencing system to achieve lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 msec and still be perceived as synchronized; (2) audio could lead video by 300 msec and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
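The catch-up idea in the abstract can be sketched in a few lines: rather than delaying audio by a fixed amount, play the first frames of each utterance slightly slower (with a pitch-preserving time-stretch) so the accumulated delay closes the audio lead gradually. The sketch below only plans the per-frame stretch factors; the 20 ms frame size, the 1.25 maximum stretch ratio, and the function name are illustrative assumptions, not parameters reported in the paper.

```python
def schedule_stretch(lead_ms, frame_ms=20.0, stretch=1.25):
    """Plan per-frame time-stretch factors that absorb an initial
    audio lead by playing the first frames of an utterance slower.

    Returns a list of stretch factors, one per audio frame; a factor
    of 1.25 means the frame is played over 125% of its nominal time.
    (Hypothetical sketch; the defaults are illustrative, not the
    paper's values.)
    """
    remaining = lead_ms
    plan = []
    while remaining > 1e-9:
        # Stretching one frame delays all subsequent audio by this much.
        gain = frame_ms * (stretch - 1.0)
        if gain >= remaining:
            # Final frame: stretch just enough to reach sync exactly.
            plan.append(1.0 + remaining / frame_ms)
            remaining = 0.0
        else:
            plan.append(stretch)
            remaining -= gain
    return plan

# A 300 msec lead (the largest lead the studies found tolerable when
# stretched to synchronization quickly) is absorbed over 60 frames,
# i.e. 1.2 s of nominal audio played over 1.5 s.
plan = schedule_stretch(300.0)
```

In a real system the stretch would be applied by a pitch-preserving time-scale modification algorithm (SOLA-style overlap-add, for instance), and the remaining lead would be measured against timestamps on the incoming video frames.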



Published in
CHI '03: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
April 2003, 620 pages
ISBN: 1581136307
DOI: 10.1145/642611
Copyright © 2003 ACM


                Publisher

                Association for Computing Machinery

                New York, NY, United States



                Qualifiers

                • Article

                Acceptance Rates

CHI '03 paper acceptance rate: 75 of 468 submissions, 16%. Overall acceptance rate: 6,199 of 26,314 submissions, 24%.
