ABSTRACT
Future homes and offices will feature increasingly dense ecosystems of IoT devices, such as smart lighting, speakers, and domestic appliances. Voice input is a natural candidate for interacting with out-of-reach and often small devices that lack full-sized physical interfaces. At present, however, voice agents generally require wake-words and device names to specify the target of a spoken command (e.g., 'Hey Alexa, kitchen lights to full brightness'). In this research, we explore whether speech alone can serve as a directional communication channel, in much the same way that visual gaze specifies a focus. Instead of a device's microphones simply receiving and processing spoken commands, we suggest they also infer the Direction of Voice (DoV). Our approach innately enables voice commands with addressability (i.e., devices know whether a command was directed at them) in a natural and rapid manner. We quantify the accuracy of our implementation across users, rooms, spoken phrases, and other key factors that affect performance and usability. Taken together, we believe our DoV approach demonstrates the feasibility and promise of making distributed voice interactions much more intuitive and fluid.
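To make the addressability idea concrete, the sketch below shows one way a smart-home hub could use per-device DoV estimates to route a command to the device the talker is facing. This is an illustrative assumption, not the paper's implementation: the `DovEstimate` type, the `facing_score` feature, and the `min_score` threshold are all hypothetical names introduced here for the example.

```python
# Hedged sketch (not the paper's method): routing a spoken command to the
# device a talker appears to be facing, given per-device DoV confidence
# scores. All names and thresholds below are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DovEstimate:
    device: str          # device identifier
    facing_score: float  # 0..1 confidence that the talker faces this device


def resolve_target(estimates: List[DovEstimate],
                   min_score: float = 0.6) -> Optional[str]:
    """Return the device the command most likely addresses, or None if no
    device is confidently being faced (i.e., the command is ignored)."""
    if not estimates:
        return None
    best = max(estimates, key=lambda e: e.facing_score)
    return best.device if best.facing_score >= min_score else None


# Example: the talker faces the kitchen lights while speaking.
estimates = [
    DovEstimate("kitchen_lights", 0.87),
    DovEstimate("living_room_speaker", 0.22),
    DovEstimate("thermostat", 0.10),
]
print(resolve_target(estimates))  # kitchen_lights
```

The threshold lets every device stay silent when no one is clearly facing it, which is the addressability property the abstract describes: a command not directed at any device triggers nothing.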
Index Terms: Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Devices Ecosystems