ABSTRACT
Future homes and offices will feature increasingly dense ecosystems of IoT devices, such as smart lighting, speakers, and domestic appliances. Voice input is a natural candidate for interacting with out-of-reach and often small devices that lack full-sized physical interfaces. At present, however, voice agents generally require wake-words and device names to specify the target of a spoken command (e.g., 'Hey Alexa, kitchen lights to full brightness'). In this research, we explore whether speech alone can serve as a directional communication channel, in much the same way that visual gaze specifies a focus. Instead of a device's microphones simply receiving and processing spoken commands, we suggest they also infer the Direction of Voice (DoV). Our approach innately enables voice commands with addressability (i.e., devices know whether a command was directed at them) in a natural and rapid manner. We quantify the accuracy of our implementation across users, rooms, spoken phrases, and other key factors that affect performance and usability. Taken together, we believe our DoV approach demonstrates the feasibility and promise of making distributed voice interactions much more intuitive and fluid.
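To make the addressability idea concrete, the sketch below shows one way a smart-home hub could use per-device DoV estimates to route a command to the device the talker is facing. This is an illustrative assumption, not the paper's implementation: the `DovEstimate` type, the `facing_score` feature, and the `min_score` threshold are all hypothetical names introduced here for the example.

```python
# Hedged sketch (not the paper's method): routing a spoken command to the
# device a talker appears to be facing, given per-device DoV confidence
# scores. All names and thresholds below are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DovEstimate:
    device: str          # device identifier
    facing_score: float  # 0..1 confidence that the talker faces this device


def resolve_target(estimates: List[DovEstimate],
                   min_score: float = 0.6) -> Optional[str]:
    """Return the device the command most likely addresses, or None if no
    device is confidently being faced (i.e., the command is ignored)."""
    if not estimates:
        return None
    best = max(estimates, key=lambda e: e.facing_score)
    return best.device if best.facing_score >= min_score else None


# Example: the talker faces the kitchen lights while speaking.
estimates = [
    DovEstimate("kitchen_lights", 0.87),
    DovEstimate("living_room_speaker", 0.22),
    DovEstimate("thermostat", 0.10),
]
print(resolve_target(estimates))  # kitchen_lights
```

The threshold lets every device stay silent when no one is clearly facing it, which is the addressability property the abstract describes: a command not directed at any device triggers nothing.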
Index Terms: Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Devices Ecosystems