DOI: 10.1145/3379337.3415588 (UIST '20 Conference Proceedings)
Research Article | Open Access

Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Devices Ecosystems

Published: 20 October 2020

ABSTRACT

Future homes and offices will feature increasingly dense ecosystems of IoT devices, such as smart lighting, speakers, and domestic appliances. Voice input is a natural candidate for interacting with out-of-reach and often small devices that lack full-sized physical interfaces. However, at present, voice agents generally require wake-words and device names in order to specify the target of a spoken command (e.g., 'Hey Alexa, kitchen lights to full brightness'). In this research, we explore whether speech alone can be used as a directional communication channel, in much the same way visual gaze specifies a focus. Instead of a device's microphones simply receiving and processing spoken commands, we suggest they also infer the Direction of Voice (DoV). Our approach innately enables voice commands with addressability (i.e., devices know if a command was directed at them) in a natural and rapid manner. We quantify the accuracy of our implementation across users, rooms, spoken phrases, and other key factors that affect performance and usability. Taken together, we believe our DoV approach demonstrates feasibility and the promise of making distributed voice interactions much more intuitive and fluid.
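The paper's own pipeline is described in the full text; as a rough, hedged illustration of the kind of acoustic signal processing such direction-of-arrival and speaker-orientation work typically builds on, the sketch below computes a GCC-PHAT time-delay estimate between two microphone channels. This is not the authors' implementation; the function, parameters, and toy signals are assumptions chosen only to show the technique.

```python
# Minimal GCC-PHAT sketch (illustrative, not the paper's code): estimate the
# time delay between two microphone channels. Per-pair delays like this are a
# common ingredient in direction-of-arrival / speaker-orientation estimation.
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None, interp=16):
    """Return the estimated delay (seconds) of `sig` relative to `ref`,
    plus the PHAT-weighted cross-correlation."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting: discard magnitude, keep only phase information.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Rearrange so negative lags come first, then locate the peak.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs), cc

# Toy usage: a delayed copy of white noise stands in for a second mic channel.
fs = 16000
x = np.random.randn(fs)                                 # 1 s of noise at mic 0
delay_samples = 12
y = np.concatenate((np.zeros(delay_samples), x))[:fs]   # mic 1, delayed copy
tau, _ = gcc_phat(y, x, fs=fs)
print(f"estimated delay: {tau * 1e3:.2f} ms")           # ~0.75 ms
```

In a multi-microphone device, one would compute such features for every microphone pair (possibly alongside spectral cues) and feed them to a classifier that decides whether the utterance was directed at the device; that classification step is what the abstract refers to as inferring the Direction of Voice, and its concrete form here is an assumption, not the paper's published method.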


Supplemental Material

- ufp7037pv.mp4: Preview video (mp4, 27 MB)
- ufp7037vf.mp4: Video figure (mp4, 167.6 MB)
- 3379337.3415588.mp4: Presentation Video (mp4, 38.3 MB)

