research-article

Low-resource Multi-task Audio Sensing for Mobile and Embedded Devices via Shared Deep Neural Network Representations

Published: 11 September 2017

Abstract

Continuous audio analysis on embedded and mobile devices is an increasingly important application domain. Appliances such as the Amazon Echo, smartphones and watches, and even research prototypes increasingly seek to perform multiple discriminative tasks simultaneously on ambient audio: for example, monitoring background sound classes (e.g., music or conversation), recognizing certain keywords ('Hey Siri' or 'Alexa'), or identifying the user and her emotion from speech. Deep learning algorithms typically provide state-of-the-art performance for such general audio tasks. However, the large computational demands of deep learning models are at odds with the limited processing, energy, and memory resources of mobile, embedded, and IoT devices.

In this paper, we propose and evaluate a novel deep learning modeling and optimization framework that specifically targets this category of embedded audio sensing tasks. Although the supported tasks are simpler than speech recognition, the framework aims to maintain prediction accuracy while minimizing the overall processor resource footprint. The proposed model is grounded in multi-task learning principles to train shared deep layers and, to further lower computation, uses only statistical summaries of audio filter banks as its input layer.
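
To make the two ideas in this paragraph concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the "statistical summaries" are per-band mean, standard deviation, minimum, and maximum over a window of filter-bank frames, and that the shared network is a plain feed-forward stack with ReLU hidden layers and one softmax output head per task. The names `summarize_filter_banks` and `SharedMultiTaskNet`, the layer sizes, and the random (untrained) weights are all hypothetical.

```python
import numpy as np


def summarize_filter_banks(frames):
    """Collapse a (num_frames, num_banks) matrix into one fixed-length vector.

    The choice of statistics (mean, std, min, max) is an assumption made
    for illustration; the abstract only says "statistical summaries".
    """
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ])


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


class SharedMultiTaskNet:
    """Hidden layers shared by all tasks, plus one small output head per task."""

    def __init__(self, input_dim, hidden_dims, task_classes, seed=0):
        rng = np.random.default_rng(seed)
        dims = [input_dim] + list(hidden_dims)
        # Shared trunk: evaluated once per input, whatever the number of tasks.
        self.shared = [
            (rng.normal(0.0, 0.1, (dims[i], dims[i + 1])), np.zeros(dims[i + 1]))
            for i in range(len(hidden_dims))
        ]
        # Task-specific heads: one weight matrix and bias per task.
        self.heads = {
            name: (rng.normal(0.0, 0.1, (dims[-1], n)), np.zeros(n))
            for name, n in task_classes.items()
        }

    def forward(self, x):
        h = x
        for W, b in self.shared:
            h = relu(h @ W + b)
        # Every head reuses the same shared representation h.
        return {name: softmax(h @ W + b) for name, (W, b) in self.heads.items()}
```

A usage sketch under the same assumptions: `summarize_filter_banks(frames)` on a window of 100 frames with 24 bands yields a 96-dimensional vector, and `SharedMultiTaskNet(96, [32, 32], {"sound_class": 3, "keyword": 2}).forward(x)` returns one probability vector per task from a single pass through the shared layers.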

We find that, for embedded audio sensing tasks, our framework maintains accuracies similar to those observed in comparable deep architectures that use single-task learning and typically more complex input layers. Most importantly, on average, this approach provides almost a 2.1× reduction in runtime, energy, and memory for four separate audio sensing tasks, assuming a variety of task combinations.
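
One intuition for why sharing layers can reduce memory is a simple parameter count: separate single-task networks each store their own hidden layers, whereas a shared network stores them once. The sketch below only illustrates that arithmetic for fully connected layers. It does not reproduce the 2.1× figure reported above, which is an empirical result covering runtime, energy, and memory; the function names and any layer sizes passed in are hypothetical.

```python
def dense_params(layer_dims):
    """Weights plus biases of a fully connected stack with the given widths."""
    return sum(i * o + o for i, o in zip(layer_dims[:-1], layer_dims[1:]))


def single_task_total(input_dim, hidden_dims, task_classes):
    """Total parameters when every task has its own complete network."""
    return sum(
        dense_params([input_dim, *hidden_dims, n])
        for n in task_classes.values()
    )


def shared_total(input_dim, hidden_dims, task_classes):
    """Total parameters with one shared trunk and one output head per task."""
    trunk = dense_params([input_dim, *hidden_dims])
    heads = sum(hidden_dims[-1] * n + n for n in task_classes.values())
    return trunk + heads
```

For a toy configuration with 4 inputs, one hidden layer of 3 units, and two tasks with 2 classes each, the separate networks hold 46 parameters in total and the shared network holds 31. The saving grows with the number of tasks and with the size of the hidden layers relative to the output heads.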

      • Published in

        Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Issue 3
        September 2017
        2023 pages
        EISSN: 2474-9567
        DOI: 10.1145/3139486

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 11 September 2017
        • Accepted: 1 June 2017
        • Revised: 1 May 2017
        • Received: 1 November 2016
        Published in IMWUT Volume 1, Issue 3

        Qualifiers

        • research-article
        • Research
        • Refereed
