Low-resource Multi-task Audio Sensing for Mobile and Embedded Devices via Shared Deep Neural Network Representations

Authors:
Petko Georgiev (University of Cambridge)
Sourav Bhattacharya (Nokia Bell Labs)
Nicholas D. Lane (University College London and Nokia Bell Labs)
Cecilia Mascolo (University of Cambridge)

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Issue 3, Article No.: 50, pp 1–19, https://doi.org/10.1145/3131895

Published: 11 September 2017

Abstract

Continuous audio analysis on embedded and mobile devices is an increasingly important application domain. Appliances like the Amazon Echo, along with smartphones, watches, and even research prototypes, increasingly seek to perform multiple discriminative tasks simultaneously on ambient audio: for example, monitoring background sound classes (e.g., music or conversation), recognizing certain keywords ('Hey Siri' or 'Alexa'), or identifying the user and her emotion from speech. Deep learning algorithms typically provide state-of-the-art performance for such general audio tasks. However, the large computational demands of deep learning models are at odds with the limited processing, energy, and memory resources of mobile, embedded, and IoT devices.

In this paper, we propose and evaluate a novel deep learning modeling and optimization framework that specifically targets this category of embedded audio sensing tasks. Although the supported tasks are simpler than speech recognition, the framework aims to maintain prediction accuracy while minimizing the overall processor resource footprint. The proposed model is grounded in multi-task learning principles to train shared deep layers and uses, as its input layer, only statistical summaries of audio filter banks to further reduce computation.
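The idea in the paragraph above can be illustrated with a minimal sketch: filter-bank frames are collapsed into a fixed-length vector of summary statistics, which passes through hidden layers shared by all tasks before branching into one small output layer per task. Everything concrete here is an assumption made for illustration and is not taken from the paper: the choice of statistics (mean, standard deviation, minimum, maximum), the layer sizes, the task names and class counts, and the helper names `summarize_filter_banks` and `SharedMultiTaskNet`. The weights are random, so the sketch shows only the forward data flow, not training.

```python
import numpy as np


def summarize_filter_banks(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, num_bands) filter-bank matrix into one vector.

    The statistics used here are an illustrative assumption; the abstract
    only says "statistical summaries of audio filter banks".
    """
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ])


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


class SharedMultiTaskNet:
    """Hypothetical network: shared hidden layers plus one softmax head per task."""

    def __init__(self, input_dim, hidden_dims, task_classes, seed=0):
        rng = np.random.default_rng(seed)
        dims = [input_dim, *hidden_dims]
        # Hidden layers whose weights are reused by every task.
        self.shared = [
            (rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ]
        # One task-specific output layer per task.
        self.heads = {
            task: (rng.normal(0.0, 0.1, (dims[-1], n)), np.zeros(n))
            for task, n in task_classes.items()
        }

    def forward(self, x: np.ndarray) -> dict:
        h = x
        for W, b in self.shared:  # computed once, regardless of task count
            h = relu(h @ W + b)
        return {t: softmax(h @ W + b) for t, (W, b) in self.heads.items()}


# Example: 100 frames of 24 filter-bank coefficients (synthetic data).
frames = np.random.default_rng(1).normal(size=(100, 24))
features = summarize_filter_banks(frames)  # shape (96,)
net = SharedMultiTaskNet(
    input_dim=features.size,
    hidden_dims=[128, 128],
    task_classes={"ambient_sound": 4, "keyword": 2, "speaker": 10, "emotion": 5},
)
predictions = net.forward(features)
for task, probs in predictions.items():
    print(task, probs.shape)
```

The point of the design is that the shared hidden layers run once per audio window, and each additional task adds only the cost of its own output layer.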

We find that, for embedded audio sensing tasks, our framework maintains accuracies similar to those observed in comparable deep architectures that use single-task learning and typically more complex input layers. Most importantly, on average, this approach provides almost a 2.1× reduction in runtime, energy, and memory for four separate audio sensing tasks, assuming a variety of task combinations.
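To see why sharing hidden layers can lower memory and computation, the following back-of-the-envelope sketch counts the parameters of fully connected networks when each task has its own network versus when all tasks share the hidden layers. The layer widths, class counts, and function names are hypothetical. The resulting ratio is not the paper's 2.1× figure, which is an average over runtime, energy, and memory across a variety of task combinations.

```python
def dense_params(dims):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


def separate_params(input_dim, hidden_dims, task_classes):
    """Total parameters when every task has its own full network."""
    return sum(
        dense_params([input_dim, *hidden_dims, n]) for n in task_classes.values()
    )


def shared_params(input_dim, hidden_dims, task_classes):
    """Total parameters when hidden layers are shared and only output layers differ."""
    trunk = dense_params([input_dim, *hidden_dims])
    heads = sum(dense_params([hidden_dims[-1], n]) for n in task_classes.values())
    return trunk + heads


# Hypothetical configuration: 96 inputs, two hidden layers of 128 units, four tasks.
tasks = {"ambient_sound": 4, "keyword": 2, "speaker": 10, "emotion": 5}
separate = separate_params(96, [128, 128], tasks)  # 118421
shared = shared_params(96, [128, 128], tasks)      # 31637
print(f"separate={separate} shared={shared} ratio={separate / shared:.2f}")
```

With all four tasks fully sharing the hidden layers, the parameter ratio in this toy setting approaches the number of tasks. A measured average can be lower, for example when only some task combinations are active or when not every task shares every layer.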

Index Terms

1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
2. Human-centered computing
  1. Ubiquitous and mobile computing
    1. Ubiquitous and mobile computing systems and tools

Published in

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Volume 1, Issue 3
September 2017
2023 pages
EISSN:2474-9567
DOI:10.1145/3139486

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 September 2017
- Accepted: 1 June 2017
- Revised: 1 May 2017
- Received: 1 November 2016
Author Tags
Audio sensing
deep learning
multi-task learning
shared representation
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 38
- Total Downloads: 871
- Downloads (Last 12 months): 72
- Downloads (Last 6 weeks): 6
