Abstract
Continuous audio analysis on embedded and mobile devices is an increasingly important application domain. Appliances such as the Amazon Echo, smartphones and watches, and even research prototypes increasingly seek to perform multiple discriminative tasks simultaneously from ambient audio; for example, monitoring background sound classes (e.g., music or conversation), recognizing certain keywords (‘Hey Siri’ or ‘Alexa’), or identifying the user and her emotion from speech. Deep learning algorithms typically provide state-of-the-art model performance for such general audio tasks. However, the large computational demands of deep learning models are at odds with the limited processing, energy and memory resources of mobile, embedded and IoT devices.
In this paper, we propose and evaluate a novel deep learning modeling and optimization framework that specifically targets this category of embedded audio sensing tasks. Although the supported tasks are simpler than speech recognition, the framework aims to maintain prediction accuracy while minimizing the overall processor resource footprint. The proposed model is grounded in multi-task learning principles to train shared deep layers and, to further lower computation, uses only statistical summaries of audio filter banks as its input layer.
We find that for embedded audio sensing tasks our framework maintains accuracies similar to those observed in comparable deep architectures that use single-task learning and typically more complex input layers. Most importantly, on average, this approach provides almost a 2.1× reduction in runtime, energy, and memory for four separate audio sensing tasks, assuming a variety of task combinations.
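The shared-layer idea described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the choice of summary statistics (mean, standard deviation, minimum, maximum), the 24 filter-bank bands, the layer sizes, and the task names and class counts are all assumptions made for illustration, and the weights are random rather than trained. The point it shows is that the shared hidden layers are computed once per audio window, and each task adds only a small output layer.

```python
import math
import random
import statistics


def summarize(frames):
    """Collapse a window of filter-bank frames into per-band summary statistics.

    The four statistics used here are an assumption; the abstract only says
    "statistical summaries of audio filter banks".
    """
    feats = []
    for band in zip(*frames):  # one tuple of values per filter-bank band
        feats += [statistics.fmean(band), statistics.pstdev(band), min(band), max(band)]
    return feats


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SharedMultiTaskNet:
    """Hypothetical model: shared hidden layers plus one softmax head per task."""

    def __init__(self, n_in, hidden_sizes, task_classes, seed=0):
        rng = random.Random(seed)
        self.shared = []
        prev = n_in
        for size in hidden_sizes:
            self.shared.append(init_layer(rng, prev, size))
            prev = size
        self.heads = {name: init_layer(rng, prev, n) for name, n in task_classes.items()}

    def forward(self, features):
        h = features
        for weights, biases in self.shared:  # computed once for all tasks
            h = relu(dense(h, weights, biases))
        # Each task only pays for its own small output layer.
        return {name: softmax(dense(h, w, b)) for name, (w, b) in self.heads.items()}


# Usage with synthetic data: 50 frames of 24 filter-bank values (assumed sizes).
rng = random.Random(1)
frames = [[rng.random() for _ in range(24)] for _ in range(50)]
features = summarize(frames)
tasks = {"ambient": 3, "keyword": 2, "speaker": 5, "emotion": 4}  # hypothetical
net = SharedMultiTaskNet(len(features), [64, 64], tasks)
outputs = net.forward(features)
```

In a single-task setup, each of the four tasks would run its own full stack of hidden layers; here the hidden layers run once per window, which is the kind of sharing that could plausibly account for the reported savings in runtime, energy, and memory.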