ABSTRACT
Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining the sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over the state of the art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements.
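To illustrate the few-shot personalization idea the abstract describes — learning a new sound category from only a handful of user recordings — the sketch below uses a nearest-class-prototype classifier in the style of prototypical networks. This is an illustrative assumption, not the paper's actual implementation: it presumes some pretrained embedding model has already mapped each audio clip to a fixed-length vector, and all function and label names here are hypothetical.

```python
import numpy as np

def make_prototypes(embeddings, labels):
    """Average each class's few recorded examples into one
    prototype vector per class (hypothetical helper)."""
    protos = {}
    for c in set(labels):
        members = [e for e, l in zip(embeddings, labels) if l == c]
        protos[c] = np.mean(members, axis=0)
    return protos

def classify(query, protos):
    """Assign a query clip's embedding to the class whose
    prototype is nearest in Euclidean distance."""
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

# Toy 2-D "embeddings": two recordings per user-defined category.
embeddings = [np.array([0.0, 0.0]), np.array([0.0, 1.0]),
              np.array([5.0, 5.0]), np.array([5.0, 6.0])]
labels = ["doorbell", "doorbell", "dog_bark", "dog_bark"]

protos = make_prototypes(embeddings, labels)
print(classify(np.array([0.2, 0.4]), protos))  # → doorbell
```

Because only the prototype vectors need to be computed at enrollment time (no gradient updates), this kind of scheme can run on-device in real time, which is consistent with the deployment the abstract reports.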
ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users