ABSTRACT
Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. Previously published over-the-air adversarial examples fall into one of three categories: they are handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room where the attack takes place and are hence not transferable to other rooms.
In this paper, we demonstrate the first algorithm that produces generic adversarial examples against hybrid ASR systems, which remain robust in an over-the-air attack that is not adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute robust adversarial examples for arbitrary room characteristics and employ the ASR system Kaldi to demonstrate the attack. Further, our algorithm can utilize psychoacoustic methods to hide changes to the original audio signal below the human threshold of hearing. In practical experiments, we show that the adversarial examples work for varying room setups, and that no direct line-of-sight between speaker and microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply these to arbitrary room setups without any prior knowledge.
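The role of RIRs in the abstract can be illustrated with a toy sketch: over-the-air playback is commonly modeled as a convolution of the played signal with the RIR of the room, so sampling many RIRs yields many simulated rooms for the same audio. The code below is not the authors' implementation. The function names (`synthetic_rir`, `convolve`, `simulate_rooms`), the exponentially decaying noise model, and all parameter values are assumptions chosen for illustration only.

```python
import math
import random


def synthetic_rir(length, decay, rng):
    """Toy RIR (assumed model): unit direct path plus exponentially decaying noise."""
    taps = [rng.gauss(0.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    taps[0] = 1.0  # direct path from loudspeaker to microphone
    return taps


def convolve(signal, rir):
    """Full linear convolution; output has len(signal) + len(rir) - 1 samples."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def simulate_rooms(signal, num_rooms, rir_length=64, seed=0):
    """Return the signal as 'heard' in num_rooms randomly sampled toy rooms."""
    rng = random.Random(seed)
    received = []
    for _ in range(num_rooms):
        # Hypothetical range of decay rates standing in for different rooms.
        decay = rng.uniform(0.02, 0.2)
        received.append(convolve(signal, synthetic_rir(rir_length, decay, rng)))
    return received
```

In a robustness-oriented optimization, each simulated version would be passed to the recognizer and the perturbation updated so that the target transcription survives across the sampled rooms. That optimization loop is omitted here because the abstract does not specify it.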
Index Terms
- Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems