DOI: 10.1145/3427228.3427276
research-article

Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems

Published: 08 December 2020

ABSTRACT

Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. Previously published over-the-air adversarial examples fall into one of three categories: they are either handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room where the attack takes place, and are hence not transferable to other rooms.

In this paper, we demonstrate the first algorithm that produces generic adversarial examples against hybrid ASR systems that remain robust in an over-the-air attack without being adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute robust adversarial examples for arbitrary room characteristics and employ the ASR system Kaldi to demonstrate the attack. Further, our algorithm can utilize psychoacoustic methods to hide changes to the original audio signal below the human threshold of hearing. In practical experiments, we show that the adversarial examples work for varying room setups, and that no direct line-of-sight between speaker and microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply these to arbitrary room setups without any prior knowledge.
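As a rough illustration of the role RIRs play here, the sketch below models over-the-air playback as convolution of an audio signal with a room impulse response and applies several different RIRs to the same signal. It is not the paper's implementation: the abstract does not give the algorithm's details, so the toy RIR model (a direct path plus exponentially decaying noise), the function names `synthetic_rir` and `simulate_playback`, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def synthetic_rir(rng, sample_rate=16000, rt60=0.4, length_s=0.5):
    """Toy RIR (illustrative assumption): direct path plus decaying noise tail."""
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that drops by 60 dB after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay * 0.1
    rir[0] = 1.0  # direct path from loudspeaker to microphone
    return rir / np.max(np.abs(rir))

def simulate_playback(audio, rir):
    """Approximate over-the-air transmission as convolution with an RIR."""
    # Truncate to the input length so all simulated recordings align.
    return np.convolve(audio, rir)[: len(audio)]

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # placeholder for one second of audio at 16 kHz

# Several hypothetical rooms with different reverberation times (in seconds).
rooms = [synthetic_rir(rng, rt60=rt) for rt in (0.2, 0.4, 0.8)]
received = [simulate_playback(audio, h) for h in rooms]
```

An attack that should hold up in unknown rooms would presumably be optimized against many such simulated recordings rather than against the clean signal alone. How the RIRs are sampled and how they enter the optimization is described in the full paper, not in this abstract.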


Published in

ACSAC '20: Proceedings of the 36th Annual Computer Security Applications Conference
December 2020
962 pages
ISBN: 9781450388580
DOI: 10.1145/3427228

Copyright © 2020 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 104 of 497 submissions, 21%
