DOI: 10.1145/3372297.3423348
Research Article | Public Access

AdvPulse: Universal, Synchronization-free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations

Published: 02 November 2020

ABSTRACT

Existing efforts in audio adversarial attacks focus only on scenarios where the adversary has prior knowledge of the entire speech input and can therefore generate an adversarial example by aligning and mixing the audio input with the corresponding adversarial perturbation. In this work, we consider a more practical and challenging attack scenario in which the intelligent audio system takes streaming audio inputs (e.g., live human speech) and the adversary deceives the system by playing adversarial perturbations simultaneously. This change in attack behavior brings great challenges and prevents existing adversarial perturbation generation methods from being applied directly. In practice, (1) the adversary cannot anticipate what the victim will say, so prior knowledge of the speech signal cannot be used to guide perturbation generation; and (2) the adversary cannot control when the victim will speak, so synchronization between the adversarial perturbation and the speech cannot be guaranteed. To address these challenges, we propose AdvPulse, a systematic approach for generating subsecond audio adversarial perturbations that can alter the recognition results of streaming audio inputs in a targeted and synchronization-free manner. To circumvent the constraints on speech content and timing, we exploit a penalty-based universal adversarial perturbation generation algorithm and incorporate the varying time delay into the optimization process. We further tailor the adversarial perturbation according to environmental sounds to make it inconspicuous to humans. Additionally, by considering the sources of distortion that occur during physical playback, we are able to generate more robust audio adversarial perturbations that remain effective even under over-the-air propagation. Extensive experiments on two representative types of intelligent audio systems (i.e., speaker recognition and speech command recognition) are conducted in various realistic environments. The results show that our attack achieves an average success rate of over 89.6% in indoor environments and 76.0% in inside-vehicle scenarios, even with loud engine and road noise.
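
To make the core idea concrete, the following is a minimal, hypothetical sketch (in PyTorch, not the authors' implementation) of penalty-based universal, synchronization-free perturbation training as the abstract describes it: a short perturbation is inserted at a random time offset into each training utterance, and a targeted classification loss plus a perturbation-magnitude penalty is minimized. The names `model`, `speech_loader`, `target_label`, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of penalty-based universal, synchronization-free
# perturbation training. `model`, `speech_loader`, and all hyperparameters
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn.functional as F


def train_universal_pulse(model, speech_loader, target_label,
                          pulse_len=8000, epsilon=0.05, alpha=0.1,
                          epochs=10, lr=1e-3):
    """Learn a subsecond perturbation (e.g., 0.5 s at 16 kHz) that steers
    the classifier toward `target_label` regardless of when it is played."""
    delta = torch.zeros(pulse_len, requires_grad=True)   # universal pulse
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(epochs):
        for speech, _ in speech_loader:                   # speech: (batch, samples)
            batch, n = speech.shape
            perturbed = []
            for i in range(batch):
                # Synchronization-free: place the pulse at a random delay,
                # so the same delta must work for any time offset.
                t = int(torch.randint(0, n - pulse_len + 1, (1,)))
                shifted = F.pad(delta, (t, n - pulse_len - t))
                perturbed.append(speech[i] + shifted)
            perturbed = torch.stack(perturbed)

            logits = model(perturbed)
            target = torch.full((batch,), target_label, dtype=torch.long)
            # Penalty-based objective: targeted loss + perturbation-size penalty.
            loss = F.cross_entropy(logits, target) + alpha * delta.norm(p=2)

            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-epsilon, epsilon)           # keep the pulse quiet
    return delta.detach()
```

The over-the-air robustness mentioned in the abstract could additionally be approximated inside the same loop, for example by convolving the perturbed audio with recorded room impulse responses and adding ambient noise before feeding it to the model; this augmentation step is likewise an assumption for illustration.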


Supplemental Material

Copy of CCS2020_fpe202_Zhuohang Li - Pat Weeden.mov (MOV, 285.1 MB)


Published in

CCS '20: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security
October 2020, 2180 pages
ISBN: 9781450370899
DOI: 10.1145/3372297

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States




Acceptance Rates

Overall Acceptance Rate: 1,261 of 6,999 submissions, 18%

