ABSTRACT
Existing work on audio adversarial attacks focuses only on scenarios in which the adversary knows the entire speech input in advance and generates an adversarial example by aligning and mixing the audio input with the corresponding adversarial perturbation. In this work, we consider a more practical and challenging attack scenario in which the intelligent audio system takes streaming audio inputs (e.g., live human speech) and the adversary deceives the system by playing adversarial perturbations simultaneously. This change in attack behavior poses significant challenges and prevents existing adversarial perturbation generation methods from being applied directly. In practice, (1) the adversary cannot anticipate what the victim will say, so prior knowledge of the speech signal cannot guide the generation of adversarial perturbations; and (2) the adversary cannot control when the victim will speak, so synchronization between the adversarial perturbation and the speech cannot be guaranteed. To address these challenges, we propose AdvPulse, a systematic approach to generating subsecond audio adversarial perturbations that can alter the recognition results of streaming audio inputs in a targeted and synchronization-free manner. To circumvent the constraints on speech content and timing, we exploit a penalty-based universal adversarial perturbation generation algorithm and incorporate the varying time delay into the optimization process. We further tailor the adversarial perturbation to environmental sounds to make it inconspicuous to humans. Additionally, by considering the sources of distortion that occur during physical playback, we generate more robust audio adversarial perturbations that remain effective even under over-the-air propagation.
We conduct extensive experiments on two representative types of intelligent audio systems (i.e., speaker recognition and speech command recognition) in various realistic environments. The results show that our attack achieves an average attack success rate of over 89.6% in indoor environments and 76.0% in inside-vehicle scenarios, even with loud engine and road noise.
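The idea of folding the unknown time delay into a penalty-based universal perturbation objective can be illustrated with a minimal sketch. It is not the paper's implementation: the toy linear softmax classifier `W`, the L2 penalty weight `penalty`, the exhaustive averaging over all delays, and every function name below are assumptions made for illustration only. The sketch fits one short perturbation `delta` that lowers the targeted cross-entropy averaged over every clip and every insertion offset.

```python
import numpy as np

def add_at_delay(audio, delta, delay):
    """Return a copy of `audio` with `delta` mixed in starting at sample `delay`."""
    mixed = audio.copy()
    mixed[delay:delay + len(delta)] += delta
    return mixed

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def objective_and_grad(W, clips, delta, target, penalty):
    """Targeted cross-entropy averaged over all clips and all delays, plus an
    L2 penalty on the perturbation. Returns (value, gradient w.r.t. delta)."""
    n, m = W.shape[1], len(delta)
    delays = range(n - m + 1)          # every offset at which delta still fits
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    value, grad = 0.0, np.zeros(m)
    for x in clips:
        for d in delays:
            p = softmax(W @ add_at_delay(x, delta, d))
            value += -np.log(p[target])
            # Gradient of cross-entropy w.r.t. the input, restricted to the
            # samples that delta occupies at this delay.
            grad += (W.T @ (p - onehot))[d:d + m]
    count = len(clips) * len(delays)
    value = value / count + penalty * np.sum(delta ** 2)
    grad = grad / count + 2.0 * penalty * delta
    return value, grad

def fit_universal_delta(W, clips, target, delta_len, steps=50, lr=0.5, penalty=0.01):
    """Plain gradient descent on the delay-averaged, penalized objective."""
    delta = np.zeros(delta_len)
    for _ in range(steps):
        _, grad = objective_and_grad(W, clips, delta, target, penalty)
        delta -= lr * grad
    return delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(3, 64))      # hypothetical 3-class toy model
    clips = rng.normal(size=(4, 64))        # stand-ins for speech clips
    delta = fit_universal_delta(W, clips, target=2, delta_len=16)
    print(objective_and_grad(W, clips, np.zeros(16), 2, 0.01)[0],
          objective_and_grad(W, clips, delta, 2, 0.01)[0])
```

Because the same `delta` is scored at every offset and on every clip, the optimizer cannot rely on a particular alignment or utterance, which is the intuition behind a universal, synchronization-free perturbation. A real attack would replace the toy model with the victim network and would likely sample delays stochastically rather than enumerate them.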
Index Terms
- AdvPulse: Universal, Synchronization-free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations