Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems

Authors:
Lea Schönherr, Ruhr University Bochum, Germany
Thorsten Eisenhofer, Ruhr University Bochum
Steffen Zeiler, Ruhr University Bochum
Thorsten Holz, Ruhr University Bochum, Germany
Dorothea Kolossa, Ruhr University Bochum

ACSAC '20: Proceedings of the 36th Annual Computer Security Applications Conference, December 2020, Pages 843–855
https://doi.org/10.1145/3427228.3427276

Published: 8 December 2020

ABSTRACT

Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. Previously published over-the-air adversarial examples fall into one of three categories: they are either handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room where the attack takes place, and are hence not transferable to other rooms.

In this paper, we demonstrate the first algorithm that produces generic adversarial examples against hybrid ASR systems, which remain robust in an over-the-air attack that is not adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute robust adversarial examples for arbitrary room characteristics and employ the ASR system Kaldi to demonstrate the attack. Further, our algorithm can utilize psychoacoustic methods to hide changes of the original audio signal below the human thresholds of hearing. In practical experiments, we show that the adversarial examples work for varying room setups, and that no direct line-of-sight between speaker and microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply these to arbitrary room setups without any prior knowledge.
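
The role of room impulse responses can be illustrated with a minimal sketch. Playing a signal in a room can be modeled as convolving it with that room's RIR, so drawing several RIRs gives several plausible "received" versions of the same audio. This is not the authors' implementation: the function names `synthetic_rir` and `play_over_the_air`, the toy RIR model (a direct-path impulse followed by exponentially decaying noise), and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def synthetic_rir(rng, sample_rate=16000, rt60=0.4, length_s=0.5):
    """Toy RIR (assumption): direct path plus exponentially decaying noise."""
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that drops by 60 dB after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = 0.1 * rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct path from loudspeaker to microphone
    return rir / np.max(np.abs(rir))


def play_over_the_air(signal, rir):
    """Simulate playback in a room by convolving the signal with the RIR."""
    return np.convolve(signal, rir)[: len(signal)]


# Hypothetical usage: one test tone "played" in three simulated rooms.
rng = np.random.default_rng(0)
sample_rate = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
rooms = [synthetic_rir(rng, sample_rate, rt60=rt) for rt in (0.2, 0.4, 0.8)]
received = [play_over_the_air(signal, rir) for rir in rooms]
```

An attack that is meant to survive unknown rooms would need to succeed on many such received versions rather than only on the clean `signal`. How the paper actually samples rooms and optimizes over them is not stated in the abstract.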



Published in

ACSAC '20: Proceedings of the 36th Annual Computer Security Applications Conference
December 2020
962 pages
ISBN: 9781450388580
DOI: 10.1145/3427228

Copyright © 2020 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adversarial examples
automatic speech recognition
over-the-air attack
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 104 of 497 submissions, 21%