research-article

Push the Limit of Adversarial Example Attack on Speaker Recognition in Physical Domain

Authors:
Qianniu Chen (Zhejiang University and ZJU-HIC)
Meng Chen (Zhejiang University)
Li Lu (Zhejiang University)
Jiadi Yu (Shanghai Jiao Tong University)
Yingying Chen (Rutgers University)
Zhibo Wang (Zhejiang University)
Zhongjie Ba (Zhejiang University)
Feng Lin (Zhejiang University)
Kui Ren (Zhejiang University)

SenSys '22: Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, November 2022, Pages 710–724. https://doi.org/10.1145/3560905.3568518

Published: 24 January 2023

ABSTRACT

Integrating deep learning into Speaker Recognition (SR) has advanced its development and wide deployment, but it also introduces the emerging threat of adversarial examples. Only a few existing studies investigate this threat in the physical domain, and they either evaluate feasibility only by directly replaying generated adversarial examples or consider only part of the channel interference when improving robustness. In this paper, we propose PhyTalker, a physical adversarial example attack that generates and injects perturbations on voices in a live-streaming manner to attack various SR models over different physical channels. Unlike typical adversarial examples for digital attacks, PhyTalker generates a subphoneme-level perturbation dictionary to decouple perturbation optimization from injection. Moreover, we introduce channel augmentation to compensate for both device and environmental distortions, and model ensemble to improve perturbation transferability. Finally, PhyTalker recognizes and localizes the most recently recorded phoneme to determine the corresponding perturbations for real-time broadcasting. Extensive experiments with a large-scale corpus in real physical scenarios show that PhyTalker achieves an overall Attack Success Rate (ASR) of 85.5% against mainstream SR systems and a Mel Cepstral Distortion (MCD) of 2.45 dB in terms of human audibility.
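
For readers unfamiliar with the audibility metric quoted in the abstract, the following is a minimal sketch of the commonly used mel-cepstral distortion formula, averaged over frames. The abstract does not state how the reported MCD was computed (number of coefficients, frame alignment, feature extractor), so the function name `mel_cepstral_distortion`, the input layout, and the assumption that both sequences are already time-aligned with the energy coefficient removed are illustrative choices, not the authors' setup.

```python
import math


def mel_cepstral_distortion(ref, test):
    """Mean MCD in dB between two mel-cepstral sequences.

    ref, test: lists of frames, each frame a list of mel-cepstral
    coefficients. Assumption (not from the paper): the sequences are
    already time-aligned and the 0th (energy) coefficient is excluded.
    """
    if not ref or len(ref) != len(test):
        raise ValueError("sequences must be non-empty and time-aligned")

    scale = 10.0 / math.log(10.0)  # converts natural-log units to dB
    total = 0.0
    for ref_frame, test_frame in zip(ref, test):
        if len(ref_frame) != len(test_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance between the two cepstral vectors.
        squared = sum((a - b) ** 2 for a, b in zip(ref_frame, test_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


if __name__ == "__main__":
    clean = [[0.0, 0.0], [1.0, 2.0]]
    perturbed = [[0.1, 0.0], [1.0, 2.1]]
    print(f"MCD: {mel_cepstral_distortion(clean, perturbed):.3f} dB")
```

Under this definition a lower value means the perturbed audio is spectrally closer to the original, which is why MCD is used as a proxy for how noticeable a perturbation is.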

Index Terms

1. Computing methodologies
  1. Artificial intelligence
2. Security and privacy
  1. Network security
    1. Mobile and wireless security

Published in
SenSys '22: Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems
November 2022
1280 pages
ISBN: 9781450398862
DOI: 10.1145/3560905
General Chairs: Jeremy Gummeson (University of Massachusetts Amherst), Sunghoon Ivan Lee (University of Massachusetts Amherst)
Program Chairs: Jie Gao (Rutgers University), Guoliang Xing (The Chinese University of Hong Kong)
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adversarial example attack
live-streaming
physical domain
speaker recognition
Qualifiers
- research-article
Acceptance Rates
SenSys '22 Paper Acceptance Rate: 52 of 187 submissions, 28%. Overall Acceptance Rate: 174 of 867 submissions, 20%.