ABSTRACT
The integration of deep learning into Speaker Recognition (SR) has advanced its development and wide deployment, but has also introduced the emerging threat of adversarial examples. However, only a few existing studies investigate this threat in the physical domain, and they either evaluate feasibility only by directly replaying generated adversarial examples or address only part of the channel interference to improve robustness. In this paper, we propose PhyTalker, a physical adversarial example attack that generates and injects perturbations into voices in a live-streaming manner to attack various SR models over different physical channels. In contrast to typical adversarial examples for digital attacks, PhyTalker generates a subphoneme-level perturbation dictionary to decouple perturbation optimization from injection. Moreover, we introduce channel augmentation to compensate for both device and environmental distortions, as well as model ensemble to improve perturbation transferability. Finally, PhyTalker recognizes and localizes the most recently recorded phoneme to determine the corresponding perturbations for real-time broadcasting. Extensive experiments on a large-scale corpus in real physical scenarios show that PhyTalker achieves an overall Attack Success Rate (ASR) of 85.5% against mainstream SR systems and a Mel Cepstral Distortion (MCD) of 2.45 dB as a measure of human audibility.
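To make the audibility metric concrete, the following is a minimal sketch of how an MCD value in dB is commonly computed from two time-aligned sequences of mel-cepstral coefficient frames. It is an illustration of the widely used formulation, not the evaluation code behind the reported result: the function name `mel_cepstral_distortion`, the assumption that frames are already aligned, and the assumption that the caller has dropped the energy (0th) coefficient are all hypothetical choices made here.

```python
import math


def mel_cepstral_distortion(ref_frames, test_frames):
    """Return the mean MCD in dB between two aligned cepstral sequences.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients. Assumption: the frames are already
    time-aligned and the 0th (energy) coefficient has been removed.
    """
    if not ref_frames or len(ref_frames) != len(test_frames):
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref, test in zip(ref_frames, test_frames):
        if len(ref) != len(test):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - t) ** 2 for r, t in zip(ref, test))
        total += scale * math.sqrt(2.0 * squared_error)

    # Average the per-frame distortion over all frames.
    return total / len(ref_frames)
```

Under this formulation, identical sequences give 0 dB, and a lower value means the perturbed audio is closer to the original in the mel-cepstral domain.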
Index Terms
- Push the Limit of Adversarial Example Attack on Speaker Recognition in Physical Domain