Transferring Audio Deepfake Detection Capability across Languages

Authors:
Zhongjie Ba (ORCID 0000-0003-0921-8869)
Qing Wen (ORCID 0000-0001-8967-3460)
Peng Cheng (ORCID 0000-0002-4453-2274)
Yuwei Wang (ORCID 0000-0003-3665-6311)
Feng Lin (ORCID 0000-0001-5240-5200)
Li Lu (ORCID 0000-0001-5230-3749)
Zhenguang Liu (ORCID 0000-0002-7981-9873)

All authors: Zhejiang University, China and ZJU-Hangzhou Global Scientific and Technological Innovation Center, China

WWW '23: Proceedings of the ACM Web Conference 2023, April 2023, pages 2033–2044. https://doi.org/10.1145/3543507.3583222

Published: 30 April 2023

ABSTRACT

The proliferation of deepfake content has motivated a surge of detection studies. However, existing audio deepfake detection methods focus exclusively on English, and data resources in other languages are scarce. Cross-lingual deepfake detection is a critical but rarely explored area that calls for further study. This paper conducts the first comprehensive study of deepfake detection from a cross-lingual perspective. We observe that English data, which covers a rich variety of deepfake algorithms, can teach a detector to recognize various spoofing artifacts, and that this knowledge helps the detector perform across language domains. Based on this observation, we first construct a cross-lingual evaluation dataset, the first of its kind, containing heterogeneous spoofed speech uttered in the two most widely spoken languages. We then explore domain adaptation (DA) techniques to transfer the artifact detection capability, and propose effective and practical DA strategies suited to the cross-lingual scenario. Our adversarial DA paradigm teaches the model to learn real/fake knowledge while discarding its dependency on language. Extensive experiments on 137 hours of audio clips validate that the adapted models can detect fake audio generated by unseen algorithms in the new language domain.
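The abstract does not specify how the adversarial DA paradigm is built. One common way to realise an objective of the form "learn real/fake knowledge while losing language dependency" is the gradient reversal layer of Ganin and Lempitsky (2015): a shared feature extractor feeds both a real/fake classifier and a language (domain) classifier, and the reversal layer negates the language gradient before it reaches the extractor, so the features are pushed to become uninformative about language. The toy Python sketch below assumes that formulation. The scalar model, the parameter names `wf`, `wy`, `by`, `wd`, `bd`, the weight `lam`, and the sample values are all hypothetical illustrations, not the paper's architecture or data.

```python
# Toy DANN-style sketch (hypothetical; not the paper's model): scalar features,
# hand-derived gradients, plain SGD.
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    def __init__(self, lam: float):
        self.lam = lam

    def forward(self, h: float) -> float:
        return h

    def backward(self, grad: float) -> float:
        return -self.lam * grad


def dann_grads(params, x, y, d, grl):
    """Per-sample gradients for hypothetical weights:
    wf (feature extractor), wy/by (real/fake head), wd/bd (language head).
    y: 0 = real, 1 = fake, or None for unlabeled target-language audio.
    d: 0 = source language (English), 1 = target language.
    """
    h = params["wf"] * x  # shared feature
    g = {"wy": 0.0, "by": 0.0}

    # Language head minimises its own BCE loss but sees h through the reversal layer.
    p_d = sigmoid(params["wd"] * grl.forward(h) + params["bd"])
    dz_d = p_d - d  # d(BCE)/d(logit)
    g["wd"], g["bd"] = dz_d * h, dz_d
    # Reversed gradient: a descent step on it is an ascent step on the language loss w.r.t. h.
    grad_h = grl.backward(dz_d * params["wd"])

    # Real/fake head: only labelled samples contribute.
    if y is not None:
        p_y = sigmoid(params["wy"] * h + params["by"])
        dz_y = p_y - y
        g["wy"], g["by"] = dz_y * h, dz_y
        grad_h += dz_y * params["wy"]

    g["wf"] = grad_h * x  # chain rule through h = wf * x
    return g


def sgd_step(params, grads, lr=0.1):
    return {k: params[k] - lr * grads[k] for k in params}


# Usage: (feature, real/fake label, domain) triples; all values are made up.
params = {"wf": 1.0, "wy": 0.5, "by": 0.0, "wd": 2.0, "bd": 0.0}
grl = GradientReversal(lam=0.5)
source = [(0.9, 1, 0), (-0.8, 0, 0)]        # labelled English clips
target = [(1.1, None, 1), (-1.0, None, 1)]  # unlabeled target-language clips
for _ in range(100):
    for x, y, d in source + target:
        params = sgd_step(params, dann_grads(params, x, y, d, grl))
```

In this sketch only the English (source) samples carry real/fake labels, and target-language samples contribute solely through the language head; treating the new language as unlabeled is an assumption made here for illustration. Setting `lam` to 0 recovers an ordinary classifier trained on source labels alone.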


Published in
WWW '23: Proceedings of the ACM Web Conference 2023
April 2023
4293 pages
ISBN:9781450394161
DOI:10.1145/3543507
Editors:
Ying Ding,
Jie Tang,
Juan Sequeda,
Lora Aroyo,
Carlos Castillo,
Geert-Jan Houben
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available / v1.1
Author Tags
Deepfake
audio spoofing
domain adaptation
transfer learning
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall acceptance rate: 1,899 of 8,196 submissions, 23%
Article Metrics
- 4 total citations
- 489 total downloads
- 487 downloads (last 12 months)
- 61 downloads (last 6 weeks)