ABSTRACT
The proliferation of deepfake content has motivated a surge of detection studies. However, existing audio deepfake detection methods work exclusively in English, and data resources in other languages are lacking. Cross-lingual deepfake detection is a critical but rarely explored area that calls for more study. This paper presents the first comprehensive study of deepfake detection from a cross-lingual perspective. We observe that English data, which covers a rich variety of deepfake algorithms, can teach a detector about various spoofing artifacts and thereby help it perform detection across language domains. Based on this observation, we first construct a first-of-its-kind cross-lingual evaluation dataset of heterogeneous spoofed speech uttered in the two most widely spoken languages. We then explore domain adaptation (DA) techniques to transfer the artifact detection capability, and we propose effective and practical DA strategies that fit the cross-lingual scenario. Our adversarial DA paradigm teaches the model to learn real/fake knowledge while shedding language dependency. Extensive experiments on 137 hours of audio clips validate that the adapted models can detect fake audio generated by unseen algorithms in the new domain.
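The idea of learning real/fake knowledge while losing language dependency can be illustrated with gradient reversal, one common way to realise adversarial domain adaptation. The abstract does not say which adversarial formulation the paper uses, so the sketch below is an assumption rather than the authors' method. It uses a deliberately tiny scalar model, and all names (`grl_step`, `w`, `a`, `b`, `lam`) are hypothetical: `w` is a one-weight feature extractor, `a` the real/fake head, `b` the language head, and `lam` the reversal strength.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def grl_step(params, x, is_fake, lang, lam=1.0, lr=0.1):
    """One SGD step on a toy model with gradient reversal (illustrative only).

    params: dict with scalar weights
        'w' feature extractor, 'a' real/fake head, 'b' language head.
    x: scalar input, is_fake/lang: binary labels (0.0 or 1.0).
    """
    w, a, b = params["w"], params["a"], params["b"]

    # Forward pass: shared feature, then two logistic heads.
    f = w * x
    p_fake = sigmoid(a * f)
    p_lang = sigmoid(b * f)

    # For binary cross-entropy, d(loss)/d(logit) = prediction - label.
    g_task = p_fake - is_fake
    g_lang = p_lang - lang

    # Each head descends its own loss as usual.
    grad_a = g_task * f
    grad_b = g_lang * f

    # The feature extractor descends the real/fake loss but receives the
    # language gradient with its sign flipped and scaled by lam, which
    # pushes the shared feature to carry less language information.
    grad_w = (g_task * a - lam * g_lang * b) * x

    return {
        "w": w - lr * grad_w,
        "a": a - lr * grad_a,
        "b": b - lr * grad_b,
    }


if __name__ == "__main__":
    start = {"w": 1.0, "a": 0.0, "b": 1.0}
    print(grl_step(start, x=1.0, is_fake=1.0, lang=1.0))
```

Setting `lam=0` removes the adversarial term, so the feature extractor is trained on the real/fake loss alone; larger values trade detection accuracy on the source language against language invariance.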
Index Terms
- Transferring Audio Deepfake Detection Capability across Languages