ABSTRACT
This paper presents a technique for scanning neural-network-based AI models to determine whether they are trojaned. Pre-trained AI models may contain back-doors that are injected through training or by transforming inner neuron weights. These trojaned models operate normally when regular inputs are provided, but misclassify to a specific output label when the input is stamped with a special pattern called the trojan trigger. We develop a novel technique that analyzes inner neuron behaviors by determining how output activations change when we introduce different levels of stimulation to a neuron. A neuron that substantially elevates the activation of a particular output label regardless of the provided input is considered potentially compromised. The trojan trigger is then reverse-engineered through an optimization procedure guided by the stimulation analysis results, to confirm that a neuron is truly compromised. We evaluate our system, ABS, on 177 trojaned models covering various attack methods that target both the input space and the feature space, with various trojan trigger sizes and shapes, together with 144 benign models trained with different data and initial weight values. These models belong to 7 different model structures and 6 different datasets, including complex ones such as ImageNet, VGG-Face, and ResNet110. Our results show that ABS is highly effective, achieving a detection rate of over 90% in most cases (and 100% in many) when only one input sample is provided for each output label. It substantially outperforms the state-of-the-art technique Neural Cleanse, which requires many input samples and small trojan triggers to achieve good performance.
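To make the stimulation analysis concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation): a candidate inner neuron's activation is overridden with a range of stimulation levels, the resulting output logits are recorded, and the neuron is flagged as potentially compromised if some level elevates the same output label by a large margin for every provided sample. The names `stimulate_neuron` and `is_suspicious`, the stimulation range, and the `margin` threshold are all illustrative assumptions.

```python
import torch

def stimulate_neuron(model, layer, x, neuron_idx, levels):
    """Override one inner neuron's activation with each value in `levels`
    and record the resulting output logits for input x (shape [1, ...])."""
    logits_per_level = []
    for v in levels:
        def set_activation(module, inputs, output):
            out = output.clone()
            out[:, neuron_idx] = v          # force the candidate neuron to level v
            return out
        handle = layer.register_forward_hook(set_activation)
        with torch.no_grad():
            logits_per_level.append(model(x).squeeze(0))
        handle.remove()
    return torch.stack(logits_per_level)    # shape: [len(levels), num_labels]

def is_suspicious(model, layer, samples, neuron_idx,
                  levels=torch.linspace(0.0, 100.0, 21), margin=20.0):
    """Flag a neuron if some stimulation level elevates one particular label
    well above all others, and it is the same label for every input sample."""
    elevated_labels = []
    for x in samples:                        # e.g., one sample per output label
        logits = stimulate_neuron(model, layer, x, neuron_idx, levels)
        gain = logits - logits[0]            # change relative to the lowest level
        best_gain, best_label = gain.max(dim=1)
        level = int(best_gain.argmax())      # level with the largest elevation
        others = gain[level].clone()
        others[best_label[level]] = float("-inf")
        if best_gain[level] - others.max() < margin:
            return False                     # no label is elevated decisively
        elevated_labels.append(int(best_label[level]))
    return len(set(elevated_labels)) == 1    # same target label regardless of input
```

In this sketch, neurons that pass the check would then be handed to the trigger reverse-engineering step described above for confirmation.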
REFERENCES
- acoomans. 2013. instagram-filters. https://github.com/acoomans/instagram-filters
- Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420 (2018).
- Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
- Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. 2017. Adversarial patch. arXiv preprint arXiv:1712.09665 (2017).
- Yinzhi Cao, Alexander Fangxiao Yu, Andrew Aday, Eric Stahl, Jon Merwine, and Junfeng Yang. 2018. Efficient repair of polluted machine learning systems via causal unlearning. In Proceedings of the 2018 Asia Conference on Computer and Communications Security. ACM, 735--747.
- Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017).
- Edward Chou, Florian Tramèr, Giancarlo Pellegrino, and Dan Boneh. 2018. SentiNet: Detecting Physical Attacks Against Deep Learning Systems. arXiv preprint arXiv:1812.00292 (2018).
- Joseph Clements and Yingjie Lao. 2018. Hardware trojan attacks on neural networks. arXiv preprint arXiv:1806.05768 (2018).
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- Minghong Fang, Guolei Yang, Neil Zhenqiang Gong, and Jia Liu. 2018. Poisoning attacks to graph-based recommender systems. In Proceedings of the 34th Annual Computer Security Applications Conference. ACM, 381--392.
- Yarin Gal. 2016. Uncertainty in deep learning. Ph.D. Dissertation. University of Cambridge.
- Yansong Gao, Chang Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. 2019. STRIP: A Defence Against Trojan Attacks on Deep Neural Networks. arXiv preprint arXiv:1902.06531 (2019).
- Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448.
- Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. 2017. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017).
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. 2018. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 19--35.
- Yujie Ji, Xinyang Zhang, Shouling Ji, Xiapu Luo, and Ting Wang. 2018. Model-reuse attacks on deep learning systems. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 349--363.
- Yujie Ji, Xinyang Zhang, and Ting Wang. 2017. Backdoor attacks against learning systems. In 2017 IEEE Conference on Communications and Network Security (CNS).
- Melissa King. 2019. TrojAI. https://www.iarpa.gov/index.php?option=com_content&view=article&id=1142&Itemid=443
- Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
- Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE (1998).
- Gil Levi and Tal Hassner. 2015. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 34--42.
- Wenshuo Li, Jincheng Yu, Xuefei Ning, Pengjun Wang, Qi Wei, Yu Wang, and Huazhong Yang. 2018. Hu-Fu: Hardware and software collaborative attack framework against neural networks. In ISVLSI.
- Cong Liao, Haoti Zhong, Anna Squicciarini, Sencun Zhu, and David Miller. 2018. Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307 (2018).
- Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).
- Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018a. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses. Springer, 273--294.
- Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018b. Trojaning Attack on Neural Networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18--21, 2018. The Internet Society.
- Yuntao Liu, Yang Xie, and Ankur Srivastava. 2017. Neural trojans. In 2017 IEEE International Conference on Computer Design (ICCD). IEEE, 45--48.
- Shiqing Ma, Yingqi Liu, Guanhong Tao, Wen-Chuan Lee, and Xiangyu Zhang. 2019. NIC: Detecting Adversarial Samples with Neural Network Invariant Checking. In 26th Annual Network and Distributed System Security Symposium, NDSS.
- Wei Ma and Jun Lu. 2017. An Equivalence of Fully Connected Layer and Convolutional Layer. arXiv preprint arXiv:1712.01252 (2017).
- Andreas Møgelmose, Dongran Liu, and Mohan M Trivedi. 2014. Traffic sign detection for US roads: Remaining challenges and a case for tracking. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC).
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1765--1773.
- Mehran Mozaffari-Kermani, Susmita Sur-Kolay, Anand Raghunathan, and Niraj K Jha. 2015. Systematic poisoning attacks on and defenses for machine learning in healthcare. IEEE Journal of Biomedical and Health Informatics (2015).
- Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security. ACM, 506--519.
- Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P).
- Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. 2015. Deep face recognition. In BMVC, Vol. 1. 6.
- Andrea Paudice, Luis Muñoz-González, and Emil C Lupu. 2018. Label sanitization against label flipping poisoning attacks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 5--15.
- Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779--788.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, Vol. 115, 3 (2015), 211--252.
- Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. 2018. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In NeurIPS.
- Mahmood Sharif, Lujo Bauer, and Michael K Reiter. 2018. On the suitability of Lp-norms for creating and preventing adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2011. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IJCNN, Vol. 6. 7.
- Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2012. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, Vol. 32 (2012), 323--332.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
- Guanhong Tao, Shiqing Ma, Yingqi Liu, and Xiangyu Zhang. 2018. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems. 7717--7728.
- Alexander Turner, Dimitris Tsipras, and Aleksander Madry. 2018. Clean-Label Backdoor Attacks. (2018).
- Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2019. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE.
- Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, Vol. 13, 4 (2004), 600--612.
- Wikipedia. 2019. Electrical brain stimulation. https://en.wikipedia.org/wiki/Electrical_brain_stimulation
- Xi Wu, Uyeong Jang, Jiefeng Chen, Lingjiao Chen, and Somesh Jha. 2017. Reinforcing adversarial robustness using model confidence induced by adversarial training. arXiv preprint arXiv:1711.08001 (2017).
- Weilin Xu, David Evans, and Yanjun Qi. 2017. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155 (2017).
- Wei Yang, Deguang Kong, Tao Xie, and Carl A Gunter. 2017. Malware detection in adversarial settings: Exploiting feature evolutions and confusions in android apps. In Proceedings of the 33rd Annual Computer Security Applications Conference.
- Minhui Zou, Yang Shi, Chengliang Wang, Fangyu Li, WenZhan Song, and Yu Wang. 2018. PoTrojan: powerful neural-level trojan designs in deep learning models. arXiv preprint arXiv:1802.03043 (2018).