research-article

Audee: automated testing for deep learning frameworks

Authors:
Qianyu Guo, Tianjin University, China
Xiaofei Xie, Nanyang Technological University, Singapore
Yi Li, Nanyang Technological University, Singapore
Xiaoyu Zhang, Xi'an Jiaotong University, China
Yang Liu, Nanyang Technological University, Singapore
Xiaohong Li, Tianjin University, China
Chao Shen, Xi'an Jiaotong University, China

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, December 2020, Pages 486–498. https://doi.org/10.1145/3324884.3416571

Published: 27 January 2021

ABSTRACT

Deep learning (DL) has been applied widely, and the quality of DL systems has become crucial, especially for safety-critical applications. Existing work mainly focuses on the quality analysis of DL models but pays little attention to the underlying frameworks on which all DL models depend. In this work, we propose Audee, a novel approach for testing DL frameworks and localizing bugs. Audee adopts a search-based approach and implements three different mutation strategies to generate diverse test cases by exploring combinations of model structures, parameters, weights and inputs. Audee is able to detect three types of bugs: logical bugs, crashes and Not-a-Number (NaN) errors. In particular, for logical bugs, Audee adopts a cross-reference check to detect behavioural inconsistencies across multiple frameworks (e.g., TensorFlow and PyTorch), which may indicate potential bugs in their implementations. For NaN errors, Audee adopts a heuristic-based approach to generate DNNs that tend to output outliers (i.e., values that are too large or too small), and these values are likely to produce NaN. Furthermore, Audee leverages a causal-testing-based technique to localize the layers and parameters that cause inconsistencies or bugs. To evaluate the effectiveness of our approach, we applied Audee to test four DL frameworks: TensorFlow, PyTorch, CNTK, and Theano. We generated a large number of DNNs that cover 25 widely used APIs in the four frameworks. The results demonstrate that Audee is effective in detecting inconsistencies, crashes and NaN errors. In total, 26 unique unknown bugs were discovered, and 7 of them have already been confirmed or fixed by the developers.
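
To make the cross-reference check and the NaN check concrete, the following is a minimal sketch and not Audee's implementation. It assumes two hypothetical stand-ins for framework backends (`naive_softmax` and `stable_softmax`, both invented here for illustration) that compute the same layer, runs them on the same input, and reports any backend whose output contains NaN as well as any pair of backends whose outputs differ by more than an assumed tolerance `tol`.

```python
import math


def _exp(x):
    # math.exp raises OverflowError for very large x; map that to +inf,
    # which is what an IEEE 754 floating-point kernel would produce.
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf


def naive_softmax(xs):
    # Hypothetical "backend A": no max-subtraction, so outlier inputs
    # overflow to inf and inf / inf yields NaN.
    exps = [_exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs):
    # Hypothetical "backend B": subtracts the maximum before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for real frameworks; the names are assumptions.
BACKENDS = {"naive": naive_softmax, "stable": stable_softmax}


def cross_check(backends, xs, tol=1e-6):
    """Run every backend on the same input and flag NaNs and disagreements."""
    names = sorted(backends)
    outputs = {name: backends[name](xs) for name in names}

    # NaN check: any backend whose output contains NaN.
    nan_backends = [
        name for name in names if any(math.isnan(v) for v in outputs[name])
    ]

    # Cross-reference check: pairwise element-wise comparison.
    inconsistent = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diffs = [abs(u - v) for u, v in zip(outputs[a], outputs[b])]
            # A NaN difference never compares greater than tol, so test it
            # explicitly.
            if any(math.isnan(d) or d > tol for d in diffs):
                inconsistent.append((a, b))

    return {"nan": nan_backends, "inconsistent": inconsistent}


if __name__ == "__main__":
    print(cross_check(BACKENDS, [1.0, 2.0, 3.0]))   # ordinary input
    print(cross_check(BACKENDS, [1000.0, 0.0]))     # outlier input
```

On the ordinary input the two toy backends agree, while the outlier input drives the naive variant to NaN and so also triggers an inconsistency. This mirrors, in miniature, why generating inputs or weights that produce very large or very small values is a plausible way to surface NaN errors.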

Index Terms

Software and its engineering
  Software creation and management
    Software verification and validation
      Software defect analysis
        Software testing and debugging

Published in
ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020
1449 pages
ISBN: 9781450367684
DOI: 10.1145/3324884
General Chair: John Grundy
Program Chairs: Claire Le Goues, David Lo
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bug detection
deep learning frameworks
deep learning testing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 82 of 337 submissions, 24%