
Audee: automated testing for deep learning frameworks

Published: 27 January 2021

ABSTRACT

Deep learning (DL) has been applied widely, and the quality of DL systems has become crucial, especially for safety-critical applications. Existing work focuses mainly on the quality analysis of DL models but pays little attention to the underlying frameworks on which all DL models depend. In this work, we propose Audee, a novel approach for testing DL frameworks and localizing bugs. Audee adopts a search-based approach and implements three different mutation strategies to generate diverse test cases by exploring combinations of model structures, parameters, weights and inputs. Audee detects three types of bugs: logical bugs, crashes and Not-a-Number (NaN) errors. In particular, for logical bugs, Audee adopts a cross-reference check to detect behavioural inconsistencies across multiple frameworks (e.g., TensorFlow and PyTorch), which may indicate potential bugs in their implementations. For NaN errors, Audee adopts a heuristic-based approach to generate DNNs that tend to output outliers (i.e., excessively large or small values), which are likely to produce NaN. Furthermore, Audee leverages a causal-testing-based technique to localize the layers and parameters that cause inconsistencies or bugs. To evaluate the effectiveness of our approach, we applied Audee to four DL frameworks: TensorFlow, PyTorch, CNTK, and Theano. We generated a large number of DNNs covering 25 widely used APIs in the four frameworks. The results demonstrate that Audee is effective in detecting inconsistencies, crashes and NaN errors. In total, 26 unique, previously unknown bugs were discovered, and 7 of them have already been confirmed or fixed by the developers.
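The cross-reference oracle described above can be pictured as a differential test: run the same operation in two frameworks on the same randomly generated weights and inputs, then flag NaN outputs and large behavioural divergences. The following is a minimal sketch of that idea, not Audee's actual implementation; the convolution test case, the tolerance ATOL, and the helper names conv_tf and conv_torch are illustrative assumptions introduced for this example.

```python
import numpy as np
import tensorflow as tf
import torch

ATOL = 1e-4  # illustrative inconsistency threshold, not Audee's actual setting

def conv_tf(x, w):
    """Conv2D in TensorFlow: input is NHWC, filters are HWIO."""
    return tf.nn.conv2d(tf.constant(x), tf.constant(w),
                        strides=[1, 1, 1, 1], padding="VALID").numpy()

def conv_torch(x, w):
    """The same Conv2D in PyTorch: input is NCHW, weights are OIHW."""
    y = torch.nn.functional.conv2d(
        torch.from_numpy(x).permute(0, 3, 1, 2),   # NHWC -> NCHW
        torch.from_numpy(w).permute(3, 2, 0, 1))   # HWIO -> OIHW
    return y.permute(0, 2, 3, 1).numpy()           # NCHW -> NHWC

# Random weights and inputs stand in for one mutated test case.
rng = np.random.default_rng(seed=0)
x = rng.standard_normal((1, 8, 8, 3)).astype(np.float32)
w = rng.standard_normal((3, 3, 3, 4)).astype(np.float32)

y_tf, y_pt = conv_tf(x, w), conv_torch(x, w)

# NaN oracle: outlier-inducing parameters can drive an output to NaN.
if np.isnan(y_tf).any() or np.isnan(y_pt).any():
    print("NaN error detected")

# Cross-reference oracle: a large divergence flags a potential logical bug.
diff = np.abs(y_tf - y_pt).max()
print(f"max elementwise difference: {diff:.2e}")
if diff > ATOL:
    print("behavioural inconsistency detected")
```

In Audee's workflow, such test cases are produced by search-based mutation of model structures, parameters, weights and inputs rather than by a fixed random seed, and divergences above the tolerance are handed to the causal-testing step to localize the responsible layer or parameter.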


Published in

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020, 1449 pages
ISBN: 9781450367684
DOI: 10.1145/3324884

Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 27 January 2021

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 82 of 337 submissions, 24%
