DOI: 10.1145/3317550.3321443

A Case for Managed and Model-less Inference Serving

Published: 13 May 2019

ABSTRACT

The number of applications relying on inference from machine learning models, especially neural networks, is already large and expected to keep growing. For instance, Facebook applications issue tens of trillions of inference queries per day with varying performance, accuracy, and cost constraints. Unfortunately, today's inference serving systems are neither easy to use nor cost effective. Developers must manually match the performance, accuracy, and cost constraints of their applications to a large design space: selecting the right model and model optimizations, selecting the right hardware architecture, selecting the right scale-out factor, and avoiding cold-start effects. These interacting decisions are difficult to make, especially when application load, the applications themselves, and the available resources all vary over time.

If we want an increasing number of applications to use machine learning, we must automate the decisions that affect ease of use, performance, and cost efficiency for both users and providers. Hence, we define and make the case for managed and model-less inference serving. In this paper, we identify and discuss open research directions to realize this vision.
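To make the "model-less" idea concrete, one can imagine the developer declaring only constraints while the serving system chooses among registered model variants. The sketch below is purely illustrative and not from the paper: the `ModelVariant` fields, the profiled numbers, and the cheapest-feasible selection policy are all hypothetical placeholders for the kind of decision the authors argue should be automated.

```python
# Hypothetical sketch of model-less variant selection: the developer
# declares latency and accuracy constraints; the system (not the
# developer) picks the cheapest registered variant that satisfies them.
# All names and numbers here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class ModelVariant:
    name: str
    accuracy: float      # e.g. top-1 accuracy of this variant
    latency_ms: float    # profiled p99 latency on its target hardware
    cost_per_1k: float   # dollars per 1000 inference queries


def select_variant(variants, max_latency_ms, min_accuracy):
    """Pick the cheapest variant meeting the declared constraints."""
    feasible = [v for v in variants
                if v.latency_ms <= max_latency_ms
                and v.accuracy >= min_accuracy]
    if not feasible:
        raise ValueError("no registered variant satisfies the constraints")
    return min(feasible, key=lambda v: v.cost_per_1k)


variants = [
    ModelVariant("resnet50-gpu",   accuracy=0.76, latency_ms=8,  cost_per_1k=0.40),
    ModelVariant("resnet50-cpu",   accuracy=0.76, latency_ms=45, cost_per_1k=0.15),
    ModelVariant("squeezenet-cpu", accuracy=0.58, latency_ms=12, cost_per_1k=0.05),
]

# Under a 50 ms / 0.70-accuracy requirement, both ResNet variants are
# feasible and the CPU one is cheaper.
choice = select_variant(variants, max_latency_ms=50, min_accuracy=0.70)
print(choice.name)  # resnet50-cpu
```

In a real managed system this selection would also have to react to load, resource availability, and cold-start effects over time, which is precisely why the paper argues the provider, not the developer, should own the decision.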


Published in

HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems
May 2019, 227 pages
ISBN: 9781450367271
DOI: 10.1145/3317550
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
