ABSTRACT
Today's computer architectures are fundamentally different from those of a decade ago: IO devices and interfaces can sustain much higher data rates than a single-threaded CPU can process. For applications to fully exploit these IO resources and meet their computational requirements, the operating system (OS) must run lean, optimized software on the CPUs. Despite these hardware changes, traditional system software unfortunately still relies on the assumptions of a decade ago: IO is slow and the CPU is fast.
This paper makes a case for the data-centric extensible OS, which enables full exploitation of emerging high-performance IO hardware. Based on the idea of minimizing data movement in software, we propose a top-to-bottom lean and optimized architecture that allows applications to customize the OS kernel's IO subsystems with application-provided code. This enables sharing and high-performance IO among applications. Initial microbenchmarks on a Linux prototype, in which we used eBPF to specialize the Linux kernel, show performance improvements of up to 1.8× for database primitives and 4.8× for UNIX utility tools.