ABSTRACT
We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively parallel processing power of the GPU to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that the GPU delivers significantly higher throughput than the CPU-only implementation, confirming the effectiveness of the GPU for computation-intensive and memory-intensive operations in packet processing.
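To put the headline figure in perspective, the following minimal sketch converts a bit rate into a packet rate for fixed-size frames. The 39 Gbps throughput and 64B packet size come from the abstract. The helper name `packets_per_second` and the per-packet overhead values are assumptions made for illustration: the abstract does not say whether the 39 Gbps figure counts Ethernet framing overhead, so both cases are shown.

```python
def packets_per_second(throughput_bps: float, frame_bytes: int,
                       overhead_bytes: int = 0) -> float:
    """Convert a bit rate into a packet rate for fixed-size frames.

    overhead_bytes is the assumed on-wire overhead per packet that is
    counted toward throughput but is not part of the frame itself.
    """
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return throughput_bps / bits_per_packet


if __name__ == "__main__":
    # 39 Gbps and 64 B are taken from the abstract.
    # Overhead values are assumptions: 0 B (no framing overhead counted)
    # and 20 B (Ethernet preamble plus inter-frame gap).
    for overhead in (0, 20):
        rate = packets_per_second(39e9, 64, overhead)
        print(f"overhead={overhead:2d} B -> {rate / 1e6:.1f} Mpps")
```

Whichever accounting applies, the result is in the tens of millions of packets per second, which leaves only a small per-packet time budget. That is the scale at which the CPU bottleneck described above becomes relevant.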
PacketShader: a GPU-accelerated software router. SIGCOMM '10.