ABSTRACT
Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on the core that receives the interrupt, while outgoing data processing occurs on the core running the relevant user code. As a result, accesses to read/write connection state (such as TCP control blocks) often involve cache invalidations and data movement between cores' caches. These can take hundreds of processor cycles, enough to significantly reduce performance.
We present a new design, called Affinity-Accept, that causes all processing for a given TCP connection to occur on the same core. Affinity-Accept arranges for the network interface to determine the core on which application processing for each new connection occurs, in a lightweight way; it adjusts the card's choices only in response to imbalances in CPU scheduling. Measurements show that for the Apache web server serving static files on a 48-core AMD system, Affinity-Accept reduces time spent in the TCP stack by 30% and improves overall throughput by 24%.
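The core idea above — keep every packet of a connection on one core so the connection's state stays in one cache — can be sketched with a simple per-flow hash. This is an illustrative model, not the paper's implementation: `flow_hash` and `steer` are hypothetical names, and a real NIC applies its own hash (e.g. Toeplitz, as in RSS) to pick an RX queue.

```python
def flow_hash(src_ip: int, src_port: int, dst_ip: int, dst_port: int) -> int:
    """FNV-1a over the TCP 4-tuple; a stand-in for the NIC's RSS-style hash."""
    h = 0x811C9DC5
    for b in (src_ip.to_bytes(4, "big") + src_port.to_bytes(2, "big") +
              dst_ip.to_bytes(4, "big") + dst_port.to_bytes(2, "big")):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def steer(four_tuple, ncores=48):
    """Pick the core whose RX queue (and per-core accept queue) owns this flow."""
    return flow_hash(*four_tuple) % ncores

# Every packet of one connection hashes to the same core, so interrupt-side
# and application-side processing can share that core's cache instead of
# bouncing the TCP control block between cores.
conn = (0xC0A80001, 34567, 0x0A000002, 80)  # (src IP, src port, dst IP, dst port)
print(steer(conn) == steer(conn))           # stable per-flow mapping
```

Because the mapping is a pure function of the 4-tuple, no shared steering table is needed on the fast path; rebalancing (as Affinity-Accept does in response to CPU scheduling imbalance) only has to adjust which core a hash bucket maps to, not per-packet state.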