ABSTRACT
Today's large multi-core Internet servers support thousands of concurrent connections, or flows. The computational ability of future server platforms will depend on increasing numbers of cores. The key to ensuring that performance scales with cores is to design systems software and hardware to fully exploit the parallelism inherent in independent network flows. This paper identifies the major bottlenecks to scalability for a reference server workload on a commercial server platform. Performance scaling on commercial web servers has, however, proven elusive. We determined that on a web server running a modified SPECweb2005 Support workload, throughput scales only 4.8x on eight cores. Our results show that the operating system, TCP/IP stack, and application exploited flow-level parallelism well, with few exceptions, and that load imbalance and shared cache had little effect on performance. Having eliminated these potential bottlenecks, we determined that performance scaling was limited by the capacity of the address bus, which became saturated when all eight cores were in use. If this key obstacle is addressed, commercial web servers and systems software are well positioned to scale to a large number of cores.
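To put the reported scaling figure in context, the short sketch below computes parallel efficiency, meaning the fraction of ideal linear scaling actually achieved. The 4.8x speedup and the eight-core count come from the abstract; the function name `parallel_efficiency` is a hypothetical helper introduced only for illustration and is not taken from the paper.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return the fraction of ideal linear scaling achieved (speedup / cores)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures from the abstract: throughput scales 4.8x on eight cores.
efficiency = parallel_efficiency(4.8, 8)
print(f"{efficiency:.0%} of linear scaling")
```

Under these figures the server reaches roughly 60% of linear scaling on eight cores, which is the gap the paper attributes to address bus saturation.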
Index Terms
- Performance scalability of a multi-core web server