DOI: 10.1145/1323548.1323562

Performance scalability of a multi-core web server

Published: 03 December 2007

ABSTRACT

Today's large multi-core Internet servers support thousands of concurrent connections or flows. The computation ability of future server platforms will depend on increasing numbers of cores. The key to ensuring that performance scales with cores is to design systems software and hardware to fully exploit the parallelism inherent in independent network flows. However, performance scaling on commercial web servers has proven elusive. This paper identifies the major bottlenecks to scalability for a reference server workload on a commercial server platform. We determined that on a web server running a modified SPECweb2005 Support workload, throughput scales only 4.8x on eight cores. Our results show that the operating system, TCP/IP stack, and application exploited flow-level parallelism well, with few exceptions, and that load imbalance and shared caches affected performance little. Having eliminated these potential bottlenecks, we determined that performance scaling was limited by the capacity of the address bus, which became saturated with all eight cores active. If this key obstacle is addressed, commercial web servers and systems software are well-positioned to scale to a large number of cores.



          Reviews

          Carlos Juiz

The Internet provides a computing scenario where clients communicate with Web servers through mutually independent connections. If Internet server application processing and the associated protocol processing of a connection (flow) are done exclusively on a single central processing unit (CPU) core, minimal data sharing and synchronization between flows is expected. The computation ability of future servers will depend on increasing the number of cores.

          This interesting paper "identifies the major bottlenecks to scalability for a reference server workload on a commercial server platform." To test their hypothesis, the authors "set up [a] test server running a well-tuned Apache HTTP server and [the] Linux operating system. The server had eight cores with pairs of cores sharing L2 cache." The experiments show that the test server, running a modified SPECweb2005 Support workload, achieved only a 4.8-times speedup in throughput, compared to the ideal eight times; official SPECweb2005 results show similar scaling problems.

          This work provides "insights on the key causes of poor scalability of a Web server," and also provides "the analysis methodology leading to these insights." This latter feature makes the paper more interesting than the findings themselves, since the main bottleneck of the multicore server is the bus and the snoopy protocol for sharing it. The authors determined that the main cause of poor scaling is the capacity of the address bus: it reached 77 percent utilization on eight cores, which is considered fully saturated. Other results showed that the number of cache misses per byte remained nearly constant as the number of cores increased, and that a cache shared between cores on the same bus had little effect on performance.

          Profiling nevertheless revealed some scalability obstacles in software. "Increasing hash table capacities and reducing dependence on linked lists" as the workload increases should fix these problems. "In the kernel, flow-level parallelism broke down in the file-system directory cache," which was widely shared; the authors propose that "a possible workaround would be to maintain alternate directory trees for each core." In conclusion, the remaining obstacle to scaling performance with the number of cores is address bus capacity. As stated, "directories (and directory caches) can be used to replace snoopy cache coherence," at the price of additional cost and latency. Further studies should verify this last hypothesis for real workloads.

          Online Computing Reviews Service
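The flow-level parallelism discussed above rests on steering every packet of a connection to a single core, as RSS-capable NICs do by hashing the flow's 4-tuple into a per-core queue. A minimal sketch of that idea follows; the function name, the CRC32 hash, and the eight-core count used here are illustrative choices, not details taken from the paper (real NICs typically use a Toeplitz hash with an indirection table).

```python
import zlib

NUM_CORES = 8  # illustrative; matches the eight-core test server discussed above

def core_for_flow(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Map a TCP 4-tuple to a core index, RSS-style.

    Because the hash depends only on the 4-tuple, every packet of a
    given flow lands on the same core, so per-flow state needs no
    cross-core sharing or synchronization.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % NUM_CORES

# Packets of the same flow always map to the same core:
first = core_for_flow("10.0.0.1", 40000, "10.0.0.2", 80)
again = core_for_flow("10.0.0.1", 40000, "10.0.0.2", 80)
assert first == again
assert 0 <= first < NUM_CORES
```

Many independent flows hash roughly uniformly across cores, which is why, as the review notes, load imbalance contributed little to the scaling loss.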

Published in

            ANCS '07: Proceedings of the 3rd ACM/IEEE Symposium on Architecture for networking and communications systems
            December 2007
            212 pages
            ISBN:9781595939456
            DOI:10.1145/1323548

            Copyright © 2007 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 3 December 2007


            Qualifiers

            • research-article

            Acceptance Rates

ANCS '07 paper acceptance rate: 20 of 70 submissions, 29%. Overall acceptance rate: 88 of 314 submissions, 28%.
