ABSTRACT
To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only one thread accesses shared data at any given time. Critical sections can serialize the execution of threads, which significantly reduces performance and scalability.
This paper proposes Accelerated Critical Sections (ACS), a technique that leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. As a result, ACS reduces serialization: it lowers the likelihood of threads waiting for a critical section to finish. Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34% compared to an equal-area 32-core symmetric CMP and by 23% compared to an equal-area ACMP. Moreover, for 7 out of the 12 workloads, ACS improves scalability by increasing the number of threads at which performance saturates.
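The idea of shipping critical sections to one designated core can be illustrated with a small software analogy. The sketch below is not the mechanism proposed in the paper, which relies on an asymmetric chip; it only assumes that a single dedicated worker thread (the hypothetical `big_core`) stands in for the high-performance core, and that ordinary threads stand in for the smaller cores. The names `big_core`, `critical_section`, `small_core_work`, and `run` are illustrative and do not come from the paper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the high-performance core: one dedicated
# worker that runs every shipped critical section, one at a time.
big_core = ThreadPoolExecutor(max_workers=1)

counter = 0  # shared data that the critical section updates


def critical_section(amount):
    """Update shared state; this only ever runs on the big_core worker."""
    global counter
    counter += amount
    return counter


def small_core_work(iterations):
    """Worker thread: ships each critical section and waits for it."""
    for _ in range(iterations):
        # Parallel (non-critical) work would go here.
        # Ship the critical section to big_core and block until it is done.
        big_core.submit(critical_section, 1).result()


def run(num_threads=4, iterations=1000):
    """Start the worker threads and return the final counter value."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=small_core_work, args=(iterations,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because only the single `big_core` worker ever touches `counter`, no lock is needed in this sketch: mutual exclusion follows from running all critical sections on one executor. In the abstract's terms, the benefit would come from that executor being faster than the threads that wait on it, which a software analogy like this one cannot show.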
Accelerating critical section execution with asymmetric multi-core architectures
ASPLOS 2009