Federation: Boosting per-thread performance of throughput-oriented manycore architectures

Authors:
Michael Boyer, University of Virginia, Charlottesville, VA
David Tarjan, University of Virginia, Charlottesville, VA
Kevin Skadron, University of Virginia, Charlottesville, VA
ACM Transactions on Architecture and Code Optimization, Volume 7, Issue 4, Article No. 19, pp. 1–38. https://doi.org/10.1145/1880043.1880046

Published: 30 December 2010

Abstract

Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This design maximizes throughput, but only when there are enough threads to keep the hardware utilized. For applications or phases with more limited parallelism, we describe how to create an out-of-order processor on the fly by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.
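
To make the phrase "reengineer other large, associative structures into simpler lookup tables" concrete, the following is a minimal, generic sketch of that trade-off in software. It is not the design from the article: the class names, the 16-slot table size, and the modulo index function are all assumptions chosen for illustration. An associative structure compares a key against every entry, whereas a direct-indexed table inspects a single slot at the cost of possible conflicts.

```python
class AssociativeQueue:
    """Hypothetical fully associative structure: a lookup checks every entry."""

    def __init__(self):
        self.entries = []  # (address, value) pairs, oldest first

    def insert(self, address, value):
        self.entries.append((address, value))

    def lookup(self, address):
        # In hardware these comparisons would happen in parallel, one
        # comparator per entry; here we scan from youngest to oldest.
        for addr, value in reversed(self.entries):
            if addr == address:
                return value
        return None


class DirectIndexedTable:
    """Hypothetical lookup table: one slot is selected by a simple index."""

    def __init__(self, num_slots=16):  # table size is an assumption
        self.num_slots = num_slots
        self.slots = [None] * num_slots

    def _index(self, address):
        # Assumed index function: low-order bits of the address.
        return address % self.num_slots

    def insert(self, address, value):
        # A newer entry that maps to the same slot overwrites the older one.
        self.slots[self._index(address)] = (address, value)

    def lookup(self, address):
        slot = self.slots[self._index(address)]
        if slot is not None and slot[0] == address:
            return slot[1]
        return None


if __name__ == "__main__":
    cam, table = AssociativeQueue(), DirectIndexedTable()
    for addr, val in [(0x40, 1), (0x44, 2), (0x50, 3)]:
        cam.insert(addr, val)
        table.insert(addr, val)
    print(cam.lookup(0x40), table.lookup(0x40))
```

In this sketch, addresses 0x40 and 0x50 map to the same slot, so the table loses the older entry while the associative queue still finds it. A design built on such a table would need some way to tolerate or recover from these conflicts; how the article handles that is not stated in the abstract.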

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published in

ACM Transactions on Architecture and Code Optimization Volume 7, Issue 4
December 2010
167 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1880043

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 December 2010
- Accepted: 1 August 2010
- Revised: 1 May 2010
- Received: 1 April 2008
Author Tags
CMP
Federation
multicore
out-of-order
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 13
- Total Downloads: 545
- Downloads (Last 12 months): 31
- Downloads (Last 6 weeks): 2