Abstract
Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This maximizes throughput, but only when there are enough threads to keep the hardware utilized. For applications or phases with more limited parallelism, we describe how to create an out-of-order processor on the fly by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.
Index Terms
- Federation: Boosting per-thread performance of throughput-oriented manycore architectures