Federation: Boosting per-thread performance of throughput-oriented manycore architectures

Published: 30 December 2010

Abstract

Manycore architectures designed for parallel workloads are likely to use simple, highly multithreaded, in-order cores. This maximizes throughput, but only with enough threads to keep hardware utilized. For applications or phases with more limited parallelism, we describe creating an out-of-order processor on-the-fly, by federating two neighboring in-order cores. We reuse the large register file in the multithreaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. The resulting federated core provides twice the single-thread performance of the underlying in-order core, allowing the architecture to efficiently support a wider range of parallelism.

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 7, Issue 4
      December 2010
      167 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/1880043
      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 30 December 2010
      • Accepted: 1 August 2010
      • Revised: 1 May 2010
      • Received: 1 April 2008

      Qualifiers

      • research-article
      • Research
      • Refereed
