research-article

Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT

Published: 27 June 2017

Abstract

By using resource-sharing field-programmable gate array (FPGA) compute engines, we can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable FPGA-based horizontally microcoded compute engine designed to highly utilize floating-point (FP) functional units (FUs), can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For these eight benchmark applications, we show that: (i) a base TILT configuration with a single instance of each FU type can improve performance over a soft scalar CPU by 15.8×, while requiring on average 26% of the custom datapaths’ area; (ii) selectively increasing the number of FUs can more than double TILT’s average throughput, reducing the custom-datapath throughput gap from 576× to 14×; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27×, while replicated instances of application-tuned TILT configurations can reduce the gap to an average of 5.22×, and to as little as 3.41× for the Matrix Multiply benchmark. Last, we present methods for design space reduction, and we correctly predict the computationally densest design for seven out of eight benchmarks.

      • Published in

        ACM Transactions on Reconfigurable Technology and Systems, Volume 10, Issue 3
        September 2017
        187 pages
        ISSN: 1936-7406
        EISSN: 1936-7414
        DOI: 10.1145/3102109
        • Editor: Steve Wilton

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 June 2017
        • Accepted: 1 March 2017
        • Revised: 1 December 2016
        • Received: 1 February 2016

        Qualifiers

        • research-article
        • Research
        • Refereed
