Abstract
By using resource-sharing field-programmable gate array (FPGA) compute engines, we can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable, FPGA-based, horizontally microcoded compute engine designed for high utilization of floating-point (FP) functional units (FUs), can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For the eight benchmark applications, we show that: (i) a base TILT configuration with a single instance of each FU type improves performance over a soft scalar CPU by 15.8×, while requiring on average 26% of the custom datapaths' area; (ii) selectively increasing the number of FUs can more than double TILT's average throughput, reducing the throughput gap relative to the custom datapaths from 576× to 14×; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27×, while replicated instances of application-tuned TILT configurations can reduce it to an average of 5.22×, and to as low as 3.41× for the Matrix Multiply benchmark. Finally, we present methods for design space reduction, and we correctly predict the computationally densest design for seven of the eight benchmarks.
Index Terms
- Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT