Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT

Authors: Ilian Tili, Kalin Ovtcharov, and J. Gregory Steffan
University of Toronto, Ontario, Canada

ACM Transactions on Reconfigurable Technology and Systems, Volume 10, Issue 3, Article No. 22, pp. 1–23
https://doi.org/10.1145/3079757

Published: 27 June 2017

Abstract

Resource-sharing field-programmable gate array (FPGA) compute engines can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable, FPGA-based, horizontally microcoded compute engine designed to keep floating-point (FP) functional units (FUs) highly utilized, can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For these eight benchmark applications, we show that: (i) a base TILT configuration with a single instance of each FU type improves performance over a soft scalar CPU by 15.8×, while requiring on average 26% of the custom datapaths’ area; (ii) selectively increasing the number of FUs can more than double TILT’s average throughput, reducing the custom-datapath throughput gap from 576× to 14×; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27×, while replicated instances of application-tuned TILT configurations can reduce the gap to an average of 5.22×, and to as low as 3.41× for the Matrix Multiply benchmark. Last, we present methods for design space reduction, and we correctly predict the computationally densest design for seven of the eight benchmarks.
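
The gap figures quoted in the abstract can be related with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not state outright: that the 576× figure is the soft scalar CPU's throughput gap to the custom datapaths, and that averaged speedups can be composed by plain division. The names SCALAR_GAP, BASE_SPEEDUP, TUNED_GAP, and implied_gap are hypothetical and do not come from the article.

```python
# Average figures quoted in the abstract (eight benchmarks).
SCALAR_GAP = 576.0    # assumed: soft scalar CPU vs. custom datapath throughput gap
BASE_SPEEDUP = 15.8   # base TILT configuration vs. soft scalar CPU
TUNED_GAP = 14.0      # gap after selectively increasing the number of FUs


def implied_gap(gap: float, speedup: float) -> float:
    """Gap that remains after a design becomes `speedup` times faster.

    Assumes averaged ratios compose by division, which is only an
    approximation when the figures are means over several benchmarks.
    """
    return gap / speedup


# Gap the base TILT configuration would have under the assumptions above.
base_gap = implied_gap(SCALAR_GAP, BASE_SPEEDUP)

# Throughput factor that adding FUs would need to contribute to reach 14x.
fu_scaling = base_gap / TUNED_GAP

print(f"implied base TILT gap: {base_gap:.1f}x")
print(f"implied gain from extra FUs: {fu_scaling:.2f}x")
```

Under these assumptions the base configuration would sit at a gap of roughly 36×, and the additional FUs would account for a further factor of about 2.6×, which is consistent with the "more than double" claim in point (ii).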

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
    2. Parallel architectures
      1. Very long instruction word

Published in
ACM Transactions on Reconfigurable Technology and Systems, Volume 10, Issue 3
September 2017
187 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3102109
Editor: Steve Wilton, Department of Electrical and Computer Engineering, University of British Columbia, Kaiser 4112, 5500-2332 Main Mall, Vancouver, BC V6T 1Z4, Canada
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 June 2017
- Accepted: 1 March 2017
- Revised: 1 December 2016
- Received: 1 February 2016
Author Tags
FPGA
Soft processors
compiling
computational density
computer architecture
design space
scheduling
throughput
Qualifiers
- research-article
- Research
- Refereed