research-article

A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications

Authors:
Jeremy Fowers, University of Florida, Gainesville, FL, USA
Greg Brown, University of Florida, Gainesville, FL, USA
Patrick Cooke, University of Florida, Gainesville, FL, USA
Greg Stitt, University of Florida, Gainesville, FL, USA

FPGA '12: Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays, February 2012, Pages 47–56
https://doi.org/10.1145/2145694.2145704

Published: 22 February 2012

ABSTRACT

With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedups of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
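
The abstract does not define a sliding-window application. As a rough illustration only, the sketch below slides a small kernel over every fully contained position of a 2D image and computes a sum of absolute differences (SAD) at each position. The choice of SAD, the function name sliding_window_sad, and the list-of-lists data layout are assumptions made for this example; they are not taken from the paper's implementations.

```python
def sliding_window_sad(image, kernel):
    """Compute a sum-of-absolute-differences score for each window position.

    Hypothetical helper for illustration: `image` and `kernel` are
    rectangular lists of lists of numbers, and only windows that lie
    fully inside the image are evaluated.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    scores = []
    # Slide the window's top-left corner over every valid (y, x) position.
    for y in range(img_h - k_h + 1):
        row = []
        for x in range(img_w - k_w + 1):
            total = 0
            # Compare the kernel against the image window anchored at (y, x).
            for dy in range(k_h):
                for dx in range(k_w):
                    total += abs(image[y + dy][x + dx] - kernel[dy][dx])
            row.append(total)
        scores.append(row)
    return scores


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 2],
          [4, 5]]
print(sliding_window_sad(image, kernel))  # [[0, 4], [12, 16]]
```

Each window position is independent of the others, and neighbouring windows reuse most of the same input pixels. Both properties are typical of this style of computation and are what make it a natural candidate for parallel devices.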


Index Terms

A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications
1. Computer systems organization
  1. Embedded and cyber-physical systems
  2. Real-time systems
2. General and reference
  1. Cross-computing tools and techniques
    1. Design


Published in
FPGA '12: Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
February 2012
352 pages
ISBN: 9781450311557
DOI: 10.1145/2145694
General Chair: Katherine Compton, University of Wisconsin-Madison
Program Chair: Brad Hutchings, Brigham Young University
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
FPGA
GPU
multicore
parallelism
sliding window
speedup
Acceptance Rates
FPGA '12 Paper Acceptance Rate: 20 of 87 submissions, 23%
Overall Acceptance Rate: 125 of 627 submissions, 20%
Article Metrics
- Total Citations: 206
- Total Downloads: 2,244
- Downloads (Last 12 months): 61
- Downloads (Last 6 weeks): 5