ABSTRACT
With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers must search a huge design space in which performance and energy metrics vary widely across accelerators, application domains, and use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important application domain, sliding-window applications, executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and identify the use cases in which it is most effective. The results show that FPGAs can achieve speedups of up to 11x and 57x over GPUs and multicores, respectively, while also using orders of magnitude less energy.