Abstract
Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such application is the acceleration of scientific computation, in which solving systems of linear equations is a commonplace task. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. We show that through parallelization it is possible to reduce the computation time per iteration for an order n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable, converging to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation achieves a speedup of at least an order of magnitude.
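For orientation, the following is a minimal software sketch of the textbook CG iteration for a dense symmetric positive-definite system. It is not the hardware design described in the abstract; the function names (`dot`, `matvec`, `conjugate_gradient`) and the default tolerance are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product: n rows, n multiply-adds per row."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a dense symmetric positive-definite A.

    Textbook CG; names and defaults here are illustrative assumptions.
    """
    n = len(b)
    if max_iter is None:
        max_iter = n          # exact arithmetic converges in at most n steps
    x = [0.0] * n
    r = list(b)               # residual b - A x for the initial guess x = 0
    p = list(r)               # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)     # n*n multiply-adds: the dominant cost per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # Small symmetric positive-definite example.
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

In this sequential form each iteration is dominated by the matrix-vector product, which performs n² multiply-adds. The Θ(n) cycle count stated in the abstract presumably follows from evaluating the n multiply-adds of a row in parallel, leaving one row per pipeline step.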
A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices