A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

Published: 01 January 2010

Abstract

Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. We show that through parallelization it is possible to reduce the computation time per iteration for an order n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation achieves a speedup of at least an order of magnitude.
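
For orientation, the following is a minimal software sketch of the textbook Conjugate Gradient iteration for a dense system Ax = b. It is not the hardware design described in the abstract. It assumes A is symmetric positive definite, and the function names (`mat_vec`, `dot`, `conjugate_gradient`) and the `tol` and `max_iter` parameters are illustrative choices. It shows where the Θ(n²) cost per iteration comes from on a sequential processor: each iteration performs one dense matrix-vector product.

```python
def mat_vec(A, x):
    """Dense matrix-vector product: n*n multiply-adds when run sequentially."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two equal-length vectors: n multiply-adds."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite A (assumed, not checked)."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n          # illustrative cap, not taken from the article
    x = [0.0] * n                  # initial guess x0 = 0
    r = list(b)                    # residual r0 = b - A*x0 = b
    p = list(r)                    # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:    # stop once the residual norm is small
            break
        Ap = mat_vec(A, p)         # dominant cost of each iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # Small symmetric positive definite example; the exact solution is (1/11, 7/11).
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The n row-by-vector inner products inside `mat_vec` are independent of one another, which is the kind of parallelism a hardware implementation can exploit to bring the per-iteration time down toward Θ(n).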

      • Published in

        ACM Transactions on Reconfigurable Technology and Systems, Volume 3, Issue 1
        January 2010
        136 pages
        ISSN: 1936-7406
        EISSN: 1936-7414
        DOI: 10.1145/1661438

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 January 2010
        • Accepted: 1 November 2008
        • Revised: 1 October 2008
        • Received: 1 May 2008

        Qualifiers

        • research-article
        • Research
        • Refereed