research-article

Improving Multibank Memory Access Parallelism with Lattice-Based Partitioning

Published: 09 January 2015

Abstract

Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory partitioning for such architectures, taking into account potentially parallel data accesses to physically independent banks. Targeted at affine static control parts (SCoPs), the technique relies on the Z-polyhedral model for program analysis and adopts a partitioning scheme based on integer lattices. The approach enables the definition of a solution space including previous works as particular cases. The problem of minimizing the total amount of memory required across the partitioned banks, referred to as storage minimization throughout the article, is tackled by an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The article also presents a prototype toolchain and a detailed step-by-step case study demonstrating the impact of the proposed technique along with extensive comparisons with alternative approaches in the literature.
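To make the idea of lattice-based bank assignment concrete, the following is a minimal sketch and not the article's algorithm. It assumes a hypothetical 5-point stencil on a two-dimensional array, five banks, and the illustrative lattice {(i, j) : i + 2j ≡ 0 (mod 5)}; each coset of that lattice is treated as one bank. The names `STENCIL`, `COEFFS`, `bank_of`, `conflict_free`, and `bank_sizes` are invented for this example.

```python
from itertools import product

# Assumption: a 5-point stencil (centre, north, south, west, east).
STENCIL = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

# Assumption: five banks, one per coset of the lattice
# {(i, j) : i + 2*j == 0 (mod 5)}. The coefficients are illustrative only.
NUM_BANKS = 5
COEFFS = (1, 2)


def bank_of(i, j):
    """Return the bank (lattice coset index) that stores element (i, j)."""
    return (COEFFS[0] * i + COEFFS[1] * j) % NUM_BANKS


def conflict_free(rows, cols):
    """Check that every interior stencil instance touches distinct banks."""
    for i, j in product(range(1, rows - 1), range(1, cols - 1)):
        banks = {bank_of(i + di, j + dj) for di, dj in STENCIL}
        if len(banks) != len(STENCIL):
            return False
    return True


def bank_sizes(rows, cols):
    """Count how many array elements land in each bank."""
    sizes = [0] * NUM_BANKS
    for i, j in product(range(rows), range(cols)):
        sizes[bank_of(i, j)] += 1
    return sizes


if __name__ == "__main__":
    print(conflict_free(10, 10))
    sizes = bank_sizes(8, 8)
    # If every bank is sized to the largest one, the difference between the
    # allocated and the required storage is the waste for this array shape.
    print(sizes, NUM_BANKS * max(sizes) - sum(sizes))
```

In this sketch the five stencil accesses fall into five different banks, so they could be served in the same cycle. The last line of the usage example shows how unevenly filled banks translate into wasted storage, which is the quantity the storage minimization problem in the abstract is concerned with.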

          • Published in

            ACM Transactions on Architecture and Code Optimization  Volume 11, Issue 4
            January 2015
            797 pages
            ISSN: 1544-3566
            EISSN: 1544-3973
            DOI: 10.1145/2695583

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 9 January 2015
            • Accepted: 1 October 2014
            • Revised: 1 August 2014
            • Received: 1 March 2014

            Qualifiers

            • research-article
            • Research
            • Refereed
