Improving Multibank Memory Access Parallelism with Lattice-Based Partitioning

Authors:
Alessandro Cilardo

Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Napoli, Italy

Luca Gallo

Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Napoli, Italy


ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4, Article No. 45, pp. 1–25. https://doi.org/10.1145/2675359

Published: 09 January 2015

Abstract

Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory partitioning for such architectures, taking into account potentially parallel data accesses to physically independent banks. Targeted at affine static control parts (SCoPs), the technique relies on the Z-polyhedral model for program analysis and adopts a partitioning scheme based on integer lattices. The approach enables the definition of a solution space including previous works as particular cases. The problem of minimizing the total amount of memory required across the partitioned banks, referred to as storage minimization throughout the article, is tackled by an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The article also presents a prototype toolchain and a detailed step-by-step case study demonstrating the impact of the proposed technique along with extensive comparisons with alternative approaches in the literature.
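
As an illustration only, and not the authors' algorithm or toolchain, the following Python sketch shows the flavor of a lattice-based partition in its simplest case. It assumes a two-dimensional array, a five-point stencil as the parallel access pattern, and a bank function of the form (a*i + b*j) mod N, whose zero set is an integer lattice of index N when gcd(a, b, N) = 1. The names bank, conflict_free, bank_sizes, and STENCIL are hypothetical and do not come from the article.

```python
from itertools import product

# Hypothetical access pattern: offsets read in parallel around element (i, j).
STENCIL = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def bank(i, j, a, b, n_banks):
    """Bank index of element (i, j).

    Elements mapped to bank 0 form the lattice {(i, j) : a*i + b*j = 0 mod n_banks};
    the other banks are its translates (cosets).
    """
    return (a * i + b * j) % n_banks


def conflict_free(offsets, rows, cols, a, b, n_banks):
    """True if all offsets hit distinct banks at every interior position."""
    for i, j in product(range(1, rows - 1), range(1, cols - 1)):
        banks = {bank(i + di, j + dj, a, b, n_banks) for di, dj in offsets}
        if len(banks) != len(offsets):
            return False
    return True


def bank_sizes(rows, cols, a, b, n_banks):
    """Number of array elements stored in each bank."""
    sizes = [0] * n_banks
    for i, j in product(range(rows), range(cols)):
        sizes[bank(i, j, a, b, n_banks)] += 1
    return sizes


if __name__ == "__main__":
    print(conflict_free(STENCIL, 8, 8, 1, 2, 5))   # True
    print(conflict_free(STENCIL, 8, 8, 1, 1, 5))   # False
    print(bank_sizes(10, 10, 1, 2, 5))             # [20, 20, 20, 20, 20]
```

With a = 1, b = 2, and N = 5, the five stencil accesses land in five different banks at every interior position, and a 10 x 10 array is split evenly across the banks. With a = 1 and b = 1, the (1, 0) and (0, 1) neighbors map to the same bank, so the accesses would have to be serialized. Storage waste in the sense of the abstract appears when the bank sizes are unequal and every bank must be dimensioned for the largest one.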


Published in
ACM Transactions on Architecture and Code Optimization Volume 11, Issue 4
January 2015
797 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2695583
Editor: Koen De Bosschere, Ghent University
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 January 2015
- Accepted: 1 October 2014
- Revised: 1 August 2014
- Received: 1 March 2014
Author Tags
Memory partitioning
field-programmable gate arrays
fine-grained distributed shared memory
polyhedral model
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 37
- Total Downloads: 724
- Downloads (last 12 months): 136
- Downloads (last 6 weeks): 9