Compiler support for near data computing

Authors:
Mahmut Taylan Kandemir

Penn State University

Penn State University
View Profile

,
Jihyun Ryoo

Penn State University

Penn State University
View Profile

,
Xulong Tang

University of Pittsburgh

University of Pittsburgh
View Profile

,
Mustafa Karakoy

TUBITAK-BILGEM, Turkey

TUBITAK-BILGEM, Turkey
View Profile

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingFebruary 2021Pages 90–104https://doi.org/10.1145/3437801.3441600

Published:17 February 2021Publication History

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 90–104

ABSTRACT

Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the existing literature lacks a detailed quantification of the potential of NDC and analysis of compiler optimizations on tapping into that potential. This paper first presents an analysis of the NDC potential when executing multithreaded applications on manycore platforms. It then presents two compiler schemes designed to take advantage of NDC. The first of these schemes try to increase the amount of computation that can be performed in a hardware component, whereas the second compiler strategy strikes a balance between optimizing NDC and exploiting data reuse, by being more selective on when to perform NDC (even if the opportunity presents itself) and how. The collected experimental results on a 5×5 manycore system reveal that our first and second compiler schemes improve the overall performance of our multithreaded applications by, respectively, 22.5% and 25.2%, on average. Furthermore, these two compiler schemes are only 6.8% and 4.1% worse than an oracle scheme that makes the best near data computing decisions for each and every computation.

References

2012. The Architecture and Performance of the TILE-Gx Processor Family. http://www.tilera.com/products/processors/TILE-Gx_Family.Google Scholar
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute Caches. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).Google Scholar
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA).Google ScholarDigital Library
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In Proc. of the International Symposium on Computer Architecture (ISCA).Google ScholarDigital Library
Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA).Google ScholarDigital Library
Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proc. of the International Symposium on Computer Architecture (ISCA).Google ScholarDigital Library
Jennifer M. Anderson and Monica S. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI).Google Scholar
Jeffery M. Arnold, Duncan A. Buell, and Elaine G. Davis. 1992. SPLASH 2. In Proceedings of the Symposium on Parallel Algorithms and Architectures.Google Scholar
Hadi Asghari-Moghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim. 2016. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. In 2016 49th annual IEEE/ACM international symposium on Microarchitecture (MICRO). IEEE, 1--13.Google ScholarDigital Library
Vishal Aslot, Max Domeika, Rudolf Eigenmann, Greg Gaertner, Wesley B. Jones, and Bodo Parady. 2001. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. In OpenMP Shared Memory Parallel Programming, Rudolf Eigenmann and Michael J. Voss (Eds.).Google Scholar
Kristof Beyls and Erik H. D'Hollander. 2009. Refactoring for Data Locality. Computer 42, 2 (2009).Google Scholar
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News (2011).Google ScholarDigital Library
Uday Bondhugula, J. Ramanujam, and et al. 2008. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of Programming Language Design And Implementation (PLDI).Google Scholar
Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. 1994. Compiler Optimizations for Improving Data Locality. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google ScholarDigital Library
John Carter, Wilson Hsieh, Leigh Stoller, Mark Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael Parker, Lambert Schaelicke, and Terry Tateyama. 1999. Impulse: building a smarter memory controller. In Proceedings of International Symposium on High-Performance Computer Architecture.Google ScholarCross Ref
Benjamin Y. Cho, Yongkee Kwon, Sangkug Lym, and Mattan Erez. 2020. Near Data Acceleration with Concurrent Host Access. In ISCA.Google Scholar
Wei Ding, Xulong Tang, Mahmut Kandemir, Yuanrui Zhang, and Emre Kultursay. 2015. Optimizing Off-chip Accesses in Multicores. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google ScholarDigital Library
Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. 2015. DRAMA: An Architecture for Accelerated Processing Near Memory. IEEE Computer Architecture Letters 14, 1 (2015).Google ScholarDigital Library
Sílvio Fernandes, Bruno C. Oliveira, and Ivan Saraiva Silva. 2009. Using NoC Routers as Processing Elements. In Proceedings of the Symposium on Integrated Circuits and System Design: Chip on the Dunes.Google ScholarDigital Library
Pierfrancesco Foglia, Cosimo A. Prete, Marco Solinas, and Giovanna Monni. 2010. Re-NUCA: Boosting CMP Performance Through Block Replication. In Proc. of the Euromicro Conference on Digital System Design: Architectures, Methods and Tools.Google ScholarDigital Library
Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian Zhang, Yangang Wang, Chunbo Zhou, and Guangwen Yang. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (21 Jun 2016), 072001. Google ScholarCross Ref
Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical near-data processing for in-memory analytics frameworks. In 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 113--124.Google ScholarDigital Library
Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1999. Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior. ACM Trans. Program. Lang. Syst. (TOPLAS) (1999).Google ScholarDigital Library
Maya Gokhale, Bill Holmes, and Ken Iobst. 1995. Processing in Memory: the Terasys Massively Parallel PIM Array. IEEE Computer (1995).Google Scholar
Peng Gu, yufei Ding, Guoyang Chen, Weifeng Zhang, Dimin Niu, and Yuan Xie. 2020. iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture. In ISCA.Google Scholar
Ramyad Hadidi, Lifeng Nai, Hyojong Kim, and Hyesoon Kim. 2017. CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory. Trans. Archit. Code Optim. 14, 4 (2017).Google Scholar
Mary H. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. 1995. Detecting Coarse-grain Parallelism Using an Interprocedural Parallelizing Compiler. In Supercomputing.Google ScholarDigital Library
Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2016. Accelerating Dependent Cache Misses with an Enhanced Memory Controller. In Proccedings of the International Symposium on Computer Architecture (ISCA).Google Scholar
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent near-Data Processing in GPU Systems. In Proc. of the International Symposium on Computer Architecture.Google Scholar
Joe Jeddeloh and Brent Keeth. 2012. Hybrid memory cube new DRAM architecture increases density and performance. In Proc. of the Symposium on VLSI Technology (VLSIT).Google ScholarCross Ref
Yuho Jin. 2015. Unifying Router Power Gating with Data Placement for Energy-Efficient NoC. In Proc. of the International Symposium on Computer Architecture and High Performance Computing.Google ScholarDigital Library
M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Comput. (2001).Google Scholar
Mahmut Kandemir, Yuanrui Zhang, Jun Liu, and Taylan Yemliha. 2011. Neighborhood-Aware Data Locality Optimization for NoC-Based Multicores. In Proc. of the International Symposium on Code Generation and Optimization.Google ScholarCross Ref
Mahmut Taylan Kandemir, Jihyun Ryoo, Xulong Tang, and Mustafa Karakoy. 2021. Compiler Support for Near Data Computing. Technical Report, Department of Computer Science and Engineering, The Pennsylvania State University (2021).Google Scholar
Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, and Kevin Hsieh. 2017. Toward standardized near-data processing with unrestricted data placement for GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--12.Google ScholarDigital Library
Orhan Kislal, Jagadish Kotra, Xulong Tang, Mahmut Taylan Kandemir, and Myoungsoo Jung. 2018. Enhancing Computation-to-core Assignment with Physical Location Information. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google ScholarDigital Library
Orhan Kislal, Jagadish Kotra, Xulong Tang, Mahmut Taylan Kandemir, and Myoungsoo Jung. 2017. POSTER: Location-Aware Computation Mapping for Manycore Processors.. In Proceedings of the 2017 International Conference on Parallel Architectures and Compilation.Google ScholarCross Ref
Monica S. Lam and Michael E. Wolf. 2004. A Data Locality Optimizing Algorithm. SIGPLAN Not. 39, 4 (2004).Google Scholar
Feihui Li, Guangyu Chen, Mahmut Kandemir, and Ibrahim Kolcu. 2007. Profile-Driven Energy Reduction in Network-on-Chips. SIGPLAN Not. 42, 6 (2007), 394--404.Google ScholarDigital Library
Amy W. Lim, Gerald I. Cheong, and Monica S. Lam. 1999. An Affine Partitioning Algorithm to Maximize Parallelism and Minimize Communication. In ICS.Google ScholarDigital Library
Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan, Yongjian Chen, Haibo Lin, and Tin-Fook Ngai. 2009. Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors. In Proc. of the International Conference on Parallel Architectures and Compilation Techniques (PACT).Google ScholarDigital Library
Chikeung Luk and Todd C. Mowry. 1996. Compiler-based prefetching for recursive data structures. SIGPLAN Not. 31, 9 (1996).Google Scholar
Kathryn S. Mckinley, Steve Carr, and Chauwen Tseng. 1996. Improving Data Locality with Loop Transformations. Transactions on Programming Languages and Systems (TOPLAS) 18, 4 (1996).Google Scholar
Javier Merino, Valentin Puente, and Jose A. Gregorio. 2010. ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture. In Proc. of the International Symposium on High-Performance Computer Architecture.Google Scholar
Javier Merino, Valentín Puente, Pablo Prieto, and José Ángel Gregorio. 2008. SP-NUCA: A Cost Effective Dynamic Non-Uniform Cache Architecture. SIGARCH Comput. Archit. News 36, 2 (2008).Google ScholarDigital Library
Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. 2019. Enabling Practical Processing in and near Memory for Data-Intensive Computing. In Proceedings of the Design Automation Conference 2019.Google ScholarDigital Library
Ashutosh Pattnaik, Xulong Tang, Onur Kayiran, Adwait Jog, Asit Mishra, Mahmut T. Kandemir, Anand Sivasubramaniam, and Chita R. Das. 2019. Opportunistic Computing in GPU Architectures. In Proceedings of the International Symposium on Computer Architecture.Google Scholar
Ashutosh Pattnaik, Xulong Tang, Onur Kayiran, Adwait Jog, Asit Mishra, Mahmut T Kandemir, Anand Sivasubramaniam, and Chita R Das. 2019. Opportunistic computing in gpu architectures. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). IEEE, 210--223.Google ScholarDigital Library
Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In Proc. of the International Symposium on Performance Analysis of Systems and Software (ISPASS).Google ScholarCross Ref
Muhammad M. Rafique and Zhichun Zhu. 2018. CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for Hybrid Memory Cube. In Proc. of the International Conference on Parallel Processing.Google Scholar
Qingchuan Shi, Farrukh Hijaz, and Omer Khan. 2013. Towards efficient dynamic data placement in NoC-based multicores. In Proc. of the International Conference on Computer Design (ICCD).Google ScholarCross Ref
Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas. 2017. Pageforge: A near-Memory Content-Aware Page-Merging Architecture. In Proceedings of the International Symposium on Microarchitecture.Google ScholarDigital Library
A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro (2016).Google ScholarDigital Library
Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality. In PLDI.Google Scholar
Thomas L. Sterling and Hans P. Zima. 2002. Gilgamesh: A Multithreaded Processor-in-Memory Architecture for Petaflops Computing. In Proc. of the Conference on Supercomputing.Google Scholar
Harold S. Stone. 1970. A Logic-in-Memory Computer. Computers C-19, 1 (1970).Google Scholar
Xulong Tang, Mahmut Taylan Kandemir, Hui Zhao, Myoungsoo Jung, and Mustafa Karakoy. 2018. Computing with Near Data. Proc. ACM Meas. Anal. Comput. Syst. 2, 3 (2018).Google ScholarDigital Library
Xulong Tang, Orhan Kislal, Mahmut Kandemir, and Mustafa Karakoy. 2017. Data Movement Aware Computation Partitioning. In Proc. of the International Symposium on Microarchitecture.Google ScholarDigital Library
Xulong Tang, Mahmut Taylan Kandemir, Mustafa Karakoy, and Meena Arunachalam. 2019. Co-Optimizing Memory-Level Parallelism and Cache-Level Parallelism. In Proceedings of the 40th annual ACM SIGPLAN conference on Programming Language Design and Implementation.Google ScholarDigital Library
Gabriel Urzaiz, David Villa, Felix Villanueva, and Juan Carlos Lopez. 2012. Process-in-Network: A Comprehensive Network Processing Approach. Sensors (Basel) 12, 6 (2012), 8112--8134.Google ScholarCross Ref
S. Verdoolaege, M. Bruynooghe, G. Janssens, and P. Catthoor. 2003. Multi-dimensional incremental loop fusion for data locality. In ASAP.Google Scholar
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating System Support for Improving Data Locality on CCNUMA Compute Servers. In ASPLOS.Google Scholar
M. E. Wolf and M. S. Lam. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems (1991).Google Scholar
Michael Wolfe. 1995. high performance compilers for parallel computing.Google Scholar
Xu Yang, Yumin Hou, and Hu He. 2019. A Processing-in-Memory Architecture Programming Paradigm for Wireless Internet-of-Things Applications. Sensors (Basel) 19, 1 (2019), 140.Google ScholarCross Ref

Index Terms

Compiler support for near data computing
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Recommendations

Performance Characterization of Parallel Discrete Event Simulation on Knights Landing Processor
SIGSIM-PADS '17: Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation

Performance and scalability of Parallel Discrete Event Simulation (PDES) is often limited by fine-grain communication, especially in execution environments with high communication cost. However, the low cost of on-chip communication in emerging many-...
Read More
Thread-Level Speculation Execution Model Based on LLVM Compiler
CNIOT '21: Proceedings of the 2021 2nd International Conference on Computing, Networks and Internet of Things

With the trend of growing number of processing cores on Chip Multiprocessors, researchers have made a lot of efforts to make full use of core resources through extracting programs’ parallelism. Thread-Level Speculation (TLS) can speculatively ...
Read More
Affinity Alloc: Taming Not-So Near-Data Computing
MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture

To mitigate the data movement bottleneck on large multicore systems, the near-data computing paradigm (NDC) offloads computation to where the data resides on-chip. The benefit of NDC heavily depends on spatial affinity, where all relevant data are in ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN:9781450382946
DOI:10.1145/3437801
General Chair:
Jaejin Lee
Seoul National University, South Korea
,
Program Chair:
Erez Petrank
Technion, Israel
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 February 2021
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
code transformation
data locality
manycore architectures
near-data computing
Qualifiers
- research-article
Conference

Acceptance Rates
PPoPP '21 Paper Acceptance Rate31of150submissions,21%Overall Acceptance Rate230of1,014submissions,23%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 9
  Total Citations
  View Citations
- 847
  Total Downloads
- Downloads (Last 12 months)156
- Downloads (Last 6 weeks)16
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Compiler support for near data computing

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Performance Characterization of Parallel Discrete Event Simulation on Knights Landing Processor

Thread-Level Speculation Execution Model Based on LLVM Compiler

Affinity Alloc: Taming Not-So Near-Data Computing