AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy

Authors:
Junwhan Ahn (Seoul National University, Seoul, Republic of Korea)
Sungjoo Yoo (Seoul National University, Seoul, Republic of Korea)
Kiyoung Choi (Seoul National University, Seoul, Republic of Korea)

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4, Article No. 34, pp. 1–24
https://doi.org/10.1145/2994149

Published: 25 October 2016

Abstract

In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. To perform aggregation efficiently, we implement simple aggregation operations in main memory and develop a locality-adaptive host architecture for in-memory aggregation, called cache-conscious aggregation. Through this, AIM executes aggregation at the most energy-efficient location among all levels of the memory hierarchy. Moreover, AIM requires only minimal changes to existing sequential programming models and provides a fully automated compiler toolchain, thereby allowing unmodified legacy software to use AIM. Evaluations show that AIM greatly improves both the energy efficiency of main memory and system performance.
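
The idea of running an aggregation where its data has locality can be illustrated with a toy software model. The sketch below is not the AIM hardware or its compiler output. The class name ToyAggregator, its fields, and the LRU policy are all hypothetical choices made for illustration. It assumes the aggregation is a commutative "+=" update: updates that hit in a small cache are combined on the host side, and a partial sum is sent to memory as a single add only when its entry is evicted.

```python
from collections import OrderedDict


class ToyAggregator:
    """Toy model (hypothetical, not the paper's design): '+=' updates are
    combined in a small LRU cache and reach memory as one add per eviction."""

    def __init__(self, memory, cache_lines=2):
        self.memory = memory            # backing store: list of numbers
        self.cache = OrderedDict()      # index -> pending partial sum (LRU order)
        self.cache_lines = cache_lines  # assumed cache capacity, in entries
        self.host_ops = 0               # updates combined inside the cache
        self.memory_ops = 0             # adds performed at the memory side

    def add(self, index, value):
        if index in self.cache:
            # Locality: combine with the pending partial sum on the host.
            self.cache[index] += value
            self.cache.move_to_end(index)
            self.host_ops += 1
            return
        if len(self.cache) >= self.cache_lines:
            # Evict the least recently used entry as a single memory-side add.
            victim, delta = self.cache.popitem(last=False)
            self._memory_add(victim, delta)
        # '+=' is commutative, so a new entry can start from the update
        # itself without first fetching the old value from memory.
        self.cache[index] = value

    def _memory_add(self, index, delta):
        self.memory[index] += delta
        self.memory_ops += 1

    def flush(self):
        # Drain all pending partial sums to memory.
        while self.cache:
            index, delta = self.cache.popitem(last=False)
            self._memory_add(index, delta)


if __name__ == "__main__":
    mem = [0] * 8
    agg = ToyAggregator(mem, cache_lines=2)
    for idx, val in [(0, 1), (0, 2), (5, 10), (0, 3), (7, 4), (5, 1)]:
        agg.add(idx, val)
    agg.flush()
    print(mem, agg.host_ops, agg.memory_ops)
```

In this model, repeated updates to the same index are absorbed by the cache, while updates with no reuse each cost one memory-side add. The final memory contents match what applying every update directly would produce.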

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in
ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4
December 2016
648 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3012405
Editor: Koen De Bosschere, Ghent University
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 October 2016
- Accepted: 1 August 2016
- Revised: 1 July 2016
- Received: 1 May 2016

Author Tags
Processing-in-memory
aggregation
locality-adaptive execution
near-data processing
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 6
- Total Downloads: 691
- Downloads (Last 12 months): 64
- Downloads (Last 6 weeks): 10