Abstract
In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. To perform aggregation efficiently, we implement simple aggregation operations in main memory and develop a locality-adaptive host architecture for in-memory aggregation, called cache-conscious aggregation. With this design, AIM executes each aggregation at the most energy-efficient location among all levels of the memory hierarchy. Moreover, AIM requires only minimal changes to existing sequential programming models and provides a fully automated compiler toolchain, thereby allowing unmodified legacy software to use AIM. Evaluations show that AIM greatly improves both the energy efficiency of main memory and overall system performance.
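To make the idea concrete, the following is a minimal sketch of the kind of update the abstract appears to target, under stated assumptions. It assumes "aggregation" means a read-modify-write update with a commutative, associative operator (here, addition into a histogram); the abstract does not define the term. The function `choose_location`, its parameters, and the 64-byte line size are hypothetical illustrations of a locality-based placement decision, not the paper's actual mechanism.

```python
def histogram_aggregate(keys, weights, num_bins):
    """Scatter-add aggregation: bins[k] += w for each (k, w) pair.

    Each update is a read-modify-write to a data-dependent address,
    the access pattern assumed here to benefit from in-memory execution.
    """
    bins = [0] * num_bins
    for k, w in zip(keys, weights):
        bins[k] += w
    return bins


def choose_location(address, cached_lines, line_size=64):
    """Hypothetical placement rule (an assumption, not from the source):
    run the update on the host if the target cache line is already
    cached, otherwise send it to main memory.
    """
    line = address // line_size
    return "host" if line in cached_lines else "memory"


if __name__ == "__main__":
    print(histogram_aggregate([0, 2, 2, 1], [1, 5, 3, 2], 3))
    print(choose_location(130, cached_lines={2}))
    print(choose_location(10, cached_lines={2}))
```

Because addition is commutative and associative, the updates in `histogram_aggregate` can be applied in any order and at any level of the hierarchy without changing the result, which is what would make a per-update placement decision safe in a design of this kind.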
Index Terms
- AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy