AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy

Published: 25 October 2016

Abstract

In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. To perform aggregation efficiently, we implement simple aggregation operations in main memory and develop a locality-adaptive host architecture for in-memory aggregation, called cache-conscious aggregation. As a result, AIM executes aggregation at the most energy-efficient location among all levels of the memory hierarchy. Moreover, AIM requires only minimal changes to existing sequential programming models and provides a fully automated compiler toolchain, thereby allowing unmodified legacy software to use AIM. Evaluations show that AIM greatly improves both the energy efficiency of main memory and system performance.
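
As a rough illustration of the kind of loop the abstract has in mind, the sketch below shows a plain sequential aggregation: each value is folded into an accumulator chosen by a data-dependent key. The abstract does not list the specific operations AIM supports, so the choice of summation is an assumption, and the names `aggregate_sum`, `destinations`, and `contributions` are hypothetical. Nothing here models AIM's hardware or compiler; it only shows the software-level pattern.

```python
from collections import defaultdict


def aggregate_sum(keys, values):
    """Sequentially fold each value into the accumulator selected by its key."""
    totals = defaultdict(float)
    for key, value in zip(keys, values):
        # Read-modify-write on a data-dependent location. How well this
        # hits in the cache depends entirely on the key sequence.
        totals[key] += value
    return dict(totals)


# Hypothetical input: five contributions flowing into three destinations.
destinations = [2, 0, 2, 1, 0]
contributions = [1.0, 0.5, 2.0, 4.0, 0.25]
result = aggregate_sum(destinations, contributions)
```

Because addition is commutative and associative, the per-key updates in such a loop can in principle be applied wherever the accumulator currently resides, which is the property that makes this pattern a plausible candidate for execution at different levels of the memory hierarchy.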

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4
      December 2016
      648 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/3012405

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 October 2016
      • Accepted: 1 August 2016
      • Revised: 1 July 2016
      • Received: 1 May 2016

      Qualifiers

      • research-article
      • Research
      • Refereed
