DOI: 10.1145/3307650.3322257
research-article
Public Access

Duality cache for data parallel acceleration

Published: 22 June 2019

ABSTRACT

Duality Cache is an in-cache computation architecture that enables general-purpose data-parallel applications to run on caches. This paper presents a holistic approach to building the Duality Cache system stack: techniques for performing in-cache floating-point arithmetic and transcendental functions, a data-parallel execution model, a compiler that accepts existing CUDA programs, and flexibility in adapting to varied workload characteristics.
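In-cache compute architectures of this kind operate bit-serially: each SRAM access processes one bit of every word in an array simultaneously. As an illustrative sketch only (the representation and function name are assumptions, not the paper's implementation), the per-word logic of bit-serial addition looks like a ripple-carry loop over bit positions:

```python
def bit_serial_add(a_bits, b_bits):
    """Ripple-carry addition over little-endian bit lists.

    In an in-SRAM bit-serial design, each loop iteration corresponds to
    one array access that processes this bit position of every word in
    parallel; here we show the logic for a single operand pair.
    Illustrative sketch only -- not the Duality Cache circuit.
    """
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)            # sum bit
        carry = (a & b) | (carry & (a ^ b))  # carry-out
    out.append(carry)
    return out

# 6 (0b110 -> [0, 1, 1]) + 7 (0b111 -> [1, 1, 1]), little-endian
print(bit_serial_add([0, 1, 1], [1, 1, 1]))  # [1, 0, 1, 1] == 13
```

The key property is that the loop body uses only bitwise AND, OR, and XOR, which is what allows the same operation to run on thousands of cache subarrays at once.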

Exploiting the massive parallelism available in the Duality Cache architecture improves performance over a server-class GPU by 3.6× on GPU benchmarks and 4.0× on OpenACC benchmarks. Re-purposing existing caches yields 72.6× better performance for CPUs at only 3.5% area cost. Duality Cache reduces energy by 5.8× relative to GPUs and 21× relative to CPUs.
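Transcendental functions mentioned in the abstract are classically computed with shift-and-add schemes such as CORDIC, which need only additions, shifts, and a small angle table and therefore map well onto bit-serial hardware. A minimal software sketch of CORDIC rotation mode (assumed for illustration; not the paper's actual implementation):

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Approximate (sin, cos) of theta (radians, |theta| <= pi/2) via
    CORDIC rotation mode. Each iteration rotates the vector by
    +/- arctan(2^-i) using only adds and shifts (shown here with
    floating-point multiplies by 2^-i for readability).
    Illustrative sketch only."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Aggregate CORDIC gain; each micro-rotation scales the vector
    # by sqrt(1 + 2^-2i), so we undo the product at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0      # rotate toward z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y / gain, x / gain            # (sin, cos)
```

With 32 iterations the result agrees with `math.sin`/`math.cos` to roughly 2^-31, which is why CORDIC variants remain a common choice when a full multiplier is expensive.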


  • Published in

    ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
    June 2019
    849 pages
    ISBN:9781450366694
    DOI:10.1145/3307650

    Copyright © 2019 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States




    Acceptance Rates

    ISCA '19 paper acceptance rate: 62 of 365 submissions (17%). Overall acceptance rate: 543 of 3,203 submissions (17%).
