ABSTRACT
Duality Cache is an in-cache computation architecture that enables general-purpose data-parallel applications to run on caches. This paper presents a holistic approach to building the Duality Cache system stack, with techniques for performing in-cache floating-point arithmetic and transcendental functions, enabling a data-parallel execution model, designing a compiler that accepts existing CUDA programs, and providing the flexibility to adapt to various workload characteristics.
Exposing the massive parallelism available in the Duality Cache architecture improves the performance of GPU benchmarks by 3.6× and OpenACC benchmarks by 4.0× over a server-class GPU. Repurposing existing caches provides 72.6× better performance for CPUs at an area cost of only 3.5%. Duality Cache reduces energy consumption by 5.8× relative to GPUs and 21× relative to CPUs.