ABSTRACT
Recent advances in memory system design, together with 3D stacking, have made near-data processing (NDP) a more feasible way to accelerate a range of workloads. In this work, we explore the NDP opportunity for a fundamental operation: linked-list traversal (LLT). We propose a new NDP architecture that neither changes the existing sequential programming model nor requires any modification to the core microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We assume a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected by a memory network. Our initial evaluation shows that simply off-loading LLT computation to near-memory logic can actually reduce performance because of the additional off-chip memory network channel traversals. We therefore first propose NDP-aware data localization, which exploits packaging locality (locality within a single memory module and within a single memory vault) to minimize latency and improve energy efficiency. To improve overall throughput and maximize parallelism, we then propose batching multiple LLT operations together, which amortizes the cost of NDP by using the highly parallel execution of the NDP processing units and the high bandwidth of 3D-stacked DRAM. Our evaluation shows that the combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency.
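As a rough illustration of the two ideas named in the abstract, the following Python sketch models a linked list whose nodes live in different memory modules. It counts how often a traversal moves from one module to another, and it groups a batch of lookups by the module that holds each list head. This is a toy model under stated assumptions, not the architecture proposed in the paper: the `Node` layout, the `module` field, and the functions `traverse` and `batch_by_module` are all hypothetical names introduced here, and the sketch makes no claim about latency or energy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    value: str
    next: Optional[int]  # index of the next node in `memory`, None at the tail
    module: int          # hypothetical id of the memory module holding the node


def traverse(memory, head, key):
    """Walk the list from `head`; return (value or None, module crossings)."""
    crossings = 0
    idx = head
    while idx is not None:
        node = memory[idx]
        if node.key == key:
            return node.value, crossings
        nxt = node.next
        # Count a crossing whenever the next node sits in another module.
        if nxt is not None and memory[nxt].module != node.module:
            crossings += 1
        idx = nxt
    return None, crossings


def batch_by_module(memory, requests):
    """Group (head, key) lookups by the module that holds each list head."""
    batches = defaultdict(list)
    for head, key in requests:
        batches[memory[head].module].append((head, key))
    return dict(batches)


# Two three-node lists: one scattered across modules 0 and 1 (head index 0),
# one localized entirely in module 2 (head index 3). Values are made up.
memory = [
    Node(10, "a", 1, 0),
    Node(20, "b", 2, 1),
    Node(30, "c", None, 0),
    Node(10, "x", 4, 2),
    Node(20, "y", 5, 2),
    Node(30, "z", None, 2),
]

scattered = traverse(memory, 0, 30)
localized = traverse(memory, 3, 30)
batches = batch_by_module(memory, [(0, 30), (3, 10), (3, 30)])
```

In this toy setup the scattered list crosses between modules on every step, while the localized list never leaves its module, which is the property that data localization aims for. Grouping lookups by module is one simple way to picture how a batch of traversals could be handed to per-module processing units.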
Index Terms
- Accelerating Linked-list Traversal Through Near-Data Processing