Accelerating Linked-list Traversal Through Near-Data Processing

Authors:
Byungchul Hong (Korea Advanced Institute of Science and Technology, Daejeon, South Korea)
Gwangsun Kim (Korea Advanced Institute of Science and Technology, Daejeon, South Korea)
Jung Ho Ahn (Seoul National University, Seoul, South Korea)
Yongkee Kwon (SK Hynix, Icheon, South Korea)
Hongsik Kim (SK Hynix, Icheon, South Korea)
John Kim (Korea Advanced Institute of Science and Technology, Daejeon, South Korea)

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, September 2016, Pages 113–124
https://doi.org/10.1145/2967938.2967958

Published: 11 September 2016

ABSTRACT

Recent advances in memory system design, together with 3D stacking, have made near-data processing (NDP) a more feasible way to accelerate a range of workloads. In this work, we explore the NDP opportunity for a fundamental operation: linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the core microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We assume a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected by a memory network. Our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional traversal of off-chip memory network channels. Thus, we first propose NDP-aware data localization, which exploits packaging locality (including locality within a single memory module and within a memory vault) to minimize latency and improve energy efficiency. To improve overall throughput and maximize parallelism, we then propose batching multiple LLT operations together, amortizing the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D-stacked DRAM. Our evaluation shows that the combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency.
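To make the two ideas in the abstract concrete, the following is a minimal software sketch, not the architecture proposed in the paper. It models LLT as pointer chasing over hash-table chains, assumes each chain is localized to one memory module, and counts how many off-load packets a batch of lookups would need if requests to the same module were grouped. The names `NUM_BUCKETS`, `NUM_MODULES`, `bucket_of`, `module_of`, `build_table`, `traverse`, and `batched_lookup`, as well as the modulo placement policy, are assumptions made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 8   # hypothetical hash-table size
NUM_MODULES = 4   # hypothetical number of memory modules


@dataclass
class Node:
    key: int
    value: str
    next: Optional["Node"] = None


def bucket_of(key: int) -> int:
    return key % NUM_BUCKETS


def module_of(bucket: int) -> int:
    # Assumed placement: every node of a bucket's chain lives in one module,
    # so a traversal never crosses the memory network in the middle of a list.
    return bucket % NUM_MODULES


def build_table(keys):
    """Insert each key at the head of its bucket's chain."""
    buckets = [None] * NUM_BUCKETS
    for k in keys:
        b = bucket_of(k)
        buckets[b] = Node(k, f"v{k}", buckets[b])
    return buckets


def traverse(head, key):
    """Linked-list traversal: follow next pointers until the key matches.

    Returns (value or None, number of nodes visited). Each visited node is
    a dependent memory access, which is what makes LLT latency-bound.
    """
    hops = 0
    node = head
    while node is not None:
        hops += 1
        if node.key == key:
            return node.value, hops
        node = node.next
    return None, hops


def batched_lookup(buckets, keys):
    """Group lookups by target module and count one off-load packet per group."""
    by_module = defaultdict(list)
    for k in keys:
        by_module[module_of(bucket_of(k))].append(k)

    results = {}
    packets = 0
    for group in by_module.values():
        packets += 1  # one request packet carries the whole group
        for k in group:
            results[k], _ = traverse(buckets[bucket_of(k)], k)
    return results, packets
```

With a table built from `build_table(range(16))`, the lookups `[1, 5, 9, 13]` all map to the same module under this placement and therefore share a single packet, whereas `[0, 1, 2, 3]` spread across four modules and need four. The packet count here is only a toy proxy for memory network traffic; it says nothing about the latency or energy results reported in the paper.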


Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
      2. Special purpose systems


Published in
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938
General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)
Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
big-memory workload
linked-list traversal
near-data processing
processing-in-memory
Qualifiers
- research-article
Acceptance Rates
PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%
Overall Acceptance Rate: 121 of 471 submissions, 26%