research-article

Accelerating Linked-list Traversal Through Near-Data Processing

Published: 11 September 2016

ABSTRACT

Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore the near-data processing opportunity of a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture which does not change the existing sequential programming model and does not require any modification to the core microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We assume a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversal. Thus, we first propose NDP-aware data localization to exploit packaging locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. Our evaluation shows that the combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency.
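The batching idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the names `Node`, `build_list`, `traverse`, `batch_by_module`, and `offload` are hypothetical, each list is assumed to reside entirely in one memory module (a stand-in for data localization), and the modulo placement and the "one off-load packet per module per batch" cost model are illustrative choices that the abstract does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    value: str
    next: Optional["Node"] = None


def build_list(pairs):
    """Build a singly linked list from (key, value) pairs, preserving order."""
    head = None
    for key, value in reversed(pairs):
        head = Node(key, value, head)
    return head


def traverse(head, key):
    """Sequential linked-list traversal: follow next pointers until the key matches."""
    node = head
    while node is not None:
        if node.key == key:
            return node.value
        node = node.next
    return None


def batch_by_module(requests, module_of):
    """Group (list_id, key) requests by the module assumed to hold each list."""
    batches = defaultdict(list)
    for list_id, key in requests:
        batches[module_of(list_id)].append((list_id, key))
    return batches


def offload(lists, requests, num_modules=4):
    """Model batched off-loading: one packet per module, many traversals per packet."""
    def module_of(list_id):
        # Assumed placement policy: each list lives wholly in one module.
        return list_id % num_modules

    results = {}
    packets = 0
    for _module, batch in batch_by_module(requests, module_of).items():
        packets += 1  # assumed cost model: one off-load packet per module per batch
        for list_id, key in batch:
            results[(list_id, key)] = traverse(lists[list_id], key)
    return results, packets
```

In this model, four lookups that touch lists placed in two modules cost two packets instead of four, which is the amortization effect the abstract attributes to batching. Real hardware would also run the traversals within a batch in parallel, which this sequential sketch does not capture.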


      • Published in

        PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
        September 2016
        474 pages
        ISBN: 9781450341219
        DOI: 10.1145/2967938

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Acceptance Rates

        PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%
        Overall Acceptance Rate: 121 of 471 submissions, 26%
