DOI: 10.1145/2806777.2806836
research-article

MemcachedGPU: scaling-up scale-out key-value stores

Published: 27 August 2015

ABSTRACT

This paper tackles the challenges of obtaining more efficient data center computing while maintaining low latency, low cost, programmability, and the potential for workload consolidation. We introduce GNoM, a software framework enabling energy-efficient, latency- and bandwidth-optimized UDP network and application processing on GPUs. GNoM handles the data movement and task management to facilitate the development of high-throughput UDP network services on GPUs. We use GNoM to develop MemcachedGPU, an accelerated key-value store, and evaluate the full system on contemporary hardware.

MemcachedGPU achieves ~10 GbE line-rate processing of ~13 million requests per second (MRPS) while delivering an efficiency of 62 thousand RPS per Watt (KRPS/W) on a high-performance GPU and 84.8 KRPS/W on a low-power GPU. This closely matches the throughput of an optimized FPGA implementation while providing up to 79% of its energy efficiency on the low-power GPU. Additionally, the low-power GPU can potentially improve cost efficiency (KRPS/$) by up to 17% over a state-of-the-art CPU implementation. At 8 MRPS, MemcachedGPU achieves a 95th-percentile RTT latency under 300 μs on both GPUs. An offline limit study on the low-power GPU suggests that MemcachedGPU may continue scaling throughput and energy efficiency up to 28.5 MRPS and 127 KRPS/W, respectively.
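
The throughput and efficiency figures above can be related with simple arithmetic: dividing requests per second by requests per second per Watt gives an implied power draw. The sketch below is only a back-of-envelope check. It assumes that each efficiency figure was measured at the throughput quoted next to it, which the abstract does not state explicitly. The function name and labels are illustrative and do not come from the paper.

```python
# Back-of-envelope check of the figures quoted in the abstract.
# Assumption (not stated in the abstract): each efficiency figure was
# measured at the throughput quoted alongside it.


def implied_power_watts(throughput_mrps: float, efficiency_krps_per_watt: float) -> float:
    """Return the power draw (W) implied by a throughput and an efficiency.

    Throughput is in millions of requests/s (MRPS) and efficiency in
    thousands of requests/s per Watt (KRPS/W), so
    power = (MRPS * 1000) / (KRPS/W).
    """
    if efficiency_krps_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_mrps * 1000.0 / efficiency_krps_per_watt


# (throughput in MRPS, efficiency in KRPS/W), taken from the abstract.
# The dictionary keys are descriptive labels chosen for this sketch.
QUOTED = {
    "high-performance GPU": (13.0, 62.0),
    "low-power GPU": (13.0, 84.8),
    "low-power GPU, offline limit study": (28.5, 127.0),
}

if __name__ == "__main__":
    for label, (mrps, krps_per_watt) in QUOTED.items():
        watts = implied_power_watts(mrps, krps_per_watt)
        print(f"{label}: roughly {watts:.0f} W implied")
```

Under that assumption the quoted numbers imply a draw on the order of 150 to 225 W. Because the abstract gives approximate throughput (~13 MRPS), these values are rough estimates and not measurements reported by the authors.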

      • Published in

        SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing
        August 2015
        446 pages
        ISBN:9781450336512
        DOI:10.1145/2806777

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SoCC '15 Paper Acceptance Rate: 34 of 157 submissions, 22%. Overall Acceptance Rate: 169 of 722 submissions, 23%.