MemcachedGPU: scaling-up scale-out key-value stores

Authors:
Tayler H. Hetherington, The University of British Columbia
Mike O'Connor, NVIDIA & UT-Austin
Tor M. Aamodt, The University of British Columbia

SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing, August 2015, Pages 43–57. https://doi.org/10.1145/2806777.2806836

Published: 27 August 2015

ABSTRACT

This paper tackles the challenges of obtaining more efficient data center computing while maintaining low latency, low cost, programmability, and the potential for workload consolidation. We introduce GNoM, a software framework enabling energy-efficient, latency- and bandwidth-optimized UDP network and application processing on GPUs. GNoM handles the data movement and task management to facilitate the development of high-throughput UDP network services on GPUs. We use GNoM to develop MemcachedGPU, an accelerated key-value store, and evaluate the full system on contemporary hardware.

MemcachedGPU achieves ~10 GbE line-rate processing of ~13 million requests per second (MRPS) while delivering an efficiency of 62 thousand RPS per Watt (KRPS/W) on a high-performance GPU and 84.8 KRPS/W on a low-power GPU. This closely matches the throughput of an optimized FPGA implementation while providing up to 79% of the energy efficiency on the low-power GPU. Additionally, the low-power GPU can potentially improve cost efficiency (KRPS/$) by up to 17% over a state-of-the-art CPU implementation. At 8 MRPS, MemcachedGPU achieves a 95th-percentile round-trip time (RTT) latency under 300 μs on both GPUs. An offline limit study on the low-power GPU suggests that MemcachedGPU may continue scaling throughput and energy efficiency up to 28.5 MRPS and 127 KRPS/W, respectively.
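The headline figures can be cross-checked with simple arithmetic. The sketch below is a rough, illustrative calculation only: the constants are taken from the abstract, while the function names and the derived quantities (wire bytes per request, implied power, implied FPGA efficiency) are assumptions made here and are not values reported by the authors. In particular, it assumes each efficiency figure was measured at roughly the ~13 MRPS line-rate throughput.

```python
# Back-of-the-envelope checks on the figures quoted in the abstract.
# Derived values are illustrative estimates, not results from the paper.

LINK_GBPS = 10.0              # 10 GbE link rate (abstract)
THROUGHPUT_MRPS = 13.0        # ~13 million requests per second (abstract)
EFF_HIGH_PERF_KRPS_W = 62.0   # high-performance GPU efficiency (abstract)
EFF_LOW_POWER_KRPS_W = 84.8   # low-power GPU efficiency (abstract)
FPGA_FRACTION = 0.79          # low-power GPU reaches "up to 79%" of the FPGA


def wire_bytes_per_request(link_gbps: float, mrps: float) -> float:
    """Average bytes of link capacity available per request at line rate."""
    bits_per_second = link_gbps * 1e9
    requests_per_second = mrps * 1e6
    return bits_per_second / requests_per_second / 8.0


def implied_power_watts(mrps: float, krps_per_watt: float) -> float:
    """Power implied by a throughput and an efficiency figure.

    Assumption: the efficiency was measured at this throughput.
    """
    return (mrps * 1e6) / (krps_per_watt * 1e3)


def implied_reference_efficiency(krps_per_watt: float, fraction: float) -> float:
    """Efficiency of the reference design if ours is `fraction` of it."""
    return krps_per_watt / fraction


if __name__ == "__main__":
    print(f"bytes/request at line rate: "
          f"{wire_bytes_per_request(LINK_GBPS, THROUGHPUT_MRPS):.1f}")
    print(f"implied power, high-performance GPU: "
          f"{implied_power_watts(THROUGHPUT_MRPS, EFF_HIGH_PERF_KRPS_W):.0f} W")
    print(f"implied power, low-power GPU: "
          f"{implied_power_watts(THROUGHPUT_MRPS, EFF_LOW_POWER_KRPS_W):.0f} W")
    print(f"implied FPGA efficiency: "
          f"{implied_reference_efficiency(EFF_LOW_POWER_KRPS_W, FPGA_FRACTION):.1f} KRPS/W")
```

Under these assumptions, ~13 MRPS on a 10 Gb/s link leaves roughly 96 bytes of link capacity per request, which is consistent with small requests saturating the link. The implied power values are only as good as the assumption that efficiency and peak throughput were measured together.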

Index Terms

MemcachedGPU: scaling-up scale-out key-value stores
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
2. Networks
  1. Network protocols

Published in
SoCC '15: Proceedings of the Sixth ACM Symposium on Cloud Computing
August 2015
446 pages
ISBN: 9781450336512
DOI: 10.1145/2806777
General Chair: Shahram Ghandeharizadeh (University of Southern California)
Program Chairs: Magdalena Balazinska (University of Washington), Michael J. Freedman (Princeton University)
Publications Chair: Sumita Barahmand (Microsoft)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
data center
key-value store
Qualifiers
- research-article
Acceptance Rates
SoCC '15 Paper Acceptance Rate: 34 of 157 submissions, 22%
Overall Acceptance Rate: 169 of 722 submissions, 23%