ABSTRACT
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increasing attention in modern datacenters for their low power, high performance, and energy efficiency. As evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators.
In this paper we design and implement Blaze to provide programming and runtime support that enables easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs that allow big data processing applications to easily utilize those accelerators. The Blaze runtime implements an FaaS framework that efficiently shares FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks across the cluster. Experimental results on four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators from systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
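To make the accelerator-as-a-service idea concrete, the following is a minimal illustrative sketch (not Blaze's actual API; the class and function names here are hypothetical) of the core sharing pattern the abstract describes: a broker multiplexes one FPGA accelerator among multiple application threads on a node, falling back to a software implementation when the accelerator is busy so that tasks always make progress.

```python
import threading

class FaaSBroker:
    """Hypothetical sketch of an accelerator-as-a-service broker.

    One physical accelerator is shared among many client threads;
    when it is occupied, requests transparently run on the CPU.
    """

    def __init__(self, accel_fn, cpu_fn):
        self.accel_fn = accel_fn             # stand-in for an FPGA kernel invocation
        self.cpu_fn = cpu_fn                 # functionally equivalent software path
        self._accel_lock = threading.Lock()  # one device, serialized access

    def invoke(self, data):
        # Try to claim the accelerator without blocking; if another
        # thread holds it, run the CPU fallback instead of queueing.
        if self._accel_lock.acquire(blocking=False):
            try:
                return self.accel_fn(data)
            finally:
                self._accel_lock.release()
        return self.cpu_fn(data)

# Toy usage: both paths compute the same result, so callers are
# oblivious to which one actually served the request.
broker = FaaSBroker(accel_fn=lambda xs: [x * x for x in xs],
                    cpu_fn=lambda xs: [x * x for x in xs])
print(broker.invoke([1, 2, 3]))  # [1, 4, 9]
```

The key design point this sketch captures is transparency: application code calls one logical operation, and the runtime decides placement, which is what lets a framework like Spark use accelerators without exposing device management to the programmer.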
Index Terms
- Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale