ABSTRACT
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increasing attention in modern datacenters for their low power, high performance, and energy efficiency. As evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators.
In this paper we design and implement Blaze to provide programming and runtime support that enables easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs that allow big data processing applications to easily utilize those accelerators. The Blaze runtime implements an FaaS framework that efficiently shares FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks across the cluster. Experimental results on four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators from systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
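To make the accelerator-as-a-service idea concrete, the following is a minimal illustrative sketch (not Blaze's actual API; the class and function names here are hypothetical) of the core sharing pattern the abstract describes: a broker multiplexes one FPGA accelerator among multiple application threads on a node, falling back to a software implementation when the accelerator is busy so that tasks always make progress.

```python
import threading

class FaaSBroker:
    """Hypothetical sketch of an accelerator-as-a-service broker.

    One physical accelerator is shared among many client threads;
    when it is occupied, requests transparently run on the CPU.
    """

    def __init__(self, accel_fn, cpu_fn):
        self.accel_fn = accel_fn             # stand-in for an FPGA kernel invocation
        self.cpu_fn = cpu_fn                 # functionally equivalent software path
        self._accel_lock = threading.Lock()  # one device, serialized access

    def invoke(self, data):
        # Try to claim the accelerator without blocking; if another
        # thread holds it, run the CPU fallback instead of queueing.
        if self._accel_lock.acquire(blocking=False):
            try:
                return self.accel_fn(data)
            finally:
                self._accel_lock.release()
        return self.cpu_fn(data)

# Toy usage: both paths compute the same result, so callers are
# oblivious to which one actually served the request.
broker = FaaSBroker(accel_fn=lambda xs: [x * x for x in xs],
                    cpu_fn=lambda xs: [x * x for x in xs])
print(broker.invoke([1, 2, 3]))  # [1, 4, 9]
```

The key design point this sketch captures is transparency: application code calls one logical operation, and the runtime decides placement, which is what lets a framework like Spark use accelerators without exposing device management to the programmer.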
Index Terms
- Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale