DOI: 10.1145/2987550.2987569
research-article

Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Published: 05 October 2016

ABSTRACT

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters for their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators.

In this paper we design and implement Blaze to provide programming and runtime support that enables easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs that let big data processing applications easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results on four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators from systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
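The core FaaS idea in the abstract---applications call a clean API, the runtime dispatches to a shared FPGA accelerator when one is available, and falls back to the original software path otherwise---can be sketched as follows. This is an illustrative sketch only, not Blaze's actual API: every name here (`FaaSRegistry`, `accelerated`, `"LogisticGradient"`) is hypothetical and stands in for Blaze's real Scala/C++ interfaces.

```python
class FaaSRegistry:
    """Hypothetical node-local registry mapping accelerator IDs to
    implementations, standing in for a node accelerator manager."""

    def __init__(self):
        self._impls = {}

    def register(self, acc_id, impl):
        """Make an 'accelerated' implementation available under acc_id."""
        self._impls[acc_id] = impl

    def get(self, acc_id):
        """Return the registered implementation, or None if absent."""
        return self._impls.get(acc_id)


def accelerated(acc_id, registry):
    """Decorator: at call time, dispatch to the registered accelerator
    if one exists; otherwise run the original CPU code path."""
    def wrap(cpu_fn):
        def run(data):
            impl = registry.get(acc_id)
            return impl(data) if impl is not None else cpu_fn(data)
        return run
    return wrap


registry = FaaSRegistry()

@accelerated("LogisticGradient", registry)
def gradient(batch):
    # CPU fallback path (placeholder arithmetic, not a real gradient).
    return [2 * x for x in batch]

# No accelerator registered yet: runs the CPU fallback.
print(gradient([1, 2, 3]))  # → [2, 4, 6]

# Simulate an FPGA accelerator coming online; later calls use it.
registry.register("LogisticGradient", lambda b: [2 * x for x in b])
print(gradient([1, 2, 3]))  # → [2, 4, 6]
```

The key property this sketch tries to capture is that the application code is written once against the software path, and accelerator substitution is a runtime decision, which is what lets a scheduler share scarce FPGAs across tasks without application changes.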


Published in

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016, 534 pages
ISBN: 978-1-4503-4525-5
DOI: 10.1145/2987550
Copyright © 2016 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article (refereed limited)

Acceptance Rates

SoCC '16 paper acceptance rate: 38 of 151 submissions (25%). Overall acceptance rate: 169 of 722 submissions (23%).
