Abstract
Process technology-driven improvements in performance and energy efficiency have slowed as we approach physical design limits. General-purpose manycore architectures attempt to circumvent this challenge, but they still show a significant performance and energy-efficiency gap compared to special-purpose solutions. Domain-specific architectures (DSAs), an instance of heterogeneous architectures, combine general-purpose cores and specialized hardware accelerators to boost energy efficiency while retaining programming flexibility. Indeed, the hardware, software, and systems aspects of DSAs are highly tailored to maximize the energy efficiency of applications in a target domain. As DSAs and their conceptualization advance rapidly, there is a strong need to identify the research problems that require immediate attention. This article discusses the primary research directions in the design and runtime management of DSAs. It then surveys some promising approaches and highlights the outstanding research needs.
- [178] . 2022. A novel systolic array processor with dynamic dataflows. Integration 85 (2022), 42–47.Google ScholarDigital Library
- [179] . 2014. Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi\(^{TM}\). Springer, 167–188.Google Scholar
- [180] . 2013. Implications of the power wall: Dim cores and reconfigurable logic. IEEE Micro 33, 5 (2013), 40–48.Google ScholarDigital Library
- [181] . 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning. arXiv preprint arXiv:1907.10701 (2019).Google Scholar
- [182] . 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference. 1–6.Google ScholarDigital Library
- [183] . 2018. Machine learning for healthcare: On the verge of a major shift in healthcare epidemiology. Clinical Infectious Diseases 66, 1 (2018), 149–153.Google ScholarCross Ref
- [184] . 2021. HiMap: Fast and scalable high-quality mapping on CGRA via hierarchical abstraction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 10 (2021), 3290–3303.Google Scholar
- [185] . 2011. Distributed thermal management for embedded heterogeneous MPSoCs with dedicated hardware accelerators. In Proceedings of the IEEE 29th International Conference on Computer Design (ICCD’11). 183–189.Google ScholarDigital Library
- [186] . 2015. Soft and hard reliability-aware scheduling for multicore embedded systems with energy harvesting. IEEE Transactions on Multi-Scale Computing Systems 1, 4 (2015), 220–235.Google ScholarCross Ref
- [187] . 2019. Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 6 (2019), 1416–1427.Google ScholarDigital Library
- [188] . 2021. Plasticity-on-chip design: Exploiting self-similarity for data communications. IEEE Transactions on Computers 70, 6 (2021), 950–962.Google ScholarCross Ref
- [189] . 2020. Accelerating deep neural network computation on a low power reconfigurable architecture. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’20). 1–5.Google ScholarCross Ref
- [190] . 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the International Symposium on Field-Programmable Gate Arrays. 161–170.Google ScholarDigital Library
- [191] . 2018. Graphlt: A high-performance graph DSL. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), Article 121, 30 pages.Google ScholarDigital Library
- [192] . 2020. Towards higher performance and robust compilation for CGRA modulo scheduling. IEEE Transactions on Parallel and Distributed Systems 31, 9 (2020), 2201–2219.Google ScholarCross Ref
- [193] . 2019. Security-critical energy-aware task scheduling for heterogeneous real-time MPSoCs in IoT. IEEE Transactions on Services Computing 13, 4 (2019), 745–758.Google ScholarCross Ref
- [194] . 2022. DRHEFT: Deadline-constrained reliability-aware HEFT algorithm for real-time heterogeneous MPSoC systems. IEEE Transactions on Reliability 71, 1 (2022), 178–189.Google Scholar