Abstract
The diversity of today's mobile applications requires embedded processor cores with high resource efficiency; that is, the devices should provide high performance at low area requirements and low power consumption. The fine-grained parallelism supported by the multiple functional units of VLIW architectures offers high throughput at reasonably low clock frequencies compared to single-core RISC processors. To utilize the processor pipeline efficiently, common system architectures have to cope with data hazards caused by data dependencies between consecutive operations. On the one hand, such hazards can be resolved by complex forwarding circuits (i.e., a pipeline bypass) that forward intermediate results to a subsequent instruction. On the other hand, the pipeline bypass can strongly affect or even dominate the total resource requirements and degrade the maximum clock frequency. In this work, the CoreVA VLIW architecture is used to develop and analyze application-specific bypass configurations. It is shown that many paths of a comprehensive bypass system are rarely used and may not be required for certain applications. For this reason, several strategies have been implemented to enhance the efficiency of the total system by introducing application-specific bypass configurations. The configuration can be carried out statically, by implementing only the required paths, or at runtime, by dynamically reconfiguring the hardware. An algorithm is proposed that derives an optimized configuration by iteratively disabling single bypass paths. Adopting these application-specific bypass configurations reduces the critical path by 26%. As a result, execution time and energy requirements are reduced by up to 21.5%. Combining Dynamic Frequency Scaling (DFS) with dynamic deactivation and reactivation of bypass paths allows the bypass system to be reconfigured at runtime. This ensures the highest efficiency while processing varying applications.
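The abstract only names the idea of iteratively disabling single bypass paths, so the following is a minimal sketch of one way such a loop could look, not the authors' algorithm. It assumes a greedy strategy that, in each round, disables the one path whose removal lowers the estimated execution time (cycle count times clock period) the most, and stops when no removal helps. The path names, the stall counts, and the linear timing model are all hypothetical stand-ins for what would, in practice, come from simulation and synthesis.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Config = FrozenSet[str]

def prune_bypass(
    all_paths: Iterable[str],
    cycles: Callable[[Config], int],
    period_ps: Callable[[Config], int],
) -> Tuple[Config, int]:
    """Greedily disable one bypass path per round while execution time improves.

    Execution time is modeled as cycles(config) * period_ps(config).
    This greedy strategy is an assumption, not the published algorithm.
    """
    config = frozenset(all_paths)
    best = cycles(config) * period_ps(config)
    while config:
        # Evaluate every configuration that has exactly one more path disabled.
        candidates = []
        for path in sorted(config):
            trial = config - {path}
            candidates.append((cycles(trial) * period_ps(trial), path))
        trial_time, trial_path = min(candidates)
        if trial_time >= best:
            break  # no single removal improves execution time
        config = config - {trial_path}
        best = trial_time
    return config, best

# Hypothetical toy model: extra stall cycles incurred when a path is disabled.
PATHS = ["EX->EX", "ME->EX", "WB->EX"]
STALLS = {"EX->EX": 400, "ME->EX": 30, "WB->EX": 5}

def toy_cycles(config: Config) -> int:
    # Base cycle count plus stalls for every disabled path.
    return 1000 + sum(s for p, s in STALLS.items() if p not in config)

def toy_period_ps(config: Config) -> int:
    # Assumed: each enabled path lengthens the critical path by 250 ps.
    return 2000 + 250 * len(config)

if __name__ == "__main__":
    kept, time_ps = prune_bypass(PATHS, toy_cycles, toy_period_ps)
    print(sorted(kept), time_ps)  # ['EX->EX'] 2328750
```

In this toy model, the rarely used paths are dropped because the shorter clock period outweighs the few extra stall cycles, while the heavily used path is kept because disabling it would cost more cycles than the faster clock recovers.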
Index Terms
- A systematic approach for optimized bypass configurations for application-specific embedded processors