A systematic approach for optimized bypass configurations for application-specific embedded processors

Authors:
Thorsten Jungeblut (Bielefeld University, Germany)
Boris Hübener (Bielefeld University, Germany)
Mario Porrmann (University of Paderborn, Germany)
Ulrich Rückert (Bielefeld University, Germany)

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 2, Article No. 18, pp. 1–25. https://doi.org/10.1145/2514641.2514645

Published: 30 September 2013

Abstract

The diversity of today's mobile applications requires embedded processor cores with high resource efficiency; that is, the devices should deliver high performance at low area requirements and low power consumption. The fine-grained parallelism provided by the multiple functional units of VLIW architectures offers high throughput at reasonably low clock frequencies compared to single-core RISC processors. To utilize the processor pipeline efficiently, common system architectures have to cope with data hazards caused by data dependencies between consecutive operations. On the one hand, such hazards can be resolved by complex forwarding circuits (i.e., a pipeline bypass) that forward intermediate results to a subsequent instruction. On the other hand, the pipeline bypass can strongly affect or even dominate the total resource requirements and degrade the maximum clock frequency. In this work, the CoreVA VLIW architecture is used to develop and analyze application-specific bypass configurations. It is shown that many paths of a comprehensive bypass system are rarely used and may not be required for certain applications. For this reason, several strategies have been implemented to enhance the efficiency of the total system by introducing application-specific bypass configurations. The configuration can be carried out statically, by implementing only the required paths, or at runtime, by dynamically reconfiguring the hardware. An algorithm is proposed that derives an optimized configuration by iteratively disabling single bypass paths. Adopting these application-specific bypass configurations reduces the critical path by 26%. As a result, execution time and energy requirements could be reduced by up to 21.5%. Combining Dynamic Frequency Scaling (DFS) with dynamic deactivation and reactivation of bypass paths allows the bypass system to be reconfigured at runtime. This ensures the highest efficiency while processing varying applications.
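
The idea of iteratively disabling single bypass paths can be illustrated with a minimal sketch. The abstract does not state the selection criterion or the cost model, so everything below is an assumption made for illustration: the `BypassPath` fields, the linear clock-period model (each enabled path adds a fixed delay), and the greedy rule "remove the path whose removal lowers execution time the most, stop when no removal helps". In a real flow the clock period would come from timing analysis of the synthesized design and the cycle counts from simulation or profiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BypassPath:
    """Hypothetical description of one forwarding path (not from the paper)."""
    name: str
    uses: int          # how often the application forwards over this path
    stall_cycles: int  # extra cycles per use when the path is missing
    delay_ns: float    # assumed contribution of the path to the clock period


def execution_time_ns(base_cycles, base_period_ns, paths, enabled):
    """Toy cost model: execution time = cycle count * clock period."""
    # Every use of a disabled path costs stall cycles.
    cycles = base_cycles + sum(
        p.uses * p.stall_cycles for p in paths if p.name not in enabled
    )
    # Every enabled path lengthens the critical path (assumed linear model).
    period = base_period_ns + sum(p.delay_ns for p in paths if p.name in enabled)
    return cycles * period


def optimize_bypass(base_cycles, base_period_ns, paths):
    """Greedily disable one path per iteration while execution time improves."""
    enabled = {p.name for p in paths}
    best = execution_time_ns(base_cycles, base_period_ns, paths, enabled)
    while enabled:
        # Evaluate the effect of disabling each remaining path on its own.
        candidates = [
            (execution_time_ns(base_cycles, base_period_ns, paths, enabled - {n}), n)
            for n in sorted(enabled)
        ]
        time_ns, name = min(candidates)
        if time_ns >= best:
            break  # no single removal improves the result any further
        enabled.remove(name)
        best = time_ns
    return enabled, best


# Made-up example data: two heavily used paths and one rarely used path.
demo_paths = [
    BypassPath("ex_to_ex", uses=50_000, stall_cycles=1, delay_ns=0.2),
    BypassPath("mem_to_ex", uses=20_000, stall_cycles=1, delay_ns=0.3),
    BypassPath("wb_to_ex", uses=100, stall_cycles=1, delay_ns=0.4),
]
demo_enabled, demo_time = optimize_bypass(100_000, 2.0, demo_paths)
print(sorted(demo_enabled), demo_time)
```

With these invented numbers the rarely used path is dropped, because the shorter clock period outweighs the few stall cycles it introduces, while the two frequently used paths stay enabled. The sketch also shows why a shorter critical path does not translate one-to-one into shorter execution time: removed paths add stall cycles that offset part of the frequency gain.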

Index Terms

1. Hardware
  1. Integrated circuits
    1. Logic circuits

Published in

ACM Transactions on Embedded Computing Systems Volume 13, Issue 2
Special issue on application-specific processors
September 2013
254 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2514641

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 30 September 2013
- Accepted: 1 May 2012
- Revised: 1 February 2012
- Received: 1 February 2011
Author Tags
CoreVA
DFS
Pipeline bypass
VLIW
application specific
forwarding
multifrequency
optimized
pipeline
Qualifiers
- research-article
- Research
- Refereed