Dynamic coalescing for 16-bit instructions

Authors:
Arvind Krishnaswamy, The University of Arizona, Tucson, AZ
Rajiv Gupta, The University of Arizona, Tucson, AZ

ACM Transactions on Embedded Computing Systems, Volume 4, Issue 1, pp 3–37. https://doi.org/10.1145/1053271.1053273

Published: 01 February 2005

Abstract

In the embedded domain, memory usage and energy consumption are critical constraints. Embedded processors such as the ARM and MIPS provide a 16-bit instruction set (called Thumb in the case of the ARM family of processors) in addition to the 32-bit instruction set to address these concerns. Using 16-bit instructions, one can achieve code size reduction and instruction cache energy savings at the cost of performance. This paper presents a novel approach that enhances the performance of 16-bit Thumb code. We have observed that throughout Thumb code there exist Thumb instruction pairs that are equivalent to a single ARM instruction. We have developed enhancements to the processor microarchitecture and the Thumb instruction set to exploit this property. We enhance the Thumb instruction set by incorporating Augmenting eXtensions (AX). A Thumb instruction pair that can be combined into a single ARM instruction is replaced by an AXThumb instruction pair by the compiler. The AX instruction is coalesced with the immediately following Thumb instruction to generate a single ARM instruction at decode time. The enhanced microarchitecture ensures that coalescing does not introduce pipeline delays or increase cycle time, thereby resulting in a reduction of both instruction counts and cycle counts. Using AX instructions and coalescing hardware, we are also able to support efficient predicated execution in 16-bit mode.
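The coalescing step described in the abstract can be illustrated with a small software model. The sketch below is not the authors' hardware design: the `Instr` type, the `coalesce` function, the `setshift` mnemonic, and the way operands are merged are all assumptions made for illustration. It only shows the general idea that an AX instruction is fused with the Thumb instruction immediately after it, so that one ARM-like instruction reaches the rest of the pipeline instead of two.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Toy instruction record (hypothetical, for illustration only)."""
    op: str
    operands: Tuple[str, ...]
    is_ax: bool = False  # True for an augmenting (AX) instruction


def coalesce(stream: List[Instr]) -> List[Instr]:
    """Model of decode-time coalescing.

    Each AX instruction is merged with the instruction that immediately
    follows it, producing a single ARM-like instruction. Instructions
    that are not preceded by an AX instruction pass through unchanged.
    """
    decoded: List[Instr] = []
    i = 0
    while i < len(stream):
        cur = stream[i]
        if cur.is_ax:
            if i + 1 >= len(stream):
                raise ValueError("AX instruction has no following instruction")
            nxt = stream[i + 1]
            # Assumption: the AX operands simply extend the operand list
            # of the following instruction (e.g. adding a shift modifier).
            decoded.append(Instr(nxt.op, nxt.operands + cur.operands))
            i += 2  # both halves of the pair are consumed together
        else:
            decoded.append(cur)
            i += 1
    return decoded


# Hypothetical stream: an AX instruction carrying a shift amount,
# the Thumb add it augments, and an unrelated Thumb mov.
thumb_stream = [
    Instr("setshift", ("lsl #2",), is_ax=True),
    Instr("add", ("r0", "r1", "r2")),
    Instr("mov", ("r3", "r0")),
]
decoded = coalesce(thumb_stream)
for ins in decoded:
    print(ins.op, ", ".join(ins.operands))
```

In this model three fetched 16-bit instructions decode to two executed instructions, which mirrors the abstract's claim that coalescing reduces instruction counts. The pipeline timing argument (no added delay or cycle-time increase) is a property of the hardware and is not captured by this sketch.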

Index Terms

Dynamic coalescing for 16-bit instructions
1. Computer systems organization
  1. Architectures
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Reviews

Reviewer: Olivier Louis Marie Lecarme

It seems that several important processor designs have been done without consideration of the poor compiler, in charge of exploiting the capabilities of the processor to reach its peak performance. I'm thinking, for example, of the IA-64 Intel processor. This paper addresses this problem, discussing both hardware processor design and compiling techniques. Here, the processor is the ARM family. This is one of the most ubiquitous microprocessors, used in embedded products like the iPod, the Playstation, mobile phones, camcorders, pocket personal computers (PCs), and so on. If one remembers that "more than 98 percent of all microprocessors are used in embedded products," obviously it is extremely important to improve their performance as much as possible. The constraints, in comparison to processors used in computers, are mostly related to energy and memory savings. However, these savings should not be attained at the expense of speed.
The ARM family uses a 32-bit instruction set, but in order to save memory and energy, it also uses a 16-bit instruction set, properly named "Thumb." As the authors demonstrate, using Thumb code results in a code size reduction of about 30 percent, but also in a three-fold increase in the number of instructions to execute. Thus, the code is slower, and the energy savings are much lower than expected. In order to correct this, the authors have designed an enhancement to the Thumb instruction set called augmenting extensions (AX). These instructions are handled in the decode stage of the processor, and thus they don't use a cycle in the pipeline. Each one is coalesced with the following Thumb instruction, yielding an ARM instruction. This has the advantage of reducing the number of Thumb instructions to be generated by the compiler and executed by the processor (an ARM instruction does more work than a Thumb one). Thus, there are gains in speed, energy savings, and memory usage.
The bulk of the paper is devoted to explaining the needed modifications to the hardware, as well as the compiling techniques needed to generate the code. For example, in some cases, the instructions in the two branches of an if-then-else construct must be generated in pairs, one for the true part and one for the false part, which is uncommon. Despite a few typographical errors, the paper is well written and pleasant to read. The presented results are convincing. Whether the ideas will actually be used remains to be seen.

Published in

ACM Transactions on Embedded Computing Systems Volume 4, Issue 1
February 2005
254 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/1053271

Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Author Tags
16-bit Thumb ISA
32-bit ARM ISA
AX instructions
Embedded processor
code size
energy
instruction coalescing
performance
Qualifiers
- article