Abstract
In the embedded domain, memory usage and energy consumption are critical constraints. Embedded processors such as the ARM and MIPS provide a 16-bit instruction set (called Thumb in the case of the ARM family of processors) in addition to the 32-bit instruction set to address these concerns. Using 16-bit instructions, one can achieve code size reduction and instruction cache energy savings at the cost of performance. This paper presents a novel approach that enhances the performance of 16-bit Thumb code. We have observed that throughout Thumb code there exist Thumb instruction pairs that are equivalent to a single ARM instruction. We have developed enhancements to the processor microarchitecture and the Thumb instruction set to exploit this property. We enhance the Thumb instruction set by incorporating Augmenting eXtensions (AX). The compiler replaces a Thumb instruction pair that can be combined into a single ARM instruction with an AX+Thumb instruction pair. At decode time, the AX instruction is coalesced with the immediately following Thumb instruction to generate a single ARM instruction. The enhanced microarchitecture ensures that coalescing does not introduce pipeline delays or increase cycle time, thereby reducing both instruction counts and cycle counts. Using AX instructions and coalescing hardware, we are also able to support efficient predicated execution in 16-bit mode.
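The decode-time coalescing described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design or encoding: the `AX`, `Thumb`, and `ArmOp` classes, the choice of a shift modifier as the augmenting information, and the `decode` function are all hypothetical names introduced here for illustration. The model shows one idea only: an AX instruction is never executed on its own; it is merged with the Thumb instruction that immediately follows it, so the pair yields a single decoded operation.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Thumb:
    """Hypothetical 16-bit three-register ALU instruction."""
    op: str
    rd: str
    rn: str
    rm: str


@dataclass(frozen=True)
class AX:
    """Hypothetical augmenting instruction: carries a shift for the
    second operand of the next Thumb instruction (an assumption)."""
    shift_kind: str
    shift_amount: int


@dataclass(frozen=True)
class ArmOp:
    """Decoded operation in an ARM-like form with an optional operand shift."""
    op: str
    rd: str
    rn: str
    rm: str
    shift: Optional[str] = None


def decode(stream: List[Union[AX, Thumb]]) -> List[ArmOp]:
    """Decode a 16-bit stream, coalescing each AX with the next Thumb."""
    decoded: List[ArmOp] = []
    i = 0
    while i < len(stream):
        insn = stream[i]
        if isinstance(insn, AX):
            # An AX instruction only augments its immediate successor.
            if i + 1 >= len(stream) or not isinstance(stream[i + 1], Thumb):
                raise ValueError("AX must be followed by a Thumb instruction")
            nxt = stream[i + 1]
            shift = f"{insn.shift_kind} #{insn.shift_amount}"
            decoded.append(ArmOp(nxt.op, nxt.rd, nxt.rn, nxt.rm, shift))
            i += 2  # two 16-bit instructions consumed, one operation emitted
        else:
            decoded.append(ArmOp(insn.op, insn.rd, insn.rn, insn.rm))
            i += 1
    return decoded


# Three 16-bit instructions decode into two operations.
stream = [
    AX("lsl", 2),
    Thumb("add", "r0", "r1", "r2"),
    Thumb("sub", "r4", "r4", "r5"),
]
ops = decode(stream)
```

In this toy stream the AX and the `add` together stand for what plain Thumb code would express as a separate shift followed by an add, while the unaugmented `sub` passes through unchanged. The model says nothing about timing; the claim that coalescing adds no pipeline delay concerns the hardware and is not captured here.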
Index Terms
- Dynamic coalescing for 16-bit instructions