Dynamic coalescing for 16-bit instructions

Published: 01 February 2005
Abstract

In the embedded domain, memory usage and energy consumption are critical constraints. To address these concerns, embedded processors such as the ARM and MIPS provide a 16-bit instruction set (called Thumb in the ARM family of processors) in addition to the 32-bit instruction set. Using 16-bit instructions, one can achieve code size reduction and instruction cache energy savings at the cost of performance. This paper presents a novel approach that enhances the performance of 16-bit Thumb code. We have observed that throughout Thumb code there exist Thumb instruction pairs that are equivalent to a single ARM instruction. We have developed enhancements to the processor microarchitecture and the Thumb instruction set to exploit this property. We enhance the Thumb instruction set by incorporating Augmenting eXtensions (AX). The compiler replaces a Thumb instruction pair that can be combined into a single ARM instruction with an AXThumb instruction pair. At decode time, the AX instruction is coalesced with the immediately following Thumb instruction to generate a single ARM instruction. The enhanced microarchitecture ensures that coalescing does not introduce pipeline delays or increase cycle time, thereby reducing both instruction counts and cycle counts. Using AX instructions and coalescing hardware, we are also able to support efficient predicated execution in 16-bit mode.
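The coalescing step described above can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the `Instr` record, the `coalesce` function, the rule that two AX instructions may not appear back to back, and the `setshift` mnemonic are hypothetical choices made for this example and are not taken from the paper's encoding or hardware design. It shows only the counting effect: an AX instruction is folded into the Thumb instruction that immediately follows it, so the pair leaves the decoder as one ARM-like instruction.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instr:
    kind: str  # "thumb", "ax", or "arm" (labels assumed for this sketch)
    text: str  # human-readable form, illustrative only


def coalesce(stream: List[Instr]) -> List[Instr]:
    """Fold each AX instruction into the instruction that follows it.

    Toy model of decode-time coalescing: the AX instruction never
    appears in the output on its own, so it adds nothing to the
    count of instructions that reach the execute stage.
    """
    issued: List[Instr] = []
    pending: Optional[Instr] = None
    for ins in stream:
        if ins.kind == "ax":
            if pending is not None:
                # Assumption of this sketch: AX instructions do not chain.
                raise ValueError("two consecutive AX instructions")
            pending = ins
        elif pending is not None:
            # The AX payload augments the following Thumb instruction.
            issued.append(Instr("arm", f"{ins.text} [{pending.text}]"))
            pending = None
        else:
            issued.append(ins)
    if pending is not None:
        raise ValueError("AX instruction with no following Thumb instruction")
    return issued


# Hypothetical fetch stream: four 16-bit slots, one of them an AX prefix.
stream = [
    Instr("thumb", "mov r0, r1"),
    Instr("ax", "setshift lsl #2"),   # hypothetical AX mnemonic
    Instr("thumb", "add r2, r2, r3"),
    Instr("thumb", "cmp r2, r4"),
]
issued = coalesce(stream)
```

In this model the four fetched slots produce three issued instructions, which mirrors the abstract's claim that coalescing reduces instruction counts; cycle-time and pipeline effects are outside what this sketch can show.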

      Reviews

      Olivier Louis Marie Lecarme

      It seems that several important processor designs have been done without consideration for the poor compiler, which is in charge of exploiting the capabilities of the processor to reach its peak performance. I'm thinking, for example, of the Intel IA-64 processor. This paper addresses this problem, discussing both hardware processor design and compiling techniques. Here, the processor is the ARM family. This is one of the most ubiquitous microprocessors, used in embedded products like the iPod, the PlayStation, mobile phones, camcorders, pocket personal computers (PCs), and so on. If one remembers that "more than 98 percent of all microprocessors are used in embedded products," it is obviously extremely important to improve their performance as much as possible. The constraints, in comparison to processors used in computers, are mostly related to energy and memory savings. However, these savings should not be attained at the expense of speed. The ARM family uses a 32-bit instruction set, but in order to save memory and energy, it also uses a 16-bit instruction set, aptly named "Thumb." As the authors demonstrate, using Thumb code results in a code size reduction of about 30 percent, but also in a three-fold increase in the number of instructions to execute. Thus, the code is slower, and the energy savings are much lower than expected. In order to correct this, the authors have designed an enhancement to the Thumb instruction set called augmenting extensions (AX). These instructions are handled in the decode stage of the processor, and thus they don't use a cycle in the pipeline. Each one is coalesced with the following Thumb instruction, yielding an ARM instruction. This has the advantage of reducing the number of Thumb instructions to be generated by the compiler and executed by the processor (an ARM instruction does more work than a Thumb one). Thus, there are gains in speed, energy savings, and memory usage.

      The bulk of the paper is devoted to explaining the needed modifications to the hardware, as well as the compiling techniques needed to generate the code. For example, in some cases, the instructions in the two branches of an if-then-else construct must be generated in pairs, one for the true part and one for the false part, which is uncommon. Despite a few typographical errors, the paper is well written and pleasant to read. The presented results are convincing. Whether the ideas will actually be used remains to be seen.
