Abstract
Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in Intel's IA32 architecture, where the lack of registers results in an increased number of memory accesses. This paper presents a novel, non-speculative technique that partially hides the increasing load-to-use latency by allowing the early issue of load instructions. Early load address resolution relies on register tracking to safely compute the addresses of memory references in the front-end part of the processor pipeline. Register tracking enables decode-time computation of register values by tracking simple operations of the form reg±immediate. Register tracking may be performed in any pipeline stage following instruction decode and prior to execution.
Several tracking schemes are proposed in this paper:
Stack pointer tracking allows safe early resolution of stack references by keeping track of the value of the ESP register (the stack pointer). About 25% of all loads are stack loads and 95% of these loads may be resolved in the front-end.
Absolute address tracking allows the early resolution of constant-address loads.
Displacement-based tracking tackles all loads with addresses of the form reg±immediate by tracking the values of all general-purpose registers. This class corresponds to 82% of all loads, and about 65% of these loads can be safely resolved in the front-end pipeline.
The paper describes the tracking schemes, analyzes their performance potential in a deeply pipelined processor, and discusses the integration of tracking with memory disambiguation.
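The register-tracking idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism from the paper: the class `RegisterTracker`, its method names, and the `sync` step (which assumes the back-end can report a register's value so that tracking can start or resume) are all hypothetical. The model keeps a known-or-unknown value per register, updates it on reg±immediate operations, and marks it unknown on any other write. A load of the form [reg±immediate] resolves early only when its base register is currently known. Constant-address loads need no tracked register at all, so they are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

REGS = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp")
MASK = 0xFFFFFFFF  # model 32-bit wraparound arithmetic


@dataclass
class RegisterTracker:
    """Hypothetical decode-time tracker: a value per register, or None if unknown."""

    known: Dict[str, Optional[int]] = field(
        default_factory=lambda: {r: None for r in REGS}
    )

    def sync(self, reg: str, value: int) -> None:
        # Assumption: the back-end supplies a register value to (re)start tracking.
        self.known[reg] = value & MASK

    def add_imm(self, reg: str, imm: int) -> None:
        # reg = reg +/- imm is trackable: the result is known iff the source is known.
        if self.known[reg] is not None:
            self.known[reg] = (self.known[reg] + imm) & MASK

    def clobber(self, reg: str) -> None:
        # Any other write (load result, reg+reg add, ...) makes the value unknown.
        self.known[reg] = None

    def resolve(self, base: str, disp: int) -> Optional[int]:
        # Address of a load from [base + disp], or None if it cannot be resolved early.
        value = self.known[base]
        return None if value is None else (value + disp) & MASK

    def push(self) -> None:
        # PUSH decrements the stack pointer by 4 (32-bit operand size assumed).
        self.add_imm("esp", -4)

    def pop(self, dest: str) -> Optional[int]:
        # POP loads from [esp], then increments esp; the loaded value is not
        # known at decode time, so the destination register becomes unknown.
        addr = self.resolve("esp", 0)
        self.clobber(dest)
        self.add_imm("esp", 4)
        return addr
```

In this sketch, stack pointer tracking is the special case where only `esp` is tracked, while displacement-based tracking applies the same rules to every general-purpose register. A tracker that has never been synchronized returns `None` from `resolve`, which corresponds to a load that must wait for normal address generation.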
Early load address resolution via register tracking
ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture