Abstract
This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load immediate instruction, where the immediate is the predicted result that the instruction is likely to produce during its nth next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core makes it possible to compute the future results of the instructions that are not directly predictable, issue prefetches into the shared memory hierarchy, and thus reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPEC CPU2000 benchmark suite. Our mechanism improves program performance by 12%, on average, over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.
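The predictor the abstract refers to can be illustrated with a minimal sketch. The following Python model is hypothetical and purely illustrative (the paper's hardware predictor is not specified here): it shows how a simple stride-based value predictor, indexed by instruction PC, could forecast the result of an instruction's nth next dynamic execution — the value FE would substitute as a load-immediate in the prefetching thread.

```python
# Illustrative sketch, not the paper's actual hardware design: a stride value
# predictor that, once confident, predicts an instruction's nth next result.
class StrideValuePredictor:
    def __init__(self):
        # pc -> (last_value, stride, confidence)
        self.table = {}

    def update(self, pc, value):
        """Train on a committed result of the instruction at this PC."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = value - last
            # Raise confidence only while the stride stays stable.
            conf = conf + 1 if new_stride == stride else 0
            stride = new_stride
        self.table[pc] = (value, stride, conf)

    def predict(self, pc, n=1):
        """Predict the result of the nth next execution, or None if unsure."""
        entry = self.table.get(pc)
        if entry is None or entry[2] < 2:  # require repeated confirmation
            return None
        last, stride, _ = entry
        return last + n * stride
```

For example, after observing the committed results 100, 108, 116, 124 for one PC, the model predicts 132 for the next execution and 140 for the one after that; instructions whose results it cannot predict are left unchanged in the prefetching thread and are computed by the idle core instead.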