Abstract
This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load immediate instruction, where the immediate is the predicted result that the instruction is likely to produce during its nth next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core makes it possible to compute the future results of the instructions that are not directly predictable, issue prefetches into the shared memory hierarchy, and thus reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPEC CPU2000 benchmark suite. Our mechanism improves program performance by 12%, on average, over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.
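The predictor the abstract refers to can be illustrated with a minimal sketch. The following Python model is hypothetical and purely illustrative (the paper's hardware predictor is not specified here): it shows how a simple stride-based value predictor, indexed by instruction PC, could forecast the result of an instruction's nth next dynamic execution — the value FE would substitute as a load-immediate in the prefetching thread.

```python
# Illustrative sketch, not the paper's actual hardware design: a stride value
# predictor that, once confident, predicts an instruction's nth next result.
class StrideValuePredictor:
    def __init__(self):
        # pc -> (last_value, stride, confidence)
        self.table = {}

    def update(self, pc, value):
        """Train on a committed result of the instruction at this PC."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = value - last
            # Raise confidence only while the stride stays stable.
            conf = conf + 1 if new_stride == stride else 0
            stride = new_stride
        self.table[pc] = (value, stride, conf)

    def predict(self, pc, n=1):
        """Predict the result of the nth next execution, or None if unsure."""
        entry = self.table.get(pc)
        if entry is None or entry[2] < 2:  # require repeated confirmation
            return None
        last, stride, _ = entry
        return last + n * stride
```

For example, after observing the committed results 100, 108, 116, 124 for one PC, the model predicts 132 for the next execution and 140 for the one after that; instructions whose results it cannot predict are left unchanged in the prefetching thread and are computed by the idle core instead.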