
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads

Published: 01 December 2006

Abstract

This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that, on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load-immediate instruction, where the immediate is the result that the instruction is likely to produce during its nth next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core makes it possible to compute the future results of the instructions that are not directly predictable, to issue prefetches into the shared memory hierarchy, and thus to reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPEC CPU2000 benchmark suite. Our mechanism improves program performance by 12%, on average, over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.
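The transformation the abstract describes can be illustrated with a minimal software sketch. This is not the paper's hardware design: the per-PC stride predictor, the instruction-tuple format, and the prefetch distance `N_AHEAD` are simplifying assumptions chosen for illustration. The sketch shows the core idea: predictable instructions in the copied stream are replaced by load-immediates of their predicted nth-next results, while unpredictable instructions are forwarded unchanged so the idle core can compute them.

```python
N_AHEAD = 2  # assumed prefetch distance: predict the n-th next execution

class StridePredictor:
    """Per-PC last-value + stride predictor (a hypothetical simplification
    of the value predictor the abstract mentions)."""
    def __init__(self):
        self.table = {}  # pc -> (last_value, stride, confident)

    def observe_and_predict(self, pc, value):
        """Update the entry for this PC with the committed result and, if
        the stride has repeated, return the predicted n-th next result."""
        last, stride, confident = self.table.get(pc, (None, None, False))
        if last is not None:
            new_stride = value - last
            confident = (new_stride == stride)  # same stride twice in a row
            stride = new_stride
        self.table[pc] = (value, stride, confident)
        if confident:
            return value + N_AHEAD * stride
        return None

def make_prefetch_stream(committed, predictor):
    """Turn a committed, register-writing instruction stream into an FE
    prefetching stream: predictable instructions become load-immediates of
    the predicted future value; the rest are forwarded as-is."""
    stream = []
    for pc, op, dest, result in committed:
        pred = predictor.observe_and_predict(pc, result)
        if pred is not None:
            stream.append((pc, 'load_imm', dest, pred))  # replaced
        else:
            stream.append((pc, op, dest, None))          # forwarded
    return stream

# Usage: a loop instruction at pc 0x10 that strides through memory by 4.
committed = [(0x10, 'add', 'r1', v) for v in (0, 4, 8, 12)]
stream = make_prefetch_stream(committed, StridePredictor())
# Once the stride repeats, the instruction is replaced by a load-immediate
# of its predicted 2nd-next result (8 + 2*4 = 16, then 12 + 2*4 = 20).
```

Executed on the idle core, the `load_imm` entries seed registers with future values so that dependent, unpredictable address computations run ahead of the main thread and their loads act as prefetches.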



• Published in

  ACM Transactions on Architecture and Code Optimization, Volume 3, Issue 4
  December 2006
  169 pages
  ISSN: 1544-3566
  EISSN: 1544-3973
  DOI: 10.1145/1187976

      Copyright © 2006 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
