Effective cache prefetching on bus-based multiprocessors

Authors:
Dean M. Tullsen, University of Washington
Susan J. Eggers, University of Washington

ACM Transactions on Computer Systems, Volume 13, Issue 1, pp. 57–88
https://doi.org/10.1145/200912.201006

Published: 01 February 1995

Abstract

Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a shared-memory multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulate the effects of a compiler-directed prefetching algorithm, running on a range of bus-based multiprocessors. We show that, despite a high memory latency, this architecture does not necessarily support prefetching well, in some cases actually causing performance degradations. We pinpoint several problems with prefetching on a shared-memory architecture (additional conflict misses, no reduction in the data-sharing traffic and associated latencies, a multiprocessor's greater sensitivity to memory utilization and the sensitivity of the cache hit rate to prefetch distance) and measure their effect on performance. We then solve those problems through architectural techniques and heuristics for prefetching that could be easily incorporated into a compiler: (1) victim caching, which eliminates most of the cache conflict misses caused by prefetching in a direct-mapped cache, (2) special prefetch algorithms for shared data, which significantly improve the ability of our basic prefetching algorithm to prefetch individual misses, and (3) compiler-based shared-data restructuring, which eliminates many of the invalidation misses the basic prefetching algorithm does not predict. The combined effect of these improvements is to make prefetching effective over a much wider range of memory architectures.
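
As an illustration of the first remedy, victim caching, the following is a minimal sketch under stated assumptions. It is a toy model, not the authors' simulator. The class name `DirectMappedCache`, the helper `count_misses`, the eight-set geometry, and the block trace are all hypothetical, chosen only to show how a small fully associative victim cache can absorb conflict misses in a direct-mapped cache.

```python
from collections import OrderedDict


class DirectMappedCache:
    """Toy direct-mapped cache with an optional LRU victim cache (illustrative only)."""

    def __init__(self, num_sets, victim_entries=0):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # block address currently held by each set
        self.victim_entries = victim_entries
        self.victims = OrderedDict()        # recently evicted blocks, oldest first
        self.misses = 0                     # accesses that must go to memory

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            return "hit"
        evicted = self.lines[index]
        if block in self.victims:
            # Victim hit: swap the requested block with the line it displaces.
            del self.victims[block]
            self.lines[index] = block
            self._to_victims(evicted)
            return "victim_hit"
        # True miss: fetch from memory and move the displaced line to the victim cache.
        self.misses += 1
        self.lines[index] = block
        self._to_victims(evicted)
        return "miss"

    def _to_victims(self, block):
        if block is None or self.victim_entries == 0:
            return
        self.victims[block] = True
        self.victims.move_to_end(block)
        while len(self.victims) > self.victim_entries:
            self.victims.popitem(last=False)  # drop the least recently inserted victim


def count_misses(trace, num_sets, victim_entries):
    """Replay a trace of block addresses and return the number of memory fetches."""
    cache = DirectMappedCache(num_sets, victim_entries)
    for block in trace:
        cache.access(block)
    return cache.misses


# Blocks 0 and 8 map to the same set of an 8-set cache, standing in for a
# prefetched block that repeatedly displaces data the processor still needs.
conflict_trace = [0, 8] * 4
misses_without_victim = count_misses(conflict_trace, num_sets=8, victim_entries=0)
misses_with_victim = count_misses(conflict_trace, num_sets=8, victim_entries=1)
```

In this assumed trace every access misses without a victim cache, while a single victim entry leaves only the two cold misses. Real workloads will show a smaller effect that depends on the access pattern and the victim cache size.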


Index Terms

Effective cache prefetching on bus-based multiprocessors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in

ACM Transactions on Computer Systems Volume 13, Issue 1
Feb. 1995
88 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/200912

Copyright © 1995 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bus-based multiprocessors
cache prefetching
false sharing
memory latency hiding