Mechanisms for store-wait-free multiprocessors

Authors:
Thomas F. Wenisch, Carnegie Mellon University, Pittsburgh, PA
Anastasia Ailamaki, Carnegie Mellon University, Pittsburgh, PA
Babak Falsafi, Carnegie Mellon University, Pittsburgh, PA
Andreas Moshovos, University of Toronto, Toronto, ON, Canada

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture, June 2007, Pages 266–277
https://doi.org/10.1145/1250662.1250696

Published: 09 June 2007

ABSTRACT

Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation, that is, by enforcing order only when dynamically necessary. Unfortunately, past designs provide insufficient buffering, replace all stores with read-modify-write operations, and/or recover from ordering violations via impractical fine-grained rollback mechanisms.

We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order) and 9% (SPARC relaxed memory order).
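
The contrast between a conventional store buffer and a buffer that keeps speculative values in the L1 cache can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the design evaluated in the paper: the class names, the dictionary standing in for the L1 cache, the unbounded FIFO that records store order, and the drain_one helper are all hypothetical and chosen for illustration.

```python
from collections import deque


class ConventionalStoreBuffer:
    """Toy model: every load searches all buffered stores by address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # (addr, value) pairs, oldest first

    def store(self, addr, value):
        if len(self.entries) >= self.capacity:
            return False  # buffer full: a real processor would stall here
        self.entries.append((addr, value))
        return True

    def load(self, addr, memory):
        # Associative search, youngest to oldest; cost grows with capacity.
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return memory.get(addr, 0)


class L1BackedStoreBuffer:
    """Toy model (assumed structure): speculative values live in the L1,
    and a FIFO records store order without ever being searched."""

    def __init__(self):
        self.l1 = {}          # addr -> private/speculative value
        self.order = deque()  # (addr, value) in program order

    def store(self, addr, value):
        self.l1[addr] = value
        self.order.append((addr, value))
        return True

    def load(self, addr, memory):
        # A plain cache lookup; no search over pending stores.
        if addr in self.l1:
            return self.l1[addr]
        return memory.get(addr, 0)

    def drain_one(self, memory):
        # Make the oldest pending store globally visible, in order.
        addr, value = self.order.popleft()
        memory[addr] = value
```

In this model the conventional buffer rejects a store once it is full and pays a search on every load, while the L1-backed variant answers loads with a single lookup and only touches the FIFO when draining. Capacity limits, cache evictions, and coherence traffic are deliberately left out.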

Index Terms

Mechanisms for store-wait-free multiprocessors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published in
ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007
542 pages
ISBN: 9781595937063
DOI: 10.1145/1250662
General Chair: Dean Tullsen, University of California, San Diego
Program Chair: Brad Calder, Microsoft & University of California, San Diego

ACM SIGARCH Computer Architecture News, Volume 35, Issue 2
May 2007
527 pages
ISSN: 0163-5964
DOI: 10.1145/1273440

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
memory consistency models
store buffer design
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%