ABSTRACT
Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and the ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation, that is, by enforcing order only when dynamically necessary. Unfortunately, past designs provide insufficient buffering, replace all stores with read-modify-write operations, and/or recover from ordering violations via impractical fine-grained rollback mechanisms.
We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order) and 9% (SPARC relaxed memory order).
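As a rough software analogy for the buffering idea only (not the hardware design evaluated in the paper), the toy model below contrasts a conventional store buffer, whose loads must search every pending store, with a buffer that writes private values into an L1-like table and keeps a separate FIFO purely for ordering. The class names, the dictionary standing in for the L1 cache, and the `drain_one` method are all assumptions made for illustration.

```python
from collections import deque


class ConventionalStoreBuffer:
    """Toy model: loads search all pending stores (associative lookup)."""

    def __init__(self):
        self.entries = deque()  # (addr, value) pairs in program order

    def store(self, addr, value):
        self.entries.append((addr, value))

    def load(self, addr, memory):
        # Scan youngest to oldest; the cost grows with the buffer size.
        for pending_addr, pending_value in reversed(self.entries):
            if pending_addr == addr:
                return pending_value
        return memory.get(addr, 0)


class ToyScalableStoreBuffer:
    """Toy model: values live in an L1-like table; the FIFO is never searched."""

    def __init__(self):
        self.l1 = {}         # hypothetical stand-in for the L1 cache
        self.fifo = deque()  # records store order only

    def store(self, addr, value):
        self.l1[addr] = value            # private value goes straight to "L1"
        self.fifo.append((addr, value))  # order is kept separately

    def load(self, addr, memory):
        # One indexed lookup; no scan of pending stores.
        if addr in self.l1:
            return self.l1[addr]
        return memory.get(addr, 0)

    def drain_one(self, memory):
        # The oldest store becomes globally visible, in program order.
        addr, value = self.fifo.popleft()
        memory[addr] = value


memory = {}
ssb = ToyScalableStoreBuffer()
ssb.store(0x10, 1)
ssb.store(0x10, 2)
newest = ssb.load(0x10, memory)  # forwarded from the L1-like table
ssb.drain_one(memory)
after_first_drain = memory[0x10]
```

In this sketch the load cost of `ToyScalableStoreBuffer` does not depend on how many stores are pending, which is the property the abstract attributes to removing the associative search. Coherence, cache capacity, and rollback are deliberately left out.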