DOI: 10.1145/1250662.1250696
Article

Mechanisms for store-wait-free multiprocessors

Published: 09 June 2007

ABSTRACT

Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation, enforcing order only when dynamically necessary. Unfortunately, past designs provide insufficient buffering, replace all stores with read-modify-write operations, or recover from ordering violations via impractical fine-grained rollback mechanisms.
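The ordering constraints referred to above can be illustrated with the classic store-buffering litmus test. The sketch below is not taken from the paper: it is a toy Python model, and the names `VARS`, `run`, `legal`, and `outcomes` are hypothetical. It assumes two threads (thread 0 stores to `x` then loads `y`; thread 1 stores to `y` then loads `x`) and enumerates every legal interleaving, once with stores written straight to memory (a sequentially consistent view) and once with a per-thread store buffer that may drain after the following load.

```python
from itertools import permutations

# Assumption: thread id -> (variable it stores 1 to, variable it then loads).
VARS = {0: ("x", "y"), 1: ("y", "x")}

def run(order, buffered):
    """Execute one interleaving and return the loaded values (r0, r1)."""
    memory = {"x": 0, "y": 0}
    buffers = {0: {}, 1: {}}
    regs = {}
    for tid, kind in order:
        st_var, ld_var = VARS[tid]
        if kind == "store":
            if buffered:
                buffers[tid][st_var] = 1   # held privately until drained
            else:
                memory[st_var] = 1         # globally visible at once
        elif kind == "drain":
            memory.update(buffers[tid])
            buffers[tid].clear()
        else:  # "load": forward from the thread's own buffer, else read memory
            regs[tid] = buffers[tid].get(ld_var, memory[ld_var])
    return (regs[0], regs[1])

def legal(order):
    """Each thread's store precedes its load and precedes its own drain."""
    pos = {event: i for i, event in enumerate(order)}
    for tid in (0, 1):
        if pos[(tid, "store")] > pos[(tid, "load")]:
            return False
        if (tid, "drain") in pos and pos[(tid, "store")] > pos[(tid, "drain")]:
            return False
    return True

def outcomes(buffered):
    kinds = ("store", "load", "drain") if buffered else ("store", "load")
    events = [(tid, kind) for tid in (0, 1) for kind in kinds]
    return {run(o, buffered) for o in permutations(events) if legal(o)}

if __name__ == "__main__":
    print("without store buffers:", sorted(outcomes(False)))
    print("with store buffers:   ", sorted(outcomes(True)))
```

In this model the outcome `(0, 0)` appears only when store buffers are allowed to delay visibility. That is why a strict model must either stall a load behind an outstanding store or, as the abstract suggests, speculate and enforce order only when another processor could observe the difference.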

We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order) and 9% (SPARC relaxed memory order).
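To make the buffering contrast concrete, here is a minimal sketch that is not the paper's design. It assumes a toy conventional buffer whose loads compare the address against every entry (standing in for the associative search), and a toy cache-resident variant that writes values directly into an L1-like dictionary and keeps only a FIFO of store order. The class names, the `capacity` parameter, and the unbounded FIFO are all assumptions made for illustration.

```python
from collections import deque

class ConventionalStoreBuffer:
    """Toy model: bounded buffer that every load must search by address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # (addr, value) pairs, oldest first

    def store(self, addr, value):
        if len(self.entries) >= self.capacity:
            return False        # buffer full: the store would stall
        self.entries.append((addr, value))
        return True

    def load(self, addr, memory):
        # Stand-in for the associative search: check every entry, youngest first.
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return memory.get(addr, 0)

    def drain_one(self, memory):
        if self.entries:
            a, v = self.entries.popleft()
            memory[a] = v

class CacheResidentStoreBuffer:
    """Toy model (assumption): values live in the L1; a FIFO records order only."""

    def __init__(self):
        self.l1 = {}            # addr -> latest private value
        self.order = deque()    # store order, never searched by loads

    def store(self, addr, value):
        self.l1[addr] = value
        self.order.append((addr, value))
        return True

    def load(self, addr, memory):
        # One indexed lookup in the cache; no search over buffered stores.
        return self.l1.get(addr, memory.get(addr, 0))

    def drain_one(self, memory):
        if self.order:
            a, v = self.order.popleft()
            memory[a] = v
```

In this sketch the conventional buffer refuses a store once `capacity` entries are pending and its load cost grows with the number of entries, while the cache-resident variant answers loads with a single lookup. A real design would also have to handle cache evictions, coherence requests, and rollback, none of which is modeled here.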


Published in

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007
542 pages
ISBN: 9781595937063
DOI: 10.1145/1250662
General Chair: Dean Tullsen
Program Chair: Brad Calder

ACM SIGARCH Computer Architecture News, Volume 35, Issue 2
May 2007
527 pages
ISSN: 0163-5964
DOI: 10.1145/1273440
      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


