ABSTRACT
Cache hierarchies have traditionally been designed for use by a single application, thread, or core. As multi-threaded (MT) and multi-core (CMP) platform architectures emerge, their workloads range from single-threaded and multi-threaded applications to complex virtual machines (VMs). A shared cache resource is therefore consumed by several different entities, generating heterogeneous memory access streams that exhibit different locality properties and varying degrees of memory sensitivity. As a result, conventional cache management approaches that treat all memory accesses equally are bound to produce inefficient space utilization and poor performance, even for applications with good locality. To address this problem, this paper presents a new cache management framework (CQoS) that (1) recognizes the heterogeneity in memory access streams, (2) introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity, and (3) assigns and enforces priorities on streams based on latency sensitivity, locality degree, and application performance needs. To achieve this, we propose CQoS options for priority classification, priority assignment, and priority enforcement. We briefly describe the classification and assignment options, which range from user-driven and developer-driven to compiler-detected and flow-based approaches. Our focus in this paper is on CQoS mechanisms for priority enforcement: (1) selective cache allocation, (2) static/dynamic set partitioning, and (3) heterogeneous cache regions. We discuss the architectural design and implementation complexity of these options. To evaluate their performance trade-offs, we have modeled them in a cache simulator and evaluated their performance on CMP platforms running network-intensive server workloads.
Our simulation results show the effectiveness of the proposed options and make the case for CQoS in future multi-threaded/multi-core platforms: it improves shared-cache efficiency and thereby increases overall system performance.
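One of the enforcement mechanisms named above, static set partitioning, can be illustrated with a minimal sketch. The model below is a hypothetical illustration, not the paper's simulator: it assumes two priority classes and a fixed 75%/25% split of the sets, so that a low-priority stream can never evict blocks belonging to the high-priority partition.

```python
# Minimal sketch of static set partitioning in a shared cache.
# Hypothetical model (partition sizes and classes are assumptions,
# not the paper's configuration): each priority class is confined
# to its own contiguous range of sets.

BLOCK = 64        # cache line size in bytes
NUM_SETS = 1024   # total sets in the shared cache
ASSOC = 8         # ways per set

# Priority class -> half-open range of set indices it may use.
PARTITION = {"high": (0, 768), "low": (768, 1024)}

class PartitionedCache:
    def __init__(self):
        # Each set is an LRU-ordered list of block tags (front = MRU).
        self.sets = [[] for _ in range(NUM_SETS)]
        self.hits = self.misses = 0

    def _set_index(self, addr, prio):
        base, limit = PARTITION[prio]
        # Map the block address into this class's set range only.
        return base + (addr // BLOCK) % (limit - base)

    def access(self, addr, prio):
        idx = self._set_index(addr, prio)
        tag = addr // BLOCK   # full block number as tag (sketch-level)
        ways = self.sets[idx]
        if tag in ways:
            self.hits += 1
            ways.remove(tag)
        else:
            self.misses += 1
            if len(ways) >= ASSOC:
                ways.pop()    # evict the LRU block of this set only
        ways.insert(0, tag)   # promote to MRU
```

Because the set index is computed inside the class's own range, contention between classes is eliminated by construction; the trade-off is that each class sees a smaller effective cache, which is exactly the tension the paper's evaluation quantifies.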
Index Terms
- CQoS: a framework for enabling QoS in shared caches of CMP platforms