Atomic Vector Operations on Chip Multiprocessors

ACM SIGARCH Computer Architecture News, Volume 36, Issue 3, June 2008, pp. 441–452. https://doi.org/10.1145/1394608.1382154

Published: 01 June 2008

Abstract

The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for these codes. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that GLSC provides an average performance improvement of 54% on a set of important RMS (recognition, mining, and synthesis) kernels for 4-wide SIMD.
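To see why a reduction with sparse indices resists plain scatter-gather vectorization, consider a small software model. This is a minimal sketch under stated assumptions, not the paper's hardware design. The names `gather_linked`, `scatter_conditional`, and `vector_histogram_add` are hypothetical. The rule that the lowest-numbered lane wins an address conflict is an assumption made for illustration. Conflicts between threads are not modeled at all. The sketch covers only conflicts between lanes of a single vector.

```python
# Software model of a 4-wide vector histogram update (a parallel reduction).
# All function names here are hypothetical and chosen for illustration only.


def gather_linked(mem, idx, mask):
    """Read mem[idx[i]] for each active lane; inactive lanes yield None."""
    return [mem[i] if m else None for i, m in zip(idx, mask)]


def scatter_conditional(mem, idx, vals, mask):
    """Write each active lane unless an earlier lane already claimed the address.

    Returns a per-lane success mask. Assumption: within one vector, the
    lowest-numbered active lane wins an address conflict; losers must retry.
    """
    claimed = set()
    ok = []
    for i, v, m in zip(idx, vals, mask):
        if m and i not in claimed:
            mem[i] = v
            claimed.add(i)
            ok.append(True)
        else:
            ok.append(False)
    return ok


def vector_histogram_add(mem, idx):
    """Increment mem[idx[i]] once per lane, retrying lanes that lost a conflict."""
    todo = [True] * len(idx)
    while any(todo):
        old = gather_linked(mem, idx, todo)
        new = [o + 1 if t else None for o, t in zip(old, todo)]
        done = scatter_conditional(mem, idx, new, todo)
        # Only lanes that were active and failed remain active.
        todo = [t and not d for t, d in zip(todo, done)]
    return mem


# Three of the four lanes target bin 3, so a plain gather/add/scatter would
# lose two of the three increments.
bins = vector_histogram_add([0] * 8, [3, 5, 3, 3])
```

In this model the loop always terminates because the first active lane succeeds on every pass. The cost is one extra pass per additional lane that targets the same address.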


Published in
ACM SIGARCH Computer Architecture News Volume 36, Issue 3
June 2008
449 pages
ISSN:0163-5964
DOI:10.1145/1394608
ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture
June 2008
449 pages
ISBN:9780769531748
Copyright © 2008 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
SIMD
locks
multiprocessors
reductions
vector