Abstract
The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that, for 4-wide SIMD, GLSC improves performance by an average of 54% on a set of important recognition, mining, and synthesis (RMS) kernels.
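To make the idea of a vectorized atomic reduction concrete, the following is a minimal software sketch of a histogram update built from a gather step and a conditional scatter step that retries failed lanes. It is an illustrative model only, not the paper's hardware design: the function names `gather_linked`, `scatter_conditional`, `vector_atomic_add`, and `histogram`, the per-lane reservation list, and the rule that only the first lane per address succeeds are all assumptions made for this example. Only the 4-wide SIMD width comes from the abstract.

```python
SIMD_WIDTH = 4  # matches the 4-wide SIMD in the abstract; all else is assumed


def gather_linked(mem, idx, mask):
    """Gather mem[idx[i]] for active lanes and note a reservation per lane."""
    vals = [mem[i] if m else 0 for i, m in zip(idx, mask)]
    reservations = [i if m else None for i, m in zip(idx, mask)]
    return vals, reservations


def scatter_conditional(mem, idx, vals, reservations):
    """Store reserved lanes; when lanes alias, only the first one succeeds."""
    claimed = set()
    ok = []
    for i, v, r in zip(idx, vals, reservations):
        success = r is not None and i not in claimed
        if success:
            mem[i] = v
            claimed.add(i)
        ok.append(success)
    return ok


def vector_atomic_add(mem, idx, inc):
    """Add inc[i] to mem[idx[i]] for every lane, retrying lanes that failed."""
    pending = [True] * len(idx)
    while any(pending):
        vals, res = gather_linked(mem, idx, pending)
        new_vals = [v + d for v, d in zip(vals, inc)]
        ok = scatter_conditional(mem, idx, new_vals, res)
        pending = [p and not s for p, s in zip(pending, ok)]


def histogram(data, nbins):
    """Parallel-reduction style histogram, processed SIMD_WIDTH items at a time."""
    bins = [0] * nbins
    for start in range(0, len(data), SIMD_WIDTH):
        chunk = data[start:start + SIMD_WIDTH]
        vector_atomic_add(bins, chunk, [1] * len(chunk))
    return bins


if __name__ == "__main__":
    print(histogram([0, 1, 1, 3, 1, 1, 1, 1], 4))
```

In this model, lanes that target the same bin within one vector are serialized across retry iterations, while lanes with distinct bins complete in a single pass. That is the case where a vectorized atomic update would be expected to help most.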
Atomic Vector Operations on Chip Multiprocessors
ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture