- 1.I. Banicescu and S. E Hummel. Balancing processor loads and exploiting data locality in N-body simulations. In Proceedings of Supercomputing'95 (CD-ROM), San Diego, CA, Dec. 1995. Available from http://www.supercomp.orglscg5/proceedings/594.BHUM/SC95.HTM. Google ScholarDigital Library
- 2.T. Bially. Space-filling curves: Their generation and their application to bandwidth reduction. IEEE Transactions on Information Theory, IT-15(6):658-664, Nov. 1969.Google ScholarDigital Library
- 3.J. Bilmes, K. Asanovit, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of International Conference on Supercomputing, pages 340-347, Vienna, Austria, July 1997. Google ScholarDigital Library
- 4.R. D. Blumofe, C. E Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216, Santa Barbara, CA, July 1995. Also see http:Htheory.lcs.mit.edufcilk. Google ScholarDigital Library
- 5.S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, Oct. 1994. Google ScholarDigital Library
- 6.L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical tiling for improved superscalar performance. In International Parallel Processing Symposium, Apr. 1995. Google ScholarDigital Library
- 7.S. Chatterjee, J. R. Gilbert, R. Schreiber, and S.-H. Teng. Optimal evaluation of array expressions on massively parallel machines. ACM Trans. Prog. Lang. Syst., 17(1):123-156, Jan. 1995. Google ScholarDigital Library
- 8.M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 205-217, La Jolla, CA, June 1995. Google ScholarDigital Library
- 9.D. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware~Software Approach. Morgan Kaufmann, 1998. Google ScholarDigital Library
- 10.J. J. Dongarra, J. D. Croz, I. S. Duff, and S. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1): I-17, Jan. 1990. Google ScholarDigital Library
- 11.P. C, Fischer and R. L. Probert. Efficient procedures for using matrix algorithms. In Automata, Languages and Programming, number 14 in Lecture Notes in Computer Science, pages 413-427. Springer-Verlag, 1974. Google ScholarDigital Library
- 12.J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance with source code. In Proceedings of the Sixth ACM SIG- PLAN Symposium on Principles and Practice of Parallel Programming, pages 206-216, Las Vegas, NV, June 1997. Google ScholarDigital Library
- 13.M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings oflCASSP'98, volume 3, page 1381, Seattle, WA, 1998. IEEE.Google ScholarCross Ref
- 14.M. E Goodchild and A. W. Grandfield. Optimizing raster storage: an examination of four alternatives. In Proceedings of Auto-Carto 6, volume 1, pages 400-407, Ottawa, Oct. 1983.Google Scholar
- 15.M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, IL, Sept. 1992. Available as technical reports UILU-ENG-92-2237 and CRHC-92-19. Google ScholarDigital Library
- 16.E G. Gustavson. Recursion leads to automatic variable blocking for dense linearalgebra algorithms. 1BM Journal of Research and Development, 41 (6):737-755, Nov. 1997. Google ScholarDigital Library
- 17.N.J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadephia, 1996. Google ScholarDigital Library
- 18.D. Hilbert. 0ber stetige Abbildung einer Linie auf ein Fl~ichensttlck. Mathematische Annalen, 38:459--460, 1891.Google ScholarCross Ref
- 19.M.D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., C-38(12): 1612-1630, Dec. 1989. Google ScholarDigital Library
- 20.Y. C. Hu, S. L. Johnsson, and S.-H. Teng. High Performance Fortran for highly irregular problems. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 13-24, Las Vegas, NV, June 1997. Google ScholarDigital Library
- 21.S. E Hummel, I. Banicescu, C.-T. Wang, and J. Wein. Load balancing and data locality via fractiling: An experimental study. In Language, Compilers and Run- Time Systems for Scalable Computers. Kluwer Academic Publishers, 1995.Google Scholar
- 22.H.V. Jagadish. Linearclustering of objects with multiple attributes. In H. Garcia- Molina and H. V. Jagadish, editors, Proceedings of the 1990 ACM SIGMOD International Conference on Managementof Data, pages 332-342, Atlantic City, N J, May 1990. ACM, ACM Press. Published as SIGMOD RECORD 19(2), June 1990. Google ScholarDigital Library
- 23.K. Kennedy and U. Kremer. Automatic data layout for distributed memory machines. ACM Trans. Prog. Lang. Syst., 1998. To appear. Google ScholarDigital Library
- 24.K. Knobe, J. D. Lukas, and G. L. Steele Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. 1ournal of Parallel and Distributed Computing, 8(2): 102-118, Feb. 1990. Google ScholarDigital Library
- 25.C.H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. Scientific and Engineering Computation. The MIT Press, Cambridge, MA, 1994. Google ScholarDigital Library
- 26.M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Apr. 1991. Google ScholarDigital Library
- 27.R. Laurini. Graphical data bases built on Peano space-filling curves. In C. E. Vandoni, editor, Proceedings of the EUROGRAPHICS'85 Conference, pages 327-338, Amsterdam, 1985. North-Holland.Google Scholar
- 28.C. E. Leiserson. Personal communication, Aug. 1998.Google Scholar
- 29.M. E. Mace. Memory Storage Patterns in Parallel Processing. Kluwer international series in engineering and computer science. Kluwer Academic Press, NorwelI, MA, 1987. Google ScholarDigital Library
- 30.M. Mano. Digital Design. Prentice-Hall, 1984. Google ScholarDigital Library
- 31.B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properting of Hilbert space-filling curve. Technical Report CS-TR-3611, Computer Science Department, University of Maryland, College Park, MD, 1996. Google ScholarDigital Library
- 32.G. Peano. Sur une courbe qui remplit toute une aire plaine. Mathematische Annalen, 36:157-160, 1890.Google ScholarCross Ref
- 33.J. R. Pilkington and S. B. Baden. Dynamic partitioning of non-uniform structured workloads with spacefilling curves. IEEE Transactions on Parallel and Distributed Systems, 7(3):288-300, Mar. 1996. Google ScholarDigital Library
- 34.H. Sagan. Space-Filling Curves. Springer-Verlag, 1994. ISBN 0-387-94265-3.Google Scholar
- 35.J. P. Singh, T. Joe, J. L. Hennessy, and A. Gupta. An en~pirical comparison of the Kendall Square Research KSR- 1 and the Stanford DASH multiprocessors. In Proceedings of Supercomputing '93, pages 214-225, Portland, OR, Nov. 1993. Google ScholarDigital Library
- 36.V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354-356, 1969.Google ScholarDigital Library
- 37.M. Thottethodi, S. Chatterjee, and A. R. Lebeck. Tuning Strassen's matrix multiplication for memory efficiency. In Proceedings ofSC98 (CD-ROM), Orlando, FL, Nov. 1998. Available from http ://www.supercomp.org/sc98. Google ScholarDigital Library
- 38.M. S. Warren and J. K. Salmon. A parallel hashed Oct-Tree N-body algorithm. In Proceedings of Supercomputing'93, pages 12-21, Portland, OR, Nov. 1993. Google ScholarDigital Library
- 40.M.E. Wolf and M. S. Lain. A data locality optimizing algorithm. In Proceedings of the ACM S1GPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991. Google ScholarDigital Library
- 41.M.J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing'89, pages 655-664, Reno, NV, Nov. 1989. Google ScholarDigital Library
Index Terms
- Recursive array layouts and fast parallel matrix multiplication
Recommendations
A framework for practical parallel fast matrix multiplication
PPoPP '15Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's ...
Communication-optimal parallel algorithm for strassen's matrix multiplication
SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architecturesParallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The ...
Recursive Array Layouts and Fast Matrix Multiplication
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in ...
Comments