DOI: 10.1145/305619.305645

Recursive array layouts and fast parallel matrix multiplication

Published: 01 June 1999

Published in

SPAA '99: Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
June 1999
261 pages
ISBN: 1581131240
DOI: 10.1145/305619

Copyright © 1999 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

SPAA '99 Paper Acceptance Rate: 26 of 90 submissions, 29%
Overall Acceptance Rate: 447 of 1,461 submissions, 31%
