ABSTRACT
The uniform representation of 2-dimensional arrays serially in Morton order (or Z order) supports both their iterative scan with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it alleviates serious problems of locality and latency, and the tree structure helps to schedule multiprocessing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels of a hierarchical memory, independently of any specific runtime environment.
We have built a rudimentary C-to-C translator that implements matrices in Morton order from source code that presumes a row-major implementation. Early performance results from LAPACK's reference implementation of \texttt{dgesv} (a linear solver) and all its supporting routines (including the \texttt{dgemm} matrix multiplication) constitute a successful research demonstration. This performance predicts further improvements from new algebra in back-end optimizers.
We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted \texttt{dgemm}, running at 67% of peak flops. The same code performs similarly on several machines.
Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.
Index Terms
- Language support for Morton-order matrices