ABSTRACT
The uniform representation of 2-dimensional arrays serially in Morton order (or Z order) supports both their iterative scan with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it alleviates serious problems of locality and latency, and the tree structure helps to schedule multiprocessing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels of a hierarchical memory, independently of any specific runtime environment.
We have built a rudimentary C-to-C translator that implements matrices in Morton order from source code that presumes a row-major implementation. Early performance results from LAPACK's reference implementation of \texttt{dgesv} (a linear solver) and all its supporting routines (including the \texttt{dgemm} matrix multiplication) constitute a successful research demonstration. This performance predicts further improvements from new algebra in back-end optimizers.
We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted \texttt{dgemm}, running at 67% of peak flops. The same code performs similarly on several machines.
Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.
Index Terms
- Language support for Morton-order matrices