Automated empirical optimizations of software and the ATLAS project

https://doi.org/10.1016/S0167-8191(00)00087-9

Abstract

This paper describes the automatically tuned linear algebra software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management was created to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical, linear algebra kernel library.

Introduction

The automatically tuned linear algebra software (ATLAS) project is an ongoing research effort focused on applying empirical techniques to provide portable performance. Linear algebra routines are widely used in the computational sciences in general, and in scientific modeling in particular. In many of these applications, the performance of the linear algebra operations is the main constraint preventing the scientist from modeling more complex problems, which would then more closely match reality. This dictates an ongoing need for highly efficient routines: as more compute power becomes available, the scientist typically increases the complexity and accuracy of the model until the limits of the computational power are reached. Since many applications therefore have no practical notion of "enough" accuracy, it is important that each generation of increasingly powerful computers has optimized linear algebra routines available.

Linear algebra is rich in operations that are highly optimizable, in the sense that a highly tuned code may run orders of magnitude faster than a naively coded routine. These optimizations, however, are platform specific, such that an optimization for one computer architecture may actually cause a slowdown on another. The traditional method of handling this problem has been to produce hand-optimized routines for a given machine, a painstaking process typically requiring many months of effort from personnel highly trained in both linear algebra and computational optimization. The incredible pace of hardware evolution makes this approach untenable in the long run, particularly when one considers that the many software layers involved (e.g., operating systems and compilers), which also affect these kinds of optimizations, are changing at similar but independent rates.
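To make this concrete, consider the canonical example of matrix multiplication. The listing below is our own illustration, not code from ATLAS: it contrasts the naive triple loop with a cache-blocked variant. The blocking factor NB that makes the second version fast is inherently platform specific, which is precisely why a fixed, hand-chosen value cannot be portable.

/* Illustration only (not ATLAS source): naive vs. cache-blocked
 * matrix multiply, C = C + A*B, all matrices N x N, row-major. */
#include <stddef.h>

#define NB 64  /* hypothetical block size; the "right" value varies by machine */

/* The naive triple loop: streams B column-wise with poor cache reuse. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* The same operation, blocked so NB x NB tiles stay resident in cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                for (size_t i = ii; i < ii + NB && i < n; i++)
                    for (size_t k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i*n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < jj + NB && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}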

Therefore a new paradigm is needed for the production of highly efficient routines in the modern age of computing, and ATLAS represents an implementation of such a set of new techniques. We call this paradigm “automated empirical optimization of software”, or AEOS. In an AEOS-enabled package such as ATLAS, the package provides many ways of doing the required operations, and uses empirical timings in order to choose the best method for a given architecture. Thus, if written generally enough, an AEOS-aware package can automatically adapt to a new computer architecture in a matter of hours, rather than requiring months or even years of highly trained professionals' time, as dictated by traditional methods.
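The core of the idea can be sketched in a few lines of C. The listing below is a hypothetical illustration of the selection step only; the use of clock() and a single timing run per candidate are simplifying assumptions of ours, not ATLAS internals. Each candidate implementation of an operation is timed empirically, and the fastest becomes the installed kernel for that machine.

/* Hypothetical sketch of AEOS-style empirical selection (not ATLAS source).
 * The candidate kernels are assumed to be defined elsewhere, e.g., the two
 * matmul variants from the previous listing. */
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(size_t, const double *, const double *, double *);

extern void matmul_naive(size_t, const double *, const double *, double *);
extern void matmul_blocked(size_t, const double *, const double *, double *);

kernel_fn pick_best(size_t n, const double *A, const double *B, double *C)
{
    kernel_fn candidates[] = { matmul_naive, matmul_blocked };
    kernel_fn best = candidates[0];
    double best_time = 1e300;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        candidates[i](n, A, B, C);  /* one empirical timing run; a real
                                       search would repeat and average */
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = candidates[i];
        }
    }
    return best;  /* the winner becomes the installed kernel on this machine */
}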

ATLAS typically uses code generators (i.e., programs that write other programs) to provide the many different ways of doing a given operation, and employs sophisticated search scripts and robust timing mechanisms to find the best way of performing the operation on a given architecture.
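As a sketch of the generator idea in the spirit of ATLAS (again our own illustration, not an ATLAS source excerpt), the program below writes a dot-product kernel with a caller-chosen unroll factor; a search script would compile and time its output for several factors and keep the winner.

/* Hypothetical code generator: emits a C dot-product kernel unrolled by
 * a caller-chosen factor.  Each emitted variant is a candidate for the
 * empirical search. */
#include <stdio.h>

void emit_dot_kernel(FILE *out, int unroll)
{
    fprintf(out, "double dot(int n, const double *x, const double *y)\n{\n");
    fprintf(out, "    double s = 0.0;\n    int i;\n");
    fprintf(out, "    for (i = 0; i <= n - %d; i += %d) {\n", unroll, unroll);
    for (int u = 0; u < unroll; u++)          /* the unrolled body */
        fprintf(out, "        s += x[i+%d] * y[i+%d];\n", u, u);
    fprintf(out, "    }\n");
    fprintf(out, "    for (; i < n; i++) s += x[i] * y[i];\n");  /* cleanup */
    fprintf(out, "    return s;\n}\n");
}

int main(void)
{
    emit_dot_kernel(stdout, 4);  /* generate the unroll-by-4 variant */
    return 0;
}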

One of the main performance kernels of linear algebra has traditionally been a standard API known as the basic linear algebra subprograms (BLAS) [5], [6], [7], [14], [17]. This API is supported by the hand-tuned efforts of many hardware vendors, and thus provides a good first target for ATLAS: there is a large audience for the API, and on platforms where a vendor-supplied BLAS exists, comparison against it gives an easy way to determine whether ATLAS can provide the required level of performance.
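For readers unfamiliar with the API, the listing below shows a typical call to the level-3 BLAS routine GEMM (C = alpha*A*B + beta*C) through the standard C interface (CBLAS), which ATLAS also supplies; the link line in the comment assumes an ATLAS installation.

/* Usage sketch of the level-3 BLAS routine DGEMM via CBLAS.
 * Build against ATLAS with, e.g., -lcblas -latlas (or a vendor BLAS). */
#include <cblas.h>

int main(void)
{
    double A[2*2] = { 1, 2,
                      3, 4 };
    double B[2*2] = { 5, 6,
                      7, 8 };
    double C[2*2] = { 0, 0,
                      0, 0 };

    /* C = 1.0 * A * B + 0.0 * C, row-major 2x2 matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    return 0;
}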

AEOS in context

Historically, the research community has pursued two separate paths towards the goal of making software run at near-peak levels. The first, and most generally successful, of these builds on research into compilers and their associated technologies. The holy grail of compilation research is to take arbitrary code as input and produce completely optimal code as output for given languages and hardware platforms. Despite the immense amount of effort that has been poured into this approach, its ultimate goal remains far from realized.

ATLAS

ATLAS is the project from which our current understanding of AEOS methodologies grew, and now provides a test bed for their further development and testing. ATLAS was not, however, the first project to harness AEOS-like techniques for library production and maintenance. As far as we know, the first such successful project was FFTW [9], [10], [11], and the PHiPAC [3] project was the first to attempt to apply them to matrix multiply. Other projects with AEOS-like designs include [18], [19], [20].

Conclusion and future work

Results presented and referenced here demonstrate unambiguously that AEOS techniques can be utilized to build portable performance-critical libraries, which compete favorably with machine-specific, hand-tuned codes. We believe that the AEOS paradigm will ultimately have a major impact on high performance library development and maintenance.

ATLAS has produced a complete BLAS, and the ATLAS BLAS are already widely used in the linear algebra community. Further information, including the software itself, is available from the ATLAS home page.

References (24)

• E. Anderson et al., LAPACK Users' Guide (1995)
• Z. Bai et al., The spectral decomposition of nonsymmetric matrices on distributed-memory computers, SIAM J. Sci. Comput. (1997)
• J. Bilmes, K. Asanovic, J. Demmel, D. Lam, C. Chin, Optimizing Matrix Multiply using PHiPAC: A Portable, ...
• M. Dayde et al., A parallel block implementation of level 3 BLAS for MIMD vector processors, ACM Trans. Math. Software (1994)
• J. Dongarra et al., A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software (1990)
• J. Dongarra et al., Algorithm 656: An extended set of basic linear algebra subprograms: Model implementation and test programs, ACM Trans. Math. Software (1988)
• J. Dongarra et al., An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Software (1988)
• J. Dongarra et al., The IBM RISC System and linear algebra operations, Supercomputer (1991)
• M. Frigo, A Fast Fourier Transform Compiler, in: Proceedings of the ACM SIGPLAN Conference on Programming Language ...
• M. Frigo, FFTW: An Adaptive Software Architecture for the FFT, in: Proceedings of the ICASSP Conference, vol. 3, 1998, ...
• M. Frigo, S.G. Johnson, The Fastest Fourier Transform in the West, Technical Report MIT-LCS-TR-728, Massachusetts ...
• F. Gustavson, A. Henriksson, I. Jonsson, B. Kågström, P. Ling, Recursive blocked data formats and BLAS's for dense ...

This work was supported in part by the US Department of Energy under contract number DE-AC05-96OR22464; by National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615; and by Los Alamos National Laboratory, University of California, under subcontract #B76680017-3Z.
