Software pipelining: an effective scheduling technique for VLIW machines

Author:
Monica S. Lam

Stanford University

ACM SIGPLAN Notices, Volume 39, Issue 4, April 2004, pp. 244–256. https://doi.org/10.1145/989393.989420

Published: 1 April 2004

Abstract

The basic idea behind software pipelining was first developed by Patel and Davidson for scheduling hardware pipelines. As instruction-level parallelism made its way into general-purpose computing, it became necessary to automate scheduling. How and whether instructions can be scheduled statically has major ramifications for the design of computer architectures. Rau and Glaeser were the first to use software pipelining in a compiler, for a machine with specialized hardware designed to support it. In the meantime, trace scheduling was touted as the scheduling technique of choice for VLIW (Very Long Instruction Word) machines. The most important contribution of this paper is to show that software pipelining is effective on VLIW machines without complicated hardware support. Our understanding of software pipelining subsequently deepened with the work of many others. Today, software pipelining is used in all advanced compilers for machines with instruction-level parallelism, none of which, except the Intel Itanium, relies on any specialized support for software pipelining.

This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

This paper extends previous results of software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained.

The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
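
The central idea in the abstract, that loop iterations are initiated at constant intervals before the preceding iterations complete, can be illustrated with a small sketch. This is not the paper's scheduling algorithm. It only shows the overlapped timeline that results once a schedule for one iteration and an initiation interval have been chosen. The three-operation loop body (`load`, `mul`, `store`), the one-cycle latencies, and the names `software_pipeline`, `total_cycles`, and `ii` are all hypothetical. The sketch also assumes that each operation runs on its own functional unit and that there are no loop-carried dependences, so an interval of one cycle causes no conflicts.

```python
from collections import defaultdict


def software_pipeline(body, n_iters, ii):
    """Overlap iterations of one loop body.

    body    -- list of (op_name, start_cycle) for ONE iteration (hypothetical schedule)
    n_iters -- number of loop iterations
    ii      -- initiation interval: iteration k starts at cycle k * ii
    Returns a dict mapping each cycle to the operations issued in it.
    """
    timeline = defaultdict(list)
    for k in range(n_iters):
        for op, start in body:
            # Every iteration reuses the same schedule, shifted by k * ii cycles.
            timeline[k * ii + start].append(f"{op}[{k}]")
    return dict(sorted(timeline.items()))


def total_cycles(body_length, n_iters, ii):
    """Cycles needed when a new iteration starts every ii cycles."""
    return (n_iters - 1) * ii + body_length


# Assumed loop body: three single-cycle operations issued back to back.
BODY = [("load", 0), ("mul", 1), ("store", 2)]
BODY_LENGTH = 3

timeline = software_pipeline(BODY, n_iters=5, ii=1)
for cycle, ops in timeline.items():
    print(cycle, ops)

print("pipelined cycles:", total_cycles(BODY_LENGTH, 5, 1))
print("sequential cycles:", 5 * BODY_LENGTH)
```

Under these assumptions, the first two cycles fill the pipeline, the middle cycles each issue one operation from three different iterations (the repeating steady state), and the last two cycles drain it. Because the steady-state pattern is identical from cycle to cycle, a compiler can emit it once as a loop, which is one way to read the abstract's claim about compact object code.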

References

S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768–1810, November 1994.
Paul Feautrier. Fine-grain scheduling under resource constraints. In The 7th Annual Workshop on Languages and Compilers for Parallel Computing, pages 1–15, 1994.
R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85–94, November 1994.
R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258–267, June 1993.
P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg. The Multiflow trace scheduling compiler. The Journal of Supercomputing, 7(1–2):51–142, 1993.
Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. In Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29–42, Charleston, South Carolina, 1993.
B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183–198, October 1981.
B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12–35, January 1989.
J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: optimal vs. heuristic methods in a production compiler. In Proceedings of the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 1–11, 1996.
M. S. Schlansker and B. R. Rau. EPIC: explicitly parallel instruction computing. IEEE Computer, 33(2):37–45, February 2000.
M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 274–286, December 1996.
Aiken, A. and Nicolau, A. Perfect Pipelining: A New Loop Parallelization Technique. Cornell University, Oct., 1987.
Annaratone, M., Bitz, F., Clune, E., Kung, H. T., Maulik, P., Ribas, H., Tseng, P., and Webb, J. Applications of Warp. Proc. Compcon Spring 87, San Francisco, Feb., 1987, pp. 272–275.
Annaratone, M., Bitz, F., Deutch, J., Hamey, L., Kung, H. T., Maulik, P. C., Tseng, P., and Webb, J. A. Applications Experience on Warp. Proc. 1987 National Computer Conference, AFIPS, Chicago, June, 1987, pp. 149–158.
Annaratone, M., Arnould, E., Gross, T., Kung, H. T., Lam, M., Menzilcioglu, O. and Webb, J. A. "The Warp Computer: Architecture, Implementation and Performance". IEEE Transactions on Computers C-36, 12 (December 1987).
Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 180–192.
Dantzig, G. B., Blattner, W. O. and Rao, M. R. All Shortest Routes from a Fixed Origin in a Graph. Theory of Graphs, Rome, July, 1967, pp. 85–90.
Ebcioglu, Kemal. A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987.
Ellis, John R. Bulldog: A Compiler for VLIW Architectures. Ph.D. Th., Yale University, 1985.
Fisher, J. A. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. Th., New York Univ., Oct. 1979.
Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction". IEEE Trans. on Computers C-30, 7 (July 1981), 478–490.
Fisher, J. A., Ellis, J. R., Ruttenberg, J. C. and Nicolau, A. Parallel Processing: A Smart Compiler and a Dumb Machine. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, Montreal, Canada, June, 1984, pp. 37–47.
Fisher, J. A., Landskov, D. and Shriver, B. D. Microcode Compaction: Looking Backward and Looking Forward. Proc. 1981 National Computer Conference, 1981, pp. 95–102.
Floyd, R. W. "Algorithm 97: Shortest Path". Comm. ACM 5, 6 (1962), 345.
Garey, Michael R. and Johnson, David S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
Gross, T. and Lam, M. Compilation for a High-performance Systolic Array. Proc. ACM SIGPLAN 86 Symposium on Compiler
Hsu, Peter. Highly Concurrent Scalar Processing. Ph.D. Th., University of Illinois at Urbana-Champaign, 1986.
Isoda, Sadahiro, Kobayashi, Yoshizumi, and Ishida, Toru. "Global Compaction of Horizontal Microprograms Based on the Generalized Data Dependency Graph". IEEE Trans. on Computers C-32, 10 (October 1983), 922–933.
Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B. and Wolfe, M. Dependence Graphs and Compiler Optimizations. Proc. ACM Symposium on Principles of Programming Languages, January, 1981, pp. 207–218.
Lah, J. and Atkin, E. Tree Compaction of Microprograms. Proc. 16th Annual Workshop on Microprogramming, Oct., 1982, pp. 23–33.
Lam, Monica. Compiler Optimizations for Asynchronous Systolic Array Programs. Proc. Fifteenth Annual ACM Symposium on Principles of Programming Languages, Jan., 1988.
Lam, Monica. A Systolic Array Optimizing Compiler. Ph.D. Th., Carnegie Mellon University, May 1987.
Linn, Joseph L. SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information. Proc. 16th Annual Workshop on Microprogramming, 1983, pp. 11–22.
McMahon, F. H. Lawrence Livermore National Laboratory FORTRAN Kernels: MFLOPS.
Patel, Janak H. and Davidson, Edward S. Improving the Throughput of a Pipeline by Insertion of Delays. Proc. 3rd Annual Symposium on Computer Architecture, Jan., 1976, pp. 159–164.
Rau, B. R. and Glaeser, C. D. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proc. 14th Annual Workshop on Microprogramming, Oct., 1981, pp. 183–198.
Su, B., Ding, S. and Jin, L. An Improvement of Trace Scheduling for Global Microcode Compaction. Proc. 17th Annual Workshop in Microprogramming, Dec., 1984, pp. 78–85.
Su, B., Ding, S., Wang, J. and Xia, J. GURPR -- A Method for Global Software Pipelining. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987, pp. 88–96.
Su, B., Ding, S. and Xia, J. URPR -- An Extension of URCR for Software Pipeline. Proc. 19th Annual Workshop on Microprogramming, Oct., 1986, pp. 104–108.
Tarjan, R. E. "Depth first search and linear graph algorithms". SIAM J. Computing 1, 2 (1972), 146–160.
Touzeau, R. F. A Fortran Compiler for the FPS-164 Scientific Computer. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, June, 1984, pp. 48–57.
Weiss, S. and Smith, J. E. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 105–109.
Wood, Graham. Global Optimization of Microprograms Through Modular Control Constructs. Proc. 12th Annual Workshop in Microprogramming, 1979, pp. 1–6.

Index Terms

Software pipelining: an effective scheduling technique for VLIW machines
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
2. Theory of computation

Index terms have been assigned to the content through auto-classification.

Published in
ACM SIGPLAN Notices Volume 39, Issue 4
20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999: A Selection
April 2004
673 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/989393
Editor: Kathryn S. McKinley, The University of Texas at Austin, USA
Copyright © 2004 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- article