Abstract
The basic idea behind software pipelining was first developed by Patel and Davidson for scheduling hardware pipelines. As instruction-level parallelism made its way into general-purpose computing, it became necessary to automate scheduling. How and whether instructions can be scheduled statically have major ramifications on the design of computer architectures. Rau and Glaeser were the first to use software pipelining in a compiler for a machine with specialized hardware designed to support software pipelining. In the meantime, trace scheduling was touted as the scheduling technique of choice for VLIW (Very Long Instruction Word) machines. The most important contribution of this paper is to show that software pipelining is effective on VLIW machines without complicated hardware support. Our understanding of software pipelining subsequently deepened with the work of many others. Today, software pipelining is used in all advanced compilers for machines with instruction-level parallelism, none of which, except the Intel Itanium, relies on any specialized support for software pipelining.
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.
This paper extends previous results of software pipelining in two ways. First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained.
The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
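The core idea in the abstract, initiating loop iterations at a constant interval before earlier ones complete, can be illustrated with a small sketch. This is not the paper's scheduling algorithm; it only lays out an idealized schedule under stated assumptions: a hypothetical three-stage loop body (`load`, `mul`, `store`), one cycle per stage, one functional unit per stage, and no dependences between iterations. The function names and the stage list are invented for illustration.

```python
def pipelined_schedule(n_iters, stages, ii):
    """Map each cycle to the (iteration, stage) pairs issued in it.

    Assumes every stage takes one cycle and iteration k starts at
    cycle k * ii, where ii is the (constant) initiation interval.
    """
    schedule = {}
    for it in range(n_iters):
        for offset, stage in enumerate(stages):
            cycle = it * ii + offset
            schedule.setdefault(cycle, []).append((it, stage))
    return schedule


def total_cycles(n_iters, n_stages, ii):
    """Cycles to finish all iterations when one starts every ii cycles."""
    return (n_iters - 1) * ii + n_stages


# Hypothetical loop body; the stage names are assumptions for illustration.
stages = ["load", "mul", "store"]
sched = pipelined_schedule(5, stages, ii=1)

for cycle in sorted(sched):
    print(cycle, sched[cycle])

sequential = 5 * len(stages)               # no overlap between iterations
pipelined = total_cycles(5, len(stages), 1)
print("sequential:", sequential, "pipelined:", pipelined)
```

In this idealized layout, the first and last few cycles hold partially filled rows (the fill and drain of the pipeline), while the cycles in between repeat the same pattern of stages drawn from different iterations. That repeating pattern is what lets the steady state be emitted as a single compact loop body.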
- S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768--1810, November 1994.
- Paul Feautrier. Fine-grain scheduling under resource constraints. In The 7th Annual Workshop on Languages and Compilers for Parallel Computing, pages 1--15, 1994.
- R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85--94, November 1994.
- R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258--267, June 1993.
- P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg. The Multiflow trace scheduling compiler. The Journal of Supercomputing, 7(1--2):51--142, 1993.
- Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. In Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29--42, Charleston, South Carolina, 1993.
- B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183--198, October 1981.
- B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12--35, January 1989.
- J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: optimal vs. heuristic methods in a production compiler. In Proceedings of the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 1--11, 1996.
- M. S. Schlansker and B. R. Rau. EPIC: explicitly parallel instruction computing. IEEE Computer, 33(2):37--45, February 2000.
- M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 274--286, December 1996.
- Aiken, A. and Nicolau, A. Perfect Pipelining; A New Loop Parallelization Technique. Cornell University, Oct., 1987.
- Annaratone, M., Bitz, F., Clune, E., Kung, H. T., Maulik, P., Ribas, H., Tseng, P., and Webb, J. Applications of Warp. Proc. Compcon Spring 87, San Francisco, Feb., 1987, pp. 272--275.
- Annaratone, M., Bitz, F., Deutch, J., Hamey, L., Kung, H. T., Maulik, P. C., Tseng, P., and Webb, J. A. Applications Experience on Warp. Proc. 1987 National Computer Conference, AFIPS, Chicago, June, 1987, pp. 149--158.
- Annaratone, M., Arnould, E., Gross, T., Kung, H. T., Lam, M., Menzilcioglu, O. and Webb, J. A. "The Warp Computer: Architecture, Implementation and Performance". IEEE Transactions on Computers C-36, 12 (December 1987).
- Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 180--192.
- Dantzig, G. B., Blattner, W. O. and Rao, M. R. All Shortest Routes from a Fixed Origin in a Graph. Theory of Graphs, Rome, July, 1967, pp. 85--90.
- Ebcioglu, Kemal. A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987.
- Ellis, John R. Bulldog: A Compiler for VLIW Architectures. Ph.D. Th., Yale University, 1985.
- Fisher, J. A. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. Th., New York Univ., Oct. 1979.
- Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction". IEEE Trans. on Computers C-30, 7 (July 1981), 478--490.
- Fisher, J. A., Ellis, J. R., Ruttenberg, J. C. and Nicolau, A. Parallel Processing: A Smart Compiler and a Dumb Machine. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, Montreal, Canada, June, 1984, pp. 37--47.
- Fisher, J. A., Landskov, D. and Shriver, B. D. Microcode Compaction: Looking Backward and Looking Forward. Proc. 1981 National Computer Conference, 1981, pp. 95--102.
- Floyd, R. W. "Algorithm 97: Shortest Path". Comm. ACM 5, 6 (1962), 345.
- Garey, Michael R. and Johnson, David S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
- Gross, T. and Lam, M. Compilation for a High-performance Systolic Array. Proc. ACM SIGPLAN 86 Symposium on Compiler
- Hsu, Peter. Highly Concurrent Scalar Processing. Ph.D. Th., University of Illinois at Urbana-Champaign, 1986.
- Isoda, Sadahiro, Kobayashi, Yoshizumi, and Ishida, Toru. "Global Compaction of Horizontal Microprograms Based on the Generalized Data Dependency Graph". IEEE Trans. on Computers C-32, 10 (October 1983), 922--933.
- Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B. and Wolfe, M. Dependence Graphs and Compiler Optimizations. Proc. ACM Symposium on Principles of Programming Languages, January, 1981, pp. 207--218.
- Lah, J. and Atkin, E. Tree Compaction of Microprograms. Proc. 16th Annual Workshop on Microprogramming, Oct., 1982, pp. 23--33.
- Lam, Monica. Compiler Optimizations for Asynchronous Systolic Array Programs. Proc. Fifteenth Annual ACM Symposium on Principles of Programming Languages, Jan., 1988.
- Lam, Monica. A Systolic Array Optimizing Compiler. Ph.D. Th., Carnegie Mellon University, May 1987.
- Linn, Joseph L. SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information. Proc. 16th Annual Workshop on Microprogramming, 1983, pp. 11--22.
- McMahon, F. H. Lawrence Livermore National Laboratory FORTRAN Kernels: MFLOPS.
- Patel, Janak H. and Davidson, Edward S. Improving the Throughput of a Pipeline by Insertion of Delays. Proc. 3rd Annual Symposium on Computer Architecture, Jan., 1976, pp. 159--164.
- Rau, B. R. and Glaeser, C. D. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proc. 14th Annual Workshop on Microprogramming, Oct., 1981, pp. 183--198.
- Su, B., Ding, S. and Jin, L. An Improvement of Trace Scheduling for Global Microcode Compaction. Proc. 17th Annual Workshop in Microprogramming, Dec., 1984, pp. 78--85.
- Su, B., Ding, S., Wang, J. and Xia, J. GURPR -- A Method for Global Software Pipelining. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987, pp. 88--96.
- Su, B., Ding, S. and Xia, J. URPR -- An Extension of URCR for Software Pipeline. Proc. 19th Annual Workshop on Microprogramming, Oct., 1986, pp. 104--108.
- Tarjan, R. E. "Depth first search and linear graph algorithms". SIAM J. Computing 1, 2 (1972), 146--160.
- Touzeau, R. F. A Fortran Compiler for the FPS-164 Scientific Computer. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, June, 1984, pp. 48--57.
- Weiss, S. and Smith, J. E. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 105--109.
- Wood, Graham. Global Optimization of Microprograms Through Modular Control Constructs. Proc. 12th Annual Workshop in Microprogramming, 1979, pp. 1--6.
Index Terms
- Software pipelining: an effective scheduling technique for VLIW machines