
Software pipelining: an effective scheduling technique for VLIW machines

Published: 01 April 2004

Abstract

The basic idea behind software pipelining was first developed by Patel and Davidson for scheduling hardware pipelines. As instruction-level parallelism made its way into general-purpose computing, it became necessary to automate scheduling. How and whether instructions can be scheduled statically has major ramifications for the design of computer architectures. Rau and Glaeser were the first to use software pipelining in a compiler, for a machine with specialized hardware designed to support software pipelining. In the meantime, trace scheduling was touted as the scheduling technique of choice for VLIW (Very Long Instruction Word) machines. The most important contribution of this paper is to show that software pipelining is effective on VLIW machines without complicated hardware support. Our understanding of software pipelining subsequently deepened with the work of many others. Today, software pipelining is used in all advanced compilers for machines with instruction-level parallelism, none of which, except the Intel Itanium, relies on any specialized support for software pipelining.

This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

This paper extends previous results of software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained.

The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
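
The abstract describes software pipelining as initiating loop iterations at constant intervals before the preceding iterations complete. The following is a minimal sketch of that idea, not the paper's algorithm: it assumes a hypothetical loop body with three one-cycle stages (load, multiply, store) and an initiation interval of one. The function names and the cycle model are illustrative assumptions.

```python
def sequential(a, k):
    """Reference loop: each iteration loads, multiplies, then stores."""
    out = [0] * len(a)
    for i in range(len(a)):
        x = a[i]        # stage 0: load
        y = x * k       # stage 1: multiply
        out[i] = y      # stage 2: store
    return out


def software_pipelined(a, k):
    """Same loop, hand-pipelined with a new iteration started every step.

    In each kernel step the store of iteration i, the multiply of
    iteration i+1 and the load of iteration i+2 touch different values,
    so a VLIW machine could issue them in one instruction word.
    """
    n = len(a)
    if n < 2:
        return sequential(a, k)   # too few iterations to overlap
    out = [0] * n
    # Prologue: fill the pipeline.
    x = a[0]                      # load(0)
    y = x * k                     # multiply(0)
    x = a[1]                      # load(1)
    # Kernel (steady state): one store, one multiply, one load per step.
    for i in range(n - 2):
        out[i] = y                # store(i)
        y = x * k                 # multiply(i+1)
        x = a[i + 2]              # load(i+2)
    # Epilogue: drain the iterations still in flight.
    out[n - 2] = y                # store(n-2)
    y = x * k                     # multiply(n-1)
    out[n - 1] = y                # store(n-1)
    return out


def pipelined_cycles(n, depth=3, interval=1):
    """Cycles for n iterations under the assumed one-cycle-per-stage model."""
    return (n - 1) * interval + depth if n > 0 else 0


def sequential_cycles(n, depth=3):
    """Cycles when each iteration finishes before the next one starts."""
    return n * depth
```

Under these assumptions, 100 iterations take `pipelined_cycles(100)` = 102 cycles instead of `sequential_cycles(100)` = 300, while the kernel remains a single short loop body, which is the compact-code property the abstract refers to.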

References

  1. S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768--1810, November 1994.
  2. Paul Feautrier. Fine-grain scheduling under resource constraints. In The 7th Annual Workshop on Languages and Compilers for Parallel Computing, pages 1--15, 1994.
  3. R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85--94, November 1994.
  4. R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258--267, June 1993.
  5. P. G. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg. The Multiflow trace scheduling compiler. The Journal of Supercomputing, 7(1--2):51--142, 1993.
  6. Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. In Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29--42, Charleston, South Carolina, 1993.
  7. B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183--198, October 1981.
  8. B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, 22(1):12--35, January 1989.
  9. J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: optimal vs. heuristic methods in a production compiler. In Proceedings of the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 1--11, 1996.
  10. M. S. Schlansker and B. R. Rau. EPIC: explicitly parallel instruction computing. IEEE Computer, 33(2):37--45, February 2000.
  11. M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 274--286, December 1996.
  12. Aiken, A. and Nicolau, A. Perfect Pipelining: A New Loop Parallelization Technique. Cornell University, Oct., 1987.
  13. Annaratone, M., Bitz, F., Clune E., Kung H. T., Maulik, P., Ribas, H., Tseng, P., and Webb, J. Applications of Warp. Proc. Compcon Spring 87, San Francisco, Feb., 1987, pp. 272--275.
  14. Annaratone, M., Bitz, F., Deutch, J., Hamey, L., Kung, H. T., Maulik P. C., Tseng, P., and Webb, J. A. Applications Experience on Warp. Proc. 1987 National Computer Conference, AFIPS, Chicago, June, 1987, pp. 149--158.
  15. Annaratone, M., Arnould, E., Gross, T., Kung, H. T., Lam, M., Menzilcioglu, O. and Webb, J. A. "The Warp Computer: Architecture, Implementation and Performance". IEEE Transactions on Computers C-36, 12 (December 1987).
  16. Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 180--192.
  17. Dantzig, G. B., Blattner, W. O. and Rao, M. R. All Shortest Routes from a Fixed Origin in a Graph. Theory of Graphs, Rome, July, 1967, pp. 85--90.
  18. Ebcioglu, Kemal. A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987.
  19. Ellis, John R. Bulldog: A Compiler for VLIW Architectures. Ph.D. Th., Yale University, 1985.
  20. Fisher, J. A. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. Th., New York Univ., Oct. 1979.
  21. Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction". IEEE Trans. on Computers C-30, 7 (July 1981), 478--490.
  22. Fisher, J. A., Ellis, J. R., Ruttenberg, J. C. and Nicolau, A. Parallel Processing: A Smart Compiler and a Dumb Machine. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, Montreal, Canada, June, 1984, pp. 37--47.
  23. Fisher, J. A., Landskov, D. and Shriver, B. D. Microcode Compaction: Looking Backward and Looking Forward. Proc. 1981 National Computer Conference, 1981, pp. 95--102.
  24. Floyd, R. W. "Algorithm 97: Shortest Path". Comm. ACM 5, 6 (1962), 345.
  25. Garey, Michael R. and Johnson, David S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
  26. Gross, T. and Lam, M. Compilation for a High-performance Systolic Array. Proc. ACM SIGPLAN 86 Symposium on Compiler
  27. Hsu, Peter. Highly Concurrent Scalar Processing. Ph.D. Th., University of Illinois at Urbana-Champaign, 1986.
  28. Isoda, Sadahiro, Kobayashi, Yoshizumi, and Ishida, Toru. "Global Compaction of Horizontal Microprograms Based on the Generalized Data Dependency Graph". IEEE Trans. on Computers C-32, 10 (October 1983), 922--933.
  29. Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B. and Wolfe, M. Dependence Graphs and Compiler Optimizations. Proc. ACM Symposium on Principles of Programming Languages, January, 1981, pp. 207--218.
  30. Lah, J. and Atkin, E. Tree Compaction of Microprograms. Proc. 16th Annual Workshop on Microprogramming, Oct., 1982, pp. 23--33.
  31. Lam, Monica. Compiler Optimizations for Asynchronous Systolic Array Programs. Proc. Fifteenth Annual ACM Symposium on Principles of Programming Languages, Jan., 1988.
  32. Lam, Monica. A Systolic Array Optimizing Compiler. Ph.D. Th., Carnegie Mellon University, May 1987.
  33. Linn, Joseph L. SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information. Proc. 16th Annual Workshop on Microprogramming, 1983, pp. 11--22.
  34. McMahon, F. H. Lawrence Livermore National Laboratory FORTRAN Kernels: MFLOPS.
  35. Patel, Janak H. and Davidson, Edward S. Improving the Throughput of a Pipeline by Insertion of Delays. Proc. 3rd Annual Symposium on Computer Architecture, Jan., 1976, pp. 159--164.
  36. Rau, B. R. and Glaeser, C. D. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proc. 14th Annual Workshop on Microprogramming, Oct., 1981, pp. 183--198.
  37. Su, B., Ding, S. and Jin, L. An Improvement of Trace Scheduling for Global Microcode Compaction. Proc. 17th Annual Workshop in Microprogramming, Dec., 1984, pp. 78--85.
  38. Su, B., Ding, S., Wang, J. and Xia, J. GURPR -- A Method for Global Software Pipelining. Proc. 20th Annual Workshop on Microprogramming, Dec., 1987, pp. 88--96.
  39. Su, B., Ding, S. and Xia, J. URPR -- An Extension of URCR for Software Pipeline. Proc. 19th Annual Workshop on Microprogramming, Oct., 1986, pp. 104--108.
  40. Tarjan, R. E. "Depth first search and linear graph algorithms". SIAM J. Computing 1, 2 (1972), 146--160.
  41. Touzeau, R. F. A Fortran Compiler for the FPS-164 Scientific Computer. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, June, 1984, pp. 48--57.
  42. Weiss, S. and Smith, J. E. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct., 1987, pp. 105--109.
  43. Wood, Graham. Global Optimization of Microprograms Through Modular Control Constructs. Proc. 12th Annual Workshop in Microprogramming, 1979, pp. 1--6.

          • Published in

            ACM SIGPLAN Notices  Volume 39, Issue 4
            20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999: A Selection
            April 2004
            673 pages
            ISSN:0362-1340
            EISSN:1558-1160
            DOI:10.1145/989393
            Copyright © 2004 Author

            Publisher

            Association for Computing Machinery

            New York, NY, United States
