Abstract
This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations and reusing the output if the same input is encountered again. This technique is especially suitable for Multi-Media (MM) processing. In MM applications the local entropy of the data tends to be low, which results in repeated operations on the same datum. The inputs and outputs of assembly-level operations are stored in cache-like lookup tables and accessed in parallel with the conventional computation. A successful lookup gives the result of a multi-cycle computation in a single cycle, and a failed lookup does not necessitate a penalty in computation time. Results of simulations have shown that, on average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions in Multi-Media applications can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
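The memoing idea described in the abstract can be illustrated with a small software model. The sketch below is not the hardware design from the paper: the class name `MemoTable`, the default table size, the direct-mapped organization, and the index function are all assumptions chosen for illustration. It also performs the lookup before the computation, whereas the abstract describes a lookup that runs in parallel with the conventional unit.

```python
import operator
import struct


class MemoTable:
    """Illustrative direct-mapped memo table for a two-operand FP operation."""

    def __init__(self, entries=32):
        # The table size is an arbitrary illustrative choice, not a figure from the paper.
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (a_bits, b_bits, result)
        self.lookups = 0
        self.hits = 0

    @staticmethod
    def _bits(x):
        # Raw IEEE 754 double-precision bit pattern, so operands are compared exactly.
        return struct.unpack("<Q", struct.pack("<d", x))[0]

    def _index(self, a_bits, b_bits):
        # Hypothetical index function: fold the XOR of both operands down to a slot number.
        x = a_bits ^ b_bits
        x ^= x >> 32
        x ^= x >> 16
        x ^= x >> 8
        return x % self.entries

    def compute(self, a, b, op):
        a_bits, b_bits = self._bits(a), self._bits(b)
        idx = self._index(a_bits, b_bits)
        self.lookups += 1
        slot = self.slots[idx]
        if slot is not None and slot[0] == a_bits and slot[1] == b_bits:
            # Hit: reuse the stored result instead of recomputing it.
            self.hits += 1
            return slot[2]
        # Miss: compute normally and overwrite whatever occupied the slot.
        result = op(a, b)
        self.slots[idx] = (a_bits, b_bits, result)
        return result


# Usage: low-entropy data (many repeated values) scaled by a constant.
table = MemoTable(entries=32)
pixels = [0.5, 0.5, 0.5, 0.25, 0.25, 0.5]
scaled = [table.compute(p, 3.0, operator.mul) for p in pixels]
print(scaled, table.hits, table.lookups)
```

A miss costs only the ordinary computation, which mirrors the abstract's point that a failed lookup adds no penalty. How often a lookup hits depends on the data and on the indexing scheme, both of which are placeholders here.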
Index Terms
- Accelerating multi-media processing by implementing memoing in multiplication and division units
ASPLOS VIII: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems