Abstract
General-purpose processors, while tremendously versatile, pay a huge cost for their flexibility, wasting over 99% of their energy on programmability overheads. We observe that reducing this waste requires tuning data storage and compute structures, and their connectivity, to the data-flow and data-locality patterns in the algorithms. Hence, by backing off from full programmability and instead targeting the key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications within that domain.
We present the Convolution Engine (CE), a programmable processor specialized for the convolution-like data-flow prevalent in computational photography, computer vision, and video processing. The CE achieves energy efficiency by capturing data-reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We demonstrate that the CE is within a factor of 2--3× of the energy and area efficiency of custom units optimized for a single kernel. For most image processing applications, the CE improves energy and area efficiency by 8--15× over data-parallel Single Instruction Multiple Data (SIMD) engines.
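The data-reuse idea in the abstract can be illustrated with a minimal software sketch. This is not the CE's hardware design or instruction set; it is a hypothetical 1D example in which a small shift-register-like window (the `window` deque) keeps recent samples local, so each input is loaded once and then reused by several outputs. The function name `sliding_convolution` and the `loads`/`macs` counters are assumptions introduced only for illustration.

```python
from collections import deque


def sliding_convolution(signal, taps):
    """Apply a 1D filter (correlation form) using a sliding window.

    Returns (outputs, loads, macs). Each input sample is loaded once
    into the window and then reused by up to len(taps) outputs.
    """
    k = len(taps)
    window = deque(maxlen=k)  # models a small shift register
    outputs = []
    loads = 0  # number of input samples fetched
    macs = 0   # number of multiply-accumulate operations
    for sample in signal:
        window.append(sample)  # one load per input sample
        loads += 1
        if len(window) == k:
            # "map" (multiply) then "reduce" (sum) over the window
            outputs.append(sum(w * t for w, t in zip(window, taps)))
            macs += k
    return outputs, loads, macs


outputs, loads, macs = sliding_convolution([1, 2, 3, 4, 5], [1, 0, -1])
print(outputs, loads, macs)
```

In this toy case the ratio `macs / loads` grows with the filter length, which is the sense in which keeping the window local allows more operations per memory access.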
Convolution engine: balancing efficiency & flexibility in specialized computing
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture