ABSTRACT
Numerical simulations often generate large amounts of data that must be stored or sent to other compute nodes. This paper investigates whether GPUs are powerful enough to make real-time data compression and decompression possible in such environments, that is, whether they can operate at the 32- or 40-Gb/s throughput of emerging network cards. The fastest parallel CPU-based floating-point compression algorithm operates below 20 Gb/s on eight Xeon cores, which is significantly slower than the network speed and thus insufficient for compression to be practical in high-end networks. As a remedy, we have created GFC, a highly parallel compression algorithm for double-precision floating-point data that is specifically designed for GPUs. It compresses at a minimum of 75 Gb/s, decompresses at 90 Gb/s and above, and can therefore improve internode communication throughput on current and upcoming networks by fully saturating the interconnection links with compressed data.
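The abstract does not show the GFC kernel itself. As a rough illustration of the value-prediction plus leading-zero-byte-suppression idea that this family of floating-point compressors builds on, the following is a minimal sequential C sketch: each double is predicted by its predecessor, the XOR of the two bit patterns is computed, and only the nonzero low-order bytes of that residual are emitted behind a small header. The XOR predictor, the one-byte headers, and the function names are simplifications for illustration, not the paper's actual algorithm.

```c
#include <stdint.h>
#include <string.h>

/* Count how many of the most-significant bytes of v are zero (0..8). */
static int leading_zero_bytes(uint64_t v) {
    int n = 0;
    for (int i = 7; i >= 0; i--) {
        if ((v >> (8 * i)) & 0xFF) break;
        n++;
    }
    return n;
}

/* Compress n doubles into out; returns the number of bytes written.
   Residual = bits XOR previous bits; smooth data yields residuals with
   many leading zero bytes, which are suppressed. */
size_t compress_sketch(const double *data, size_t n, uint8_t *out) {
    uint64_t prev = 0;
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &data[i], 8);   /* reinterpret double as raw bits */
        uint64_t delta = bits ^ prev; /* residual vs. predecessor */
        prev = bits;
        int z = leading_zero_bytes(delta);
        out[pos++] = (uint8_t)z;      /* header: zero-byte count */
        for (int b = 7 - z; b >= 0; b--)
            out[pos++] = (uint8_t)(delta >> (8 * b)); /* payload bytes */
    }
    return pos;
}

/* Inverse transform: rebuild each double from header + payload bytes. */
size_t decompress_sketch(const uint8_t *in, size_t n, double *out) {
    uint64_t prev = 0;
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        int z = in[pos++];
        uint64_t delta = 0;
        for (int b = 0; b < 8 - z; b++)
            delta = (delta << 8) | in[pos++];
        uint64_t bits = delta ^ prev;
        prev = bits;
        memcpy(&out[i], &bits, 8);
    }
    return pos;
}
```

In the GPU setting described by the paper, the input is instead split into many independent chunks that are processed concurrently (and the per-value headers are packed more tightly), which is what makes the massively parallel throughput figures possible; the sketch above only conveys the per-value transform.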