Optimizing dataflow applications on heterogeneous environments

Teodoro, George; Hartley, Timothy D. R.; Catalyurek, Umit V.; Ferreira, Renato

doi:10.1007/s10586-010-0151-6

Published: 24 March 2011

Cluster Computing, volume 15, pages 125–144 (2012)

George Teodoro¹, Timothy D. R. Hartley², Umit V. Catalyurek² & Renato Ferreira¹

Abstract

Increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most current tools for developing parallel applications for hierarchical systems concentrate on a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that using all of the heterogeneous computing resources can significantly improve application performance. Our approach, which optimizes applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. Experimental results with a complex, real-world biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
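
To make the idea of coordinating task execution across different processor types concrete, the following is a minimal sketch and not the scheduling method of the article, whose details the abstract does not give. It assumes a simple greedy earliest-finish-time policy: each task is assigned to whichever processing unit would complete it soonest. The function name `makespan`, the task costs, and the processor speeds are all hypothetical values chosen for illustration.

```python
def makespan(task_costs, worker_speeds):
    """Greedy earliest-finish-time assignment (illustrative assumption only).

    task_costs: work units per task (hypothetical values).
    worker_speeds: work units per second for each processing unit (hypothetical).
    Returns the time at which the last processing unit becomes idle.
    """
    # Every processing unit starts idle at time 0.
    free_at = [0.0] * len(worker_speeds)
    for cost in task_costs:
        # Time at which each unit would finish this task if it received it.
        finish_times = [free_at[i] + cost / speed
                        for i, speed in enumerate(worker_speeds)]
        # Give the task to the unit that would finish it earliest.
        best = min(range(len(worker_speeds)), key=finish_times.__getitem__)
        free_at[best] = finish_times[best]
    return max(free_at)


tasks = [8.0] * 12                           # twelve equal tasks
gpu_only = makespan(tasks, [8.0])            # one fast unit alone -> 12.0
cpu_gpu = makespan(tasks, [8.0, 1.0, 1.0])   # fast unit plus two slow units -> 10.0
print(gpu_only, cpu_gpu)
```

In this toy setting the slower units each take one task off the fast unit, so the combined configuration finishes earlier than the fast unit alone. The numbers say nothing about the speedups reported in the article.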

Author information

Authors and Affiliations

Dept. of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
George Teodoro & Renato Ferreira
Depts. of Biomedical Informatics, and Electrical & Computer Engineering, The Ohio State University, Columbus, OH, USA
Timothy D. R. Hartley & Umit V. Catalyurek

Corresponding author

Correspondence to George Teodoro.

About this article

Cite this article

Teodoro, G., Hartley, T.D.R., Catalyurek, U.V. et al. Optimizing dataflow applications on heterogeneous environments. Cluster Comput 15, 125–144 (2012). https://doi.org/10.1007/s10586-010-0151-6

Received: 20 September 2010
Accepted: 29 December 2010
Published: 24 March 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10586-010-0151-6
