Skip to main content
Log in

Optimizing dataflow applications on heterogeneous environments

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Arpaci-Dusseau, R.H., Anderson, E., Treuhaft, N., Culler, D.E., Hellerstein, J.M., Patterson, D., Yelick, K.: Cluster I/O with river: making the fast case common. In: IOPADS ’99: Input/Output for Parallel and Distributed Systems (1999)

    Google Scholar 

  2. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: Starpu: A unified platform for task scheduling on heterogeneous multicore architectures. In: Euro-Par ’09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 863–874 (2009)

    Google Scholar 

  3. Berman, F.D., Wolski, R., Figueira, S., Schopf, J., Shao, G.: Application-level scheduling on distributed heterogeneous networks. In: Supercomputing ’96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, p. 39 (1996)

    Chapter  Google Scholar 

  4. Beynon, M., Ferreira, R., Kurc, T.M., Sussman, A., Saltz, J.H.: DataCutter: middleware for filtering very large scientific datasets on archival storage systems. In: IEEE Symposium on Mass Storage Systems, pp. 119–134 (2000)

    Google Scholar 

  5. Beynon, M.D., Kurc, T., Catalyurek, U., Chang, C., Sussman, A., Saltz, J.: Distributed processing of very large datasets with DataCutter. Parallel Comput. 27(11), 1457–1478 (2001)

    Article  MATH  Google Scholar 

  6. Bhatti, N.T., Hiltunen, M.A., Schlichting, R.D., Chiu, W.: Coyote: a system for constructing fine-grain configurable communication services. ACM Trans. Comput. Syst. 16(4), 321–366 (1998)

    Article  Google Scholar 

  7. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for gpus: stream computing on graphics hardware. ACM Trans. Graph. 23(3), 777–786 (2004)

    Article  Google Scholar 

  8. Catalyurek, U., Beynon, M.D., Chang, C., Kurc, T., Sussman, A., Saltz, J.: The virtual microscope. IEEE Trans. Inf. Technol. Biomed. 7(4), 230–248 (2003)

    Article  Google Scholar 

  9. Fahringer, T., Zima, H.P.: A static parameter based performance prediction tool for parallel programs. In: ICS ’93: Proceedings of the 7th International Conference on Supercomputing, pp. 207–219 (1993)

    Chapter  Google Scholar 

  10. Fix, E., Hodges, J.: Discriminatory analysis, nonparametric discrimination, consistency properties. Computer science technical report, School of Aviation Medicine, Randolph Field, Texas (1951)

  11. Hartley, T.D., Catalyurek, U.V., Ruiz, A., Ujaldon, M., Igual, F., Mayo, R.: Biomedical image analysis on a cooperative cluster of gpus and multicores. In: 22nd ACM Intl. Conference on Supercomputing (2008)

    Google Scholar 

  12. He, B., Fang, W., Luo, Q., Govindaraju, N.K., Wang, T.: Mars: A mapreduce framework on graphics processors. In: Parallel Architectures and Compilation Techniques (2008)

    Google Scholar 

  13. Hoppe, H.: View-dependent refinement of progressive meshes. In: SIGGRAPH 97 Proc., pp. 189–198 (1997). http://research.microsoft.com/hoppe/

    Chapter  Google Scholar 

  14. Hsu, C.H., Chen, T.L., Li, K.C.: Performance effective pre-scheduling strategy for heterogeneous grid systems in the master slave paradigm. Future Gener. Comput. Syst. (2007)

  15. Iverson, M., Ozguner, F., Follen, G.: Parallelizing existing applications in a distributed heterogeneous environment. In: 4th Heterogeneous Computing Workshop (HCW’95) (1995)

    Google Scholar 

  16. Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Supercomputing ’01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), pp. 37–37 (2001)

    Chapter  Google Scholar 

  17. Kurc, T., Lee, F., Agrawal, G., Catalyurek, U., Ferreira, R., Saltz, J.: Optimizing reduction computations in a distributed environment. In: SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, p. 9 (2003)

    Chapter  Google Scholar 

  18. Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: PPoPP ’09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 101–110 (2009)

    Google Scholar 

  19. Linderman, M.D., Collins, J.D., Wang, H., Meng, T.H.: Merge: a programming model for heterogeneous multi-core systems. ACM SIGPLAN Not. 43(3), 287–296 (2008)

    Article  Google Scholar 

  20. Low, S., Peterson, L., Wang, L.: Understanding tcp vegas: a duality model. In: Proceedings of ACM Sigmetrics (2001)

    Google Scholar 

  21. Luk, C.K., Hong, S., Kim, H.: Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: 42nd International Symposium on Microarchitecture (MICRO) (2009)

    Google Scholar 

  22. Maes, F., Vandermeulen, D., Suetens, P.: Comparative evaluation of multiresolution optimization strategies for multimodality image registration by maximization of mutual information. Med. Image Anal. 3(4), 373–386 (1999)

    Article  Google Scholar 

  23. NVIDIA: NVIDIA CUDA SDK (2007). http://nvidia.com/cuda

  24. O’Malley, S.W., Peterson, L.L.: A dynamic network architecture. ACM Trans. Comput. Syst. 10(2) (1992)

  25. Patkar, N., Katsuno, A., Li, S., Maruyama, T., Savkar, S., Simone, M., Shen, G., Swami, R., Tovey, D.: Microarchitecture of hal’s cpu. In: IEEE International Computer Conference, p. 259 (1995)

    Google Scholar 

  26. Ramanujam, J.: Toward automatic parallelization and auto-tuning of affine kernels for gpus. In: Workshop on Automatic Tuning for Petascale Systems (2008)

    Google Scholar 

  27. Rocha, B.M., Campos, F.O., Plank, G., dos Santos, R.W., Liebmann4, M., Haase, G.: Simulations of the electrical activity in the heart with graphic processing units. Accepted for publication in Eighth International Conference on Parallel Processing and Applied Mathematics (2009)

  28. Rosenfeld, A. (ed.): Multiresolution Image Processing and Analysis. Springer, Berlin (1984)

    MATH  Google Scholar 

  29. Ruiz, A., Sertel, O., Ujaldon, M., Catalyurek, U., Saltz, J., Gurcan, M.: Pathological image analysis using the gpu: Stroma classification for neuroblastoma. In: Proc. of IEEE Int. Conf. on Bioinformatics and Biomedicine (2007)

    Google Scholar 

  30. Sancho, J.C., Kerbyson, D.J.: Analysis of double buffering on two different multicore architectures: quad-core opteron and the Cell-BE. In: International Parallel and Distributed Processing Symposium (IPDPS) (2008)

    Google Scholar 

  31. Sertel, O., Kong, J., Shimada, H., Catalyurek, U.V., Saltz, J.H., Gurcan, M.N.: Computer-aided prognosis of neuroblastoma on whole-slide images: classification of stromal development. Pattern Recognit. 42(6) (2009)

  32. Shimada, H., Ambros, I.M., Dehner, L.P., Ichi Hata, J., Joshi, V.V., Roald, B.: Terminology and morphologic criteria of neuroblastic tumors: recommendation by the international neuroblastoma pathology committee. Cancer 86(2) (1999)

  33. Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In: SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009)

    Google Scholar 

  34. Sundaram, N., Raghunathan, A., Chakradhar, S.T.: A framework for efficient and scalable execution of domain-specific templates on gpus. In: IPDPS ’09: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–12. (2009)

    Google Scholar 

  35. Tavares, T., Teodoro, G., Kurc, T., Ferreira, R., Guedes, D., Meira, W.J., Catalyurek, U., Hastings, S., Oster, S., Langella, S., Saltz, J.: An efficient and reliable scientific workflow system. In: IEEE International Symposium on Cluster Computing and the Grid, pp. 445–452 (2007)

    Google Scholar 

  36. Teodoro, G., Fireman, D., Guedes, D. Jr., Ferreira, R.: Achieving multi-level parallelism in filter-labeled stream programming model. In: The 37th International Conference on Parallel Processing (ICPP) (2008)

    Google Scholar 

  37. Teodoro, G., Hartley, T.D.R., Catalyurek, U., Ferreira, R.: Run-time optimizations for replicated dataflows on heterogeneous environments. In: Proc. of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC) (2010)

    Google Scholar 

  38. Teodoro, G., Sachetto, R., Fireman, D., Guedes, D., Ferreira, R.: Exploiting computational resources in distributed heterogeneous platforms. In: 21st International Symposium on Computer Architecture and High Performance Computing, pp. 83–90 (2009)

    Chapter  Google Scholar 

  39. Teodoro, G., Sachetto, R., Sertel, O., Gurcan, M. Jr., Catalyurek, U., Ferreira, R.: Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In: IEEE Cluster (2009)

    Google Scholar 

  40. Teodoro, G., Tavares, T., Ferreira, R., Kurc, T., Meira, W., Guedes, D., Pan, T., Saltz, J.: Run-time support for efficient execution of scientific workflows on distributed environmments. In: International Symposium on Computer Architecture and High Performance Computing, Ouro Preto, Brazil (2006)

    Google Scholar 

  41. Vrsalovic, D.F., Siewiorek, D.P., Segall, Z.Z., Gehringer, E.F.: Performance prediction and calibration for a class of multiprocessors. IEEE Trans. Comput. 37(11) (1988)

  42. Welsh, M., Culler, D., Brewer, E.: Seda: an architecture for well-conditioned, scalable internet services. SIGOPS Oper. Syst. Rev. 35(5), 230–243 (2001)

    Article  Google Scholar 

  43. Woods, B., Clymer, B., Saltz, J., Kurc, T.: A parallel implementation of 4-dimensional haralick texture analysis for disk-resident image datasets. In: SC ’04: Proceedings of the 204 ACM/IEEE Conference on Supercomputing (2004)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to George Teodoro.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Teodoro, G., Hartley, T.D.R., Catalyurek, U.V. et al. Optimizing dataflow applications on heterogeneous environments. Cluster Comput 15, 125–144 (2012). https://doi.org/10.1007/s10586-010-0151-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-010-0151-6

Keywords

Navigation