ABSTRACT
In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance parallel computing (HPC) resources on which it typically runs with distributed, heterogeneous resources obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in using the combined local and distributed resources effectively. We address this bottleneck by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies, including partitioning of computation (over HPC and distributed resources) and job clustering.
We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final submission of this paper.
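The partitioning trade-off described above can be illustrated with a minimal sketch. It is not the model used in the paper. It assumes identical, independent tasks, a fixed number of remote cores (so the processor acquisition rate is ignored), and data transfer that fully overlaps with remote computation. The names `Workload`, `makespan`, and `best_partition`, and all parameter values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tasks: int          # number of independent, identical tasks (assumption)
    cpu_seconds: float  # compute time per task on one core
    input_mb: float     # data staged to the remote site per task


def makespan(w, offloaded, hpc_cores, remote_cores, throughput_mb_s):
    """Estimated time-to-solution when `offloaded` tasks run remotely."""
    # Work left on the HPC system, spread evenly over its cores.
    local = (w.tasks - offloaded) * w.cpu_seconds / hpc_cores
    # Remote side: bounded by either computation or data movement,
    # assuming the two overlap perfectly.
    remote_compute = offloaded * w.cpu_seconds / remote_cores
    remote_transfer = offloaded * w.input_mb / throughput_mb_s
    remote = max(remote_compute, remote_transfer)
    # Both partitions run concurrently; the slower one decides.
    return max(local, remote)


def best_partition(w, hpc_cores, remote_cores, throughput_mb_s):
    """Number of offloaded tasks that minimizes the estimated makespan."""
    return min(
        range(w.tasks + 1),
        key=lambda k: makespan(w, k, hpc_cores, remote_cores, throughput_mb_s),
    )


if __name__ == "__main__":
    w = Workload(tasks=100, cpu_seconds=10.0, input_mb=1.0)
    for throughput in (1000.0, 0.05):  # fast link vs. slow link, in MB/s
        k = best_partition(w, 10, 10, throughput)
        print(throughput, k, makespan(w, k, 10, 10, throughput))
```

Under these assumptions, a fast link lets the work split evenly between equally sized resources, while a slow link makes data movement the limiting factor and sharply reduces the number of tasks worth offloading.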
Index Terms
- Job and data clustering for aggregate use of multiple production cyberinfrastructures