DOI: 10.1145/2807591.2807650
research-article

Smart: a MapReduce-like framework for in-situ scientific analytics

Published: 15 November 2015

ABSTRACT

In-situ analytics has recently been shown to be an effective approach to reducing both I/O and storage costs for scientific analytics. Developing an efficient in-situ implementation, however, involves many challenges, including parallelization, data movement or sharing, and resource allocation. Based on the premise that MapReduce can be an appropriate API for specifying scientific analytics applications, we present a novel MapReduce-like framework that supports efficient in-situ scientific analytics, and address several challenges that arise in applying the MapReduce idea to in-situ processing. Specifically, our implementation can load simulated data directly from distributed memory, and it uses a modified API that helps meet the strict memory constraints of in-situ analytics. The framework is designed so that analytics can be launched from the parallel code region of a simulation program. We have developed both time sharing and space sharing modes to maximize performance in different scenarios, with the former even avoiding any copying of data from the simulation to the analytics program. We demonstrate the functionality, efficiency, and scalability of our system using different simulation and analytics programs, executed on clusters with multi-core and many-core nodes.
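
To make two of the ideas above concrete, the following is a minimal sketch and not Smart's actual API, which the abstract does not describe. It assumes that memory pressure is kept low by folding each value straight into a per-key accumulator instead of emitting intermediate key-value pairs, and it imitates the copy-free time sharing mode with Python's `memoryview`. The names `histogram_in_place`, `merge`, and `simulated` are hypothetical.

```python
from array import array
from collections import Counter


def histogram_in_place(buf, lo, hi, bins):
    """Accumulate a histogram directly from a buffer, without copying it.

    Each value is mapped to a bin key and folded straight into a
    per-key counter, so no intermediate (key, value) list is built.
    """
    width = (hi - lo) / bins
    counts = Counter()
    for x in buf:
        # Clamp so that x == hi falls in the last bin.
        key = min(int((x - lo) / width), bins - 1)
        counts[key] += 1
    return counts


def merge(partials):
    """Combine per-partition results (the global reduction step)."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


# Hypothetical stand-in for one time step of simulation output.
simulated = array("d", [0.1, 0.4, 0.45, 0.9, 1.0, 0.05, 0.6, 0.75])

# memoryview slices share memory with `simulated`; nothing is copied.
view = memoryview(simulated)
partitions = [view[:4], view[4:]]

result = merge(histogram_in_place(p, 0.0, 1.0, 4) for p in partitions)
print(sorted(result.items()))  # [(0, 2), (1, 2), (2, 1), (3, 3)]
```

In this sketch the analytics state grows with the number of bins rather than with the size of the simulated data, which is the property a memory-constrained in-situ setting needs. The two partitions stand in for data held by separate threads or processes.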


Published in

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions, 22%. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.
