ABSTRACT
In-situ analytics has recently been shown to be an effective way to reduce both I/O and storage costs for scientific analytics. Developing an efficient in-situ implementation, however, involves many challenges, including parallelization, data movement or sharing, and resource allocation. Based on the premise that MapReduce can be an appropriate API for specifying scientific analytics applications, we present a novel MapReduce-like framework that supports efficient in-situ scientific analytics, and we address several challenges that arise in applying the MapReduce idea to in-situ processing. Specifically, our implementation can load simulated data directly from distributed memory, and it uses a modified API that helps meet the strict memory constraints of in-situ analytics. The framework is designed so that analytics can be launched from the parallel code region of a simulation program. We have developed both time sharing and space sharing modes to maximize performance in different scenarios, with the former avoiding any copying of data from the simulation to the analytics program. We demonstrate the functionality, efficiency, and scalability of our system using different simulation and analytics programs executed on clusters with multi-core and many-core nodes.
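To make the time sharing idea more concrete, here is a minimal Python sketch of a MapReduce-like analytics step that reads a simulation's in-memory output without copying it. It is an illustration under stated assumptions, not the framework's actual API: the names `Histogram`, `gen_key`, `accumulate`, `merge`, and `run_in_situ` are hypothetical, and the choice to fold each element into a small running result instead of storing intermediate key-value pairs is one plausible way a modified API could limit memory use, not something the abstract specifies.

```python
from array import array


class Histogram:
    """Hypothetical analytics task: counts values in fixed-width buckets."""

    def __init__(self, lo, hi, buckets):
        self.lo = lo
        self.buckets = buckets
        self.width = (hi - lo) / buckets

    def gen_key(self, value):
        # Map step: pick the bucket (key) for one element, clamped to range.
        k = int((value - self.lo) / self.width)
        return min(max(k, 0), self.buckets - 1)

    def accumulate(self, state, key, value):
        # Reduce step applied immediately (assumption): no intermediate
        # key-value pairs are buffered, so memory grows with the number
        # of keys, not with the number of input elements.
        state[key] = state.get(key, 0) + 1

    def merge(self, a, b):
        # Combine partial results from two splits (or two workers).
        for k, v in b.items():
            a[k] = a.get(k, 0) + v
        return a


def run_in_situ(task, time_step, num_splits=2):
    """Time-sharing sketch: analyze the simulation's buffer in place."""
    view = memoryview(time_step)  # zero-copy view of the simulation buffer
    chunk = (len(view) + num_splits - 1) // num_splits
    result = {}
    for start in range(0, len(view), chunk):
        partial = {}
        # Slicing a memoryview yields another view; no data is copied.
        for value in view[start:start + chunk]:
            task.accumulate(partial, task.gen_key(value), value)
        result = task.merge(result, partial)
    return result


# Stand-in for one time step of simulation output held in memory.
time_step = array("d", [0.1, 0.4, 0.45, 0.8, 0.95, 0.2])
counts = run_in_situ(Histogram(0.0, 1.0, 4), time_step)
print(counts)
```

In this sketch the analytics call would sit inside the simulation's time-step loop, so simulation and analytics take turns on the same cores and share one buffer. A space sharing variant would instead hand the data to analytics running on separate cores, which generally requires some form of buffering or copying.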
Index Terms
- Smart: a MapReduce-like framework for in-situ scientific analytics