research-article

Smart: a MapReduce-like framework for in-situ scientific analytics

Authors:
Yi Wang, The Ohio State University, Columbus, OH
Gagan Agrawal, The Ohio State University, Columbus, OH
Tekin Bicer, Argonne National Laboratory, Lemont, IL
Wei Jiang, Quantcast Corp.

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2015, Article No.: 51, Pages 1–12. https://doi.org/10.1145/2807591.2807650

Published: 15 November 2015

ABSTRACT

In-situ analytics has lately been shown to be an effective approach to reducing both I/O and storage costs for scientific analytics. Developing an efficient in-situ implementation, however, involves many challenges, including parallelization, data movement or sharing, and resource allocation. Based on the premise that MapReduce can be an appropriate API for specifying scientific analytics applications, we present a novel MapReduce-like framework that supports efficient in-situ scientific analytics, and we address several challenges that arise in applying the MapReduce idea to in-situ processing. Specifically, our implementation can load simulated data directly from distributed memory, and it uses a modified API that helps meet the strict memory constraints of in-situ analytics. The framework is designed so that analytics can be launched from the parallel code region of a simulation program. We have developed both time sharing and space sharing modes for maximizing performance in different scenarios, with the former even avoiding any copying of data from the simulation to the analytics program. We demonstrate the functionality, efficiency, and scalability of our system using different simulation and analytics programs, executed on clusters with multi-core and many-core nodes.
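The abstract does not spell out the modified API, so the following is only an illustrative sketch of one way a MapReduce-like interface can keep memory use low: each input element is folded straight into a per-key accumulator, and no intermediate key-value pairs are stored. All names here (`in_place_reduce`, `gen_key`, `accumulate`, `merge`, `histogram_key`) are hypothetical and are not taken from the paper; a plain Python list stands in for a simulation's in-memory output.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def in_place_reduce(
    chunks: Iterable[List[float]],
    gen_key: Callable[[float], Hashable],
    accumulate: Callable[[int, float], int],
) -> Dict[Hashable, int]:
    """Fold every element directly into a per-key accumulator.

    No intermediate (key, value) list is built, so the extra memory grows
    with the number of distinct keys rather than with the input size.
    """
    reduction_map: Dict[Hashable, int] = defaultdict(int)
    for chunk in chunks:        # e.g. one chunk per simulation time step
        for value in chunk:     # read the producer's buffer without copying it
            key = gen_key(value)
            reduction_map[key] = accumulate(reduction_map[key], value)
    return dict(reduction_map)


def merge(a: Dict[Hashable, int], b: Dict[Hashable, int]) -> Dict[Hashable, int]:
    """Combine partial results produced by two workers."""
    merged = dict(a)
    for key, partial in b.items():
        merged[key] = merged.get(key, 0) + partial
    return merged


def histogram_key(value: float, bucket_width: float = 10.0) -> int:
    """Hypothetical analytics: map a value to its histogram bucket index."""
    return int(value // bucket_width)


def count(acc: int, _value: float) -> int:
    """Accumulate by counting how many values fall into the bucket."""
    return acc + 1


# Two made-up time steps, each reduced by a different worker, then merged.
step_1 = [1.0, 12.5, 13.0, 27.9]
step_2 = [3.2, 14.1, 99.0]
part_a = in_place_reduce([step_1], histogram_key, count)
part_b = in_place_reduce([step_2], histogram_key, count)
result = merge(part_a, part_b)
print(result)
```

In this sketch the analytics code only reads the lists it is handed, which loosely mirrors the time sharing idea of analyzing simulation output without copying it first. How the actual framework implements that is not described in the abstract.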

Published in
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern, University of Illinois at Urbana-Champaign, Urbana, Illinois
Program Chair: Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SC '15 Paper Acceptance Rate: 79 of 358 submissions, 22%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%