
Performance evaluation of a MongoDB and hadoop platform for scientific data analysis

Authors:
Elif Dede, Binghamton University, Binghamton, NY, USA
Madhusudhan Govindaraju, Binghamton University, Binghamton, NY, USA
Daniel Gunter, Lawrence Berkeley National Lab, Berkeley, CA, USA
Richard Shane Canon, Lawrence Berkeley National Lab, Berkeley, CA, USA
Lavanya Ramakrishnan, Lawrence Berkeley National Lab, Berkeley, CA, USA

Science Cloud '13: Proceedings of the 4th ACM workshop on Scientific cloud computing, June 2013, Pages 13–20
https://doi.org/10.1145/2465848.2465849

Published: 17 June 2013

ABSTRACT

Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. Both MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open source implementation of MapReduce, has been evaluated, used, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project use MongoDB, a document-oriented NoSQL store. However, there is limited understanding of the performance trade-offs of using these two technologies together. In this paper, we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, with the goal of identifying the right software environment for scientific data analysis.

Index Terms

Performance evaluation of a MongoDB and hadoop platform for scientific data analysis
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
  2. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Distributed programming languages
        Parallel programming languages

Published in
Science Cloud '13: Proceedings of the 4th ACM workshop on Scientific cloud computing
June 2013
64 pages
ISBN:9781450319799
DOI:10.1145/2465848
General Chair:
Kyle Chard
University of Chicago
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Hadoop
MapReduce
MongoDB
NoSQL
distributed computing
scientific computing
Qualifiers
- research-article
Acceptance Rates
Science Cloud '13 Paper Acceptance Rate: 7 of 14 submissions, 50%
Overall Acceptance Rate: 44 of 151 submissions, 29%
