ABSTRACT
Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. Both MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open-source implementation of MapReduce, has been evaluated, used, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project use MongoDB, a document-oriented NoSQL store. However, there is limited understanding of the performance trade-offs of using these two technologies together. In this paper, we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, with the goal of identifying the right software environment for scientific data analysis.
Index Terms
- Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis