DOI: 10.1145/2465848.2465849
research-article

Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis

Published: 17 June 2013

ABSTRACT

Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open-source implementation of MapReduce, has been evaluated, utilized, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project use MongoDB, a document-oriented NoSQL store. However, there is limited understanding of the performance trade-offs of using these two technologies together. In this paper we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, toward the goal of identifying the right software environment for scientific data analysis.


Published in

Science Cloud '13: Proceedings of the 4th ACM workshop on Scientific cloud computing
June 2013
64 pages
ISBN: 9781450319799
DOI: 10.1145/2465848

Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Science Cloud '13 paper acceptance rate: 7 of 14 submissions, 50%
Overall acceptance rate: 44 of 151 submissions, 29%
