DOI: 10.1145/3295500.3356146
research-article

MIQS: metadata indexing and querying service for self-describing file formats

Published: 17 November 2019

ABSTRACT

Scientific applications often store datasets in self-describing data file formats such as HDF5 and netCDF. Efficiently searching the metadata within these files remains challenging because of the sheer size of the datasets. Existing solutions extract the metadata and store it in an external database management system (DBMS) to locate the desired data, but this practice introduces significant overhead and complexity in both extraction and querying. In this research, we propose a novel Metadata Indexing and Querying Service (MIQS), which removes the external DBMS and uses an in-memory index to achieve efficient metadata searching. MIQS follows the self-contained data management paradigm and provides portable, schema-free metadata indexing and querying for self-describing file formats. We evaluated MIQS against the state-of-the-art MongoDB-based metadata indexing solution. MIQS achieved up to a 99% reduction in index construction time, up to a 172k× improvement in search performance, and up to a 75% reduction in memory footprint.
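To make the idea of a schema-free, in-memory metadata index concrete, the following is a minimal Python sketch. It is not the MIQS implementation, whose data structures are not described in this abstract. The class name `MetadataIndex`, its methods, and the sample file names and attributes are all hypothetical; the attributes stand in for key-value metadata that would normally be read from HDF5 or netCDF files.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index: attribute name -> attribute value -> object locations.

    Illustrative only; this is an assumed design, not the structure used by MIQS.
    """

    def __init__(self):
        # No schema is declared up front: any attribute name and any hashable
        # value can be indexed as soon as it is encountered.
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, file_path, object_path, attributes):
        """Index every attribute attached to one object inside one file."""
        for name, value in attributes.items():
            self._index[name][value].add((file_path, object_path))

    def query(self, name, value):
        """Return sorted (file, object) pairs whose attribute `name` equals `value`."""
        return sorted(self._index.get(name, {}).get(value, set()))


# Hypothetical metadata, standing in for attributes extracted from real files.
index = MetadataIndex()
index.add("run1.h5", "/obs/temperature", {"units": "K", "instrument": "A"})
index.add("run2.h5", "/obs/pressure", {"units": "Pa", "instrument": "A"})
index.add("run2.h5", "/obs/temperature", {"units": "K", "cycle": 7})

# Exact-match lookup answered entirely from memory, with no external DBMS.
matches = index.query("units", "K")
print(matches)
```

Because the lookup is two dictionary accesses, an exact-match query does not scan the files or contact a separate database process, which is the general trade-off the abstract describes.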


    • Published in

      SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2019
      1921 pages
      ISBN:9781450362290
      DOI:10.1145/3295500

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
