ABSTRACT
Scientific applications often store datasets in self-describing data file formats, such as HDF5 and netCDF. Efficiently searching the metadata within these files remains challenging because of the sheer size of the datasets. Existing solutions extract the metadata and store it in an external database management system (DBMS) to locate desired data. However, this practice introduces significant overhead and complexity in both extraction and querying. In this research, we propose a novel Metadata Indexing and Querying Service (MIQS), which removes the external DBMS and uses an in-memory index to achieve efficient metadata searching. MIQS follows the self-contained data management paradigm and provides portable, schema-free metadata indexing and querying functionality for self-describing file formats. We evaluated MIQS against a state-of-the-art MongoDB-based metadata indexing solution. MIQS achieved up to a 99% reduction in index construction time and up to a 172k× improvement in search performance, with up to a 75% reduction in memory footprint.
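To make the idea of a schema-free, in-memory metadata index concrete, the following is a minimal illustrative sketch, not the MIQS implementation. It assumes metadata arrives as plain attribute name/value pairs attached to objects inside files; the class name `MetadataIndex`, its methods, and the sample file names and attribute values are all hypothetical.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index (hypothetical, not MIQS itself).

    Maps attribute name -> attribute value -> set of (file, object) locations.
    """

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, file_path, object_path, attributes):
        # Schema-free: any attribute name and hashable value is accepted as-is.
        for name, value in attributes.items():
            self._index[name][value].add((file_path, object_path))

    def query(self, name, value):
        # Exact-match lookup; returns a sorted list of (file, object) pairs.
        # .get() is used so that a miss does not create empty index entries.
        return sorted(self._index.get(name, {}).get(value, set()))


# Made-up sample metadata for two objects in two files.
index = MetadataIndex()
index.add("run1.h5", "/detector/a", {"instrument": "cam0", "plate": 3586})
index.add("run2.h5", "/detector/b", {"instrument": "cam0", "plate": 4055})

hits = index.query("instrument", "cam0")
print(hits)  # [('run1.h5', '/detector/a'), ('run2.h5', '/detector/b')]
```

Because the index lives in the same process as the application, a lookup is a pair of dictionary accesses rather than a round trip to an external DBMS, which is the kind of overhead the abstract describes removing.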
Index Terms
- MIQS: metadata indexing and querying service for self-describing file formats