Abstract
As data collections become larger and larger, users are faced with increasing bottlenecks in their data analysis. More data means more time to prepare the data and load it into the database before executing the desired queries. Many applications, such as scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time, that is, the time between obtaining the data and retrieving the first useful results. For many applications, data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future: we are expected to have far more data than we can move or store, let alone analyze.
We present the design and roadmap of NoDB, a new paradigm in database systems that does not require data loading while still maintaining the full feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern database management system (DBMS), we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. We conclude that NoDB systems are feasible to design and implement over a modern DBMS, bringing an unprecedented positive effect in usability and performance.
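To make the idea of positional information and caching concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple comma-delimited file with no quoted fields, held in memory for brevity; the class name `RawCsvTable`, its methods, and the `tokenized_fields` counter are all hypothetical names introduced for illustration. The positional map records the byte range of every field tokenized so far, so later queries can jump to a field or resume tokenizing from the nearest known one, and the cache avoids repeating the type conversion.

```python
class RawCsvTable:
    """Query a delimited text file in place, with no load step (illustrative only)."""

    def __init__(self, path, delimiter=","):
        self.delimiter = delimiter.encode()
        with open(path, "rb") as f:
            self.data = f.read()  # assumption: file fits in memory for this sketch
        # Row boundaries: byte offsets where each row starts and ends.
        self.row_starts, self.row_ends = [], []
        pos = 0
        while pos < len(self.data):
            nl = self.data.find(b"\n", pos)
            if nl == -1:
                nl = len(self.data)
            self.row_starts.append(pos)
            self.row_ends.append(nl)
            pos = nl + 1
        self.positions = {}        # positional map: (row, col) -> (start, end)
        self.cache = {}            # converted values: (row, col) -> value
        self.tokenized_fields = 0  # counts tokenizing work, for illustration

    def _locate(self, row, col):
        """Return the byte range of a field, tokenizing only what is unknown."""
        if (row, col) in self.positions:
            return self.positions[(row, col)]
        # Resume from the nearest already-mapped field to the left, if any.
        known = col - 1
        while known >= 0 and (row, known) not in self.positions:
            known -= 1
        if known >= 0:
            pos = self.positions[(row, known)][1] + 1
        else:
            pos = self.row_starts[row]
        row_end = self.row_ends[row]
        for cur in range(known + 1, col + 1):
            end = self.data.find(self.delimiter, pos, row_end)
            if end == -1:
                end = row_end
            self.positions[(row, cur)] = (pos, end)
            self.tokenized_fields += 1
            pos = end + 1
        return self.positions[(row, col)]

    def column(self, col, convert=int):
        """Return one column, converting each value at most once."""
        values = []
        for row in range(len(self.row_starts)):
            key = (row, col)
            if key not in self.cache:
                start, end = self._locate(row, col)
                self.cache[key] = convert(self.data[start:end].decode())
            values.append(self.cache[key])
        return values
```

With a three-row, three-column file, a first query on column 1 tokenizes two fields per row; a later query on column 2 resumes from the recorded position of column 1 and tokenizes only one more field per row; repeating either query is served from the cache with no further parsing or conversion. The sketch keys the cache by position only, so it assumes each column is always read with the same conversion function.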