NoDB: efficient query execution on raw data files

Authors:
Ioannis Alagiannis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Renata Borovica-Gajic, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Miguel Branco, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Stratos Idreos, Harvard University, Cambridge, MA
Anastasia Ailamaki, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Communications of the ACM, Volume 58, Issue 12, December 2015, pp. 112–121. https://doi.org/10.1145/2830508

Published: 23 November 2015

Abstract

As data collections grow ever larger, users face increasing bottlenecks in their data analysis. More data means more time to prepare and load the data into the database before executing the desired queries. Many applications, for example scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time, that is, the time between getting the data and retrieving its first useful results. For many applications data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, when we expect to have far more data than we can move or store, let alone analyze.

We present here the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern database management system (DBMS), we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. We conclude that NoDB systems are feasible to design and implement over a modern DBMS, bringing an unprecedented positive effect on usability and performance.
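To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a well-formed, integer-only, comma-separated byte buffer with no quoting. The class `RawTable`, its `column` method, and the `delimiters_scanned` counter are hypothetical names introduced only for illustration. The sketch keeps a positional map (byte offsets of fields already located) so that later queries tokenize from the nearest known field rather than from the start of each row, and a cache of converted values so that data type conversion is not repeated.

```python
from typing import Dict, List


class RawTable:
    """Illustrative in situ access to a CSV byte buffer (hypothetical design)."""

    def __init__(self, raw: bytes, delimiter: bytes = b",") -> None:
        self.raw = raw
        self.delimiter = delimiter
        # Positional map: column number -> start offset of that field in each row.
        # Column 0 is seeded with the row starts, found in a single pass.
        self.positions: Dict[int, List[int]] = {0: self._row_starts()}
        # Cache: column number -> values already converted to int.
        self.cache: Dict[int, List[int]] = {}
        # Counts tokenizing steps so the saved work is observable.
        self.delimiters_scanned = 0

    def _row_starts(self) -> List[int]:
        starts, pos = [], 0
        while pos < len(self.raw):
            starts.append(pos)
            newline = self.raw.find(b"\n", pos)
            pos = len(self.raw) if newline == -1 else newline + 1
        return starts

    def _field_end(self, start: int) -> int:
        # A field ends at the next delimiter, the next newline, or end of buffer.
        candidates = [
            p
            for p in (self.raw.find(self.delimiter, start), self.raw.find(b"\n", start))
            if p != -1
        ]
        return min(candidates) if candidates else len(self.raw)

    def column(self, col: int) -> List[int]:
        # Cache hit: no tokenizing and no type conversion at all.
        if col in self.cache:
            return self.cache[col]
        # Start from the closest already-indexed column to the left.
        nearest = max(c for c in self.positions if c <= col)
        starts = []
        for pos in self.positions[nearest]:
            for _ in range(col - nearest):
                pos = self.raw.index(self.delimiter, pos) + 1
                self.delimiters_scanned += 1
            starts.append(pos)
        # Remember where this column lives, then convert and cache its values.
        self.positions[col] = starts
        values = [int(self.raw[s:self._field_end(s)]) for s in starts]
        self.cache[col] = values
        return values


table = RawTable(b"1,10,100\n2,20,200\n3,30,300")
second = table.column(1)  # tokenizes one delimiter per row
third = table.column(2)   # continues from column 1's offsets, not from row starts
again = table.column(2)   # served from the cache
```

In this sketch the map and cache grow only for the columns that queries actually touch, which is one simple way to read "adaptive"; a real system would also need to bound memory use and handle quoting, variable schemas, and updates to the underlying file.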


Index Terms

NoDB: efficient query execution on raw data files
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications

Published in
Communications of the ACM Volume 58, Issue 12
December 2015
115 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/2847579
Editor: Moshe Y. Vardi
Association for Computing Machinery, New York, NY
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States