
NoDB: efficient query execution on raw data files

Published: 23 November 2015

Abstract

As data collections grow, users face increasing bottlenecks in their data analysis. More data means more time to prepare and load the data into the database before executing the desired queries. Many applications, for example, scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time, that is, the time between getting the data and retrieving its first useful results. For many applications, data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, when we are expected to have far more data than we can move or store, let alone analyze.

We present the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern Database Management System (DBMS), we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. We conclude that NoDB systems are feasible to design and implement over a modern DBMS, bringing an unprecedented positive effect in usability and performance.
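The two mechanisms named in the abstract, positional information about fields in the raw file and a cache of already converted values, can be illustrated with a small Python sketch. This is not the authors' implementation. The class `RawFileScanner`, its method names, and the assumed file layout (comma-delimited, no header, no quoting, no blank lines, every row complete) are hypothetical and chosen only to keep the example short.

```python
import os
import tempfile


class RawFileScanner:
    """Hypothetical sketch: query columns of a delimited raw file in place."""

    def __init__(self, path, delimiter=","):
        self.path = path
        self.delimiter = delimiter.encode()
        self.positions = {}       # column -> [(start, end) byte offsets, one per row]
        self.cache = {}           # (column, converter) -> converted values
        self.tokenize_passes = 0  # how many times the file had to be tokenized

    def _tokenize_up_to(self, col):
        """Scan the file once, recording byte positions for columns 0..col."""
        new_positions = {c: [] for c in range(col + 1) if c not in self.positions}
        offset = 0  # byte offset of the current row within the file
        with open(self.path, "rb") as f:
            for line in f:
                body = line.rstrip(b"\r\n")
                start = 0
                for c in range(col + 1):
                    if start > len(body):
                        raise ValueError("row has fewer fields than requested")
                    end = body.find(self.delimiter, start)
                    if end == -1:
                        end = len(body)
                    if c in new_positions:
                        new_positions[c].append((offset + start, offset + end))
                    start = end + 1
                offset += len(line)
        self.positions.update(new_positions)
        self.tokenize_passes += 1

    def column(self, col, convert=int):
        """Return all values of one column, converted with `convert`."""
        key = (col, convert)
        if key in self.cache:          # cache hit: no file access, no conversion
            return self.cache[key]
        if col not in self.positions:  # first touch: tokenize and remember positions
            self._tokenize_up_to(col)
        values = []
        with open(self.path, "rb") as f:
            for start, end in self.positions[col]:
                f.seek(start)          # jump straight to the field, no re-parsing
                values.append(convert(f.read(end - start).decode()))
        self.cache[key] = values
        return values


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "readings.csv")
        with open(path, "w", newline="") as f:
            f.write("1,10,100\n2,20,200\n3,30,300\n")
        scanner = RawFileScanner(path)
        print(scanner.column(1))  # tokenizes once, records positions for columns 0 and 1
        print(scanner.column(0))  # reuses the recorded positions, no tokenizing
        print(scanner.column(1))  # served from the cache
```

In this sketch the positional information grows only as queries touch new columns, which is the sense in which it is adaptive. Repeated queries on the same column skip both tokenizing and type conversion, the two costs the abstract identifies for in situ processing.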


Published in

Communications of the ACM, Volume 58, Issue 12
December 2015
115 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/2847579
Editor: Moshe Y. Vardi

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, Research, Refereed