research-article

The DataPath system: a data-centric analytic processing engine for large data warehouses

Authors:
Subi Arumugam, University of Florida, Gainesville, FL, USA
Alin Dobra, University of Florida, Gainesville, FL, USA
Christopher M. Jermaine, Rice University, Houston, TX, USA
Niketan Pansare, Rice University, Houston, TX, USA
Luis Perez, Rice University, Houston, TX, USA

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, June 2010, Pages 519–530, https://doi.org/10.1145/1807167.1807224

Published: 06 June 2010

ABSTRACT

Since the 1970s, database systems have been "compute-centric": when a computation needs data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy. Second, the pull model makes it difficult or impossible for multiple queries or operations that are interested in the same data to amortize the bandwidth and latency costs associated with their data access.

In this paper, we describe a purely push-based research prototype database system called DataPath. DataPath is "data-centric": queries do not request data. Instead, data are automatically pushed onto processors, where they are processed by any interested computation. We show experimentally on a multi-terabyte benchmark that this basic design principle makes for a very lean and fast database system.
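
The contrast between the two models can be illustrated with a small sketch. This is not DataPath's implementation; it is a toy model that assumes an in-memory table read in fixed-size chunks, and every name in it (`CountingTable`, `Query`, `run_pull`, `run_push`) is hypothetical. It only shows why a single pushed scan lets several queries share the cost of one pass over the data, whereas a pull model repeats the scan for each query.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


class CountingTable:
    """Hypothetical storage wrapper that counts how many chunks are read."""

    def __init__(self, rows: List[int], chunk_size: int) -> None:
        self.rows = rows
        self.chunk_size = chunk_size
        self.chunks_read = 0

    def scan(self) -> Iterator[List[int]]:
        # Each yielded chunk stands in for one trip through the memory hierarchy.
        for start in range(0, len(self.rows), self.chunk_size):
            self.chunks_read += 1
            yield self.rows[start:start + self.chunk_size]


@dataclass(frozen=True)
class Query:
    """A toy aggregate query: an initial state and a per-chunk update."""
    name: str
    initial: int
    consume: Callable[[int, List[int]], int]


def run_pull(table: CountingTable, queries: List[Query]) -> List[int]:
    """Compute-centric: each query requests the data, so each one scans the table."""
    results = []
    for query in queries:
        state = query.initial
        for chunk in table.scan():
            state = query.consume(state, chunk)
        results.append(state)
    return results


def run_push(table: CountingTable, queries: List[Query]) -> List[int]:
    """Data-centric: one scan pushes every chunk to all interested queries."""
    states = [query.initial for query in queries]
    for chunk in table.scan():
        for i, query in enumerate(queries):
            states[i] = query.consume(states[i], chunk)
    return states


queries = [
    Query("sum", 0, lambda acc, chunk: acc + sum(chunk)),
    Query("even_count", 0, lambda acc, chunk: acc + sum(1 for v in chunk if v % 2 == 0)),
]

pull_table = CountingTable(list(range(1, 11)), chunk_size=4)
push_table = CountingTable(list(range(1, 11)), chunk_size=4)

pull_results = run_pull(pull_table, queries)
push_results = run_push(push_table, queries)
```

Both functions return the same answers, but in this toy setup `pull_table.chunks_read` grows with the number of queries while `push_table.chunks_read` stays at one pass over the table. The sketch ignores everything that makes the real problem hard, such as joins, scheduling, and parallel processors.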

Index Terms

The DataPath system: a data-centric analytic processing engine for large data warehouses
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167
General Chair: Ahmed Elmagarmid, Purdue University, USA
Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
algorithms
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%