research-article

CS2: a new database synopsis for query estimation

Authors:
Feng Yu, Southern Illinois University, Carbondale, IL, USA
Wen-Chi Hou, Southern Illinois University, Carbondale, IL, USA
Cheng Luo, Coppin State University, Baltimore, MD, USA
Dunren Che, Southern Illinois University, Carbondale, IL, USA
Mengxia Zhu, Southern Illinois University, Carbondale, IL, USA

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, June 2013, Pages 469–480. https://doi.org/10.1145/2463676.2463701

Published: 22 June 2013

ABSTRACT

Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.
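
To make the idea of correlated sample tuples concrete, the following is a minimal sketch, not the paper's reverse sample or reverse estimator, whose details the abstract does not give. It assumes a two-table schema with hypothetical names (`orders`, `lineitems`, `order_id`), where `order_id` is a unique key in `orders`. One relation is sampled uniformly, every tuple that joins to a sampled tuple is kept, and the join count observed in the sample is scaled by N / n, a standard expansion estimator for simple random sampling without replacement.

```python
import random


def build_correlated_sample(orders, lineitems, fraction, seed=0):
    """Sample `orders` uniformly without replacement, then keep every
    lineitem that joins to a sampled order (hypothetical schema)."""
    rng = random.Random(seed)
    n = max(1, round(len(orders) * fraction))
    sampled_orders = rng.sample(orders, n)
    sampled_keys = {o["order_id"] for o in sampled_orders}
    correlated_items = [li for li in lineitems if li["order_id"] in sampled_keys]
    return sampled_orders, correlated_items


def estimate_join_size(total_orders, sampled_orders, correlated_items,
                       order_pred, item_pred):
    """Estimate |sigma(orders) JOIN sigma(lineitems)| by scaling the join
    count seen in the sample by N / n."""
    qualifying = {o["order_id"] for o in sampled_orders if order_pred(o)}
    hits = sum(
        1 for li in correlated_items
        if li["order_id"] in qualifying and item_pred(li)
    )
    return hits * total_orders / len(sampled_orders)


if __name__ == "__main__":
    # Synthetic data, for illustration only.
    orders = [{"order_id": i, "region": "EU" if i % 2 == 0 else "US"}
              for i in range(1000)]
    lineitems = [{"order_id": i % 1000, "qty": i % 7} for i in range(5000)]

    sample, items = build_correlated_sample(orders, lineitems, fraction=0.1)
    estimate = estimate_join_size(
        len(orders), sample, items,
        order_pred=lambda o: o["region"] == "EU",
        item_pred=lambda li: li["qty"] >= 3,
    )
    print(f"estimated join size: {estimate:.0f}")
```

Because the joining tuples are stored alongside the sampled ones, selections on either table can be applied at estimation time without revisiting the base data. With `fraction=1.0` the sketch degenerates to the exact join count, which is a convenient sanity check.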


Index Terms

CS2: a new database synopsis for query estimation
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)


Published in
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676
General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)
Program Chair: Dimitris Papadias (HKUST)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
database synopsis
query optimization
selectivity estimation
Qualifiers
- research-article

Acceptance Rates
SIGMOD '13 Paper Acceptance Rate: 76 of 372 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%