ABSTRACT
Fast and accurate estimates for complex queries are highly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), that provides rapid and accurate result size estimates for all queries with joins and arbitrary selections. Unlike state-of-the-art techniques, CS2 does not rely entirely on simple random samples; it consists mainly of correlated sample tuples, which retain join relationships while using less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called the reverse estimator, to fully utilize correlated sample tuples for query estimation. We show, both theoretically and empirically, that the reverse estimator is unbiased and accurate when used with CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and yields more accurate estimates than existing methods under the same space budget.