DOI: 10.1145/2463676.2463701
research-article

CS2: a new database synopsis for query estimation

Published: 22 June 2013

ABSTRACT

Fast and accurate estimates for complex queries are highly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimates for all queries with joins and arbitrary selections. Unlike state-of-the-art techniques, CS2 does not rely entirely on simple random samples; it consists mainly of correlated sample tuples that retain join relationships while using less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called the reverse estimator, to fully utilize correlated sample tuples for query estimation. We show both theoretically and empirically that the reverse estimator is unbiased and accurate when used with CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and yields more accurate estimates than existing methods under the same space budget.
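
To make the idea of correlated sample tuples concrete, the following is a minimal sketch of generic correlated sampling over a foreign-key join. It is not the reverse estimator, whose details the abstract does not give. The relation names (`orders`, `customers`), the field names, and both helper functions are hypothetical and chosen only for illustration. The sketch draws a simple random sample from one relation, keeps the tuples it joins with in the other relation, and scales the number of sample hits to estimate the result size of a join query with selections.

```python
import random

def build_correlated_sample(orders, customers, n, seed=0):
    """Draw n orders uniformly without replacement and keep, for each
    sampled order, the customer tuple it joins with (hypothetical schema)."""
    rng = random.Random(seed)
    sampled_orders = rng.sample(orders, n)
    by_id = {c["id"]: c for c in customers}
    # Only customers reachable from the sampled orders are stored.
    sampled_customers = {o["cust_id"]: by_id[o["cust_id"]] for o in sampled_orders}
    return sampled_orders, sampled_customers

def estimate_join_size(sampled_orders, sampled_customers, total_orders,
                       order_pred, cust_pred):
    """Scale the fraction of sampled orders whose joined pair satisfies
    both selections up to the full relation size."""
    hits = sum(
        1 for o in sampled_orders
        if order_pred(o) and cust_pred(sampled_customers[o["cust_id"]])
    )
    return total_orders * hits / len(sampled_orders)

# Toy data (an assumption for illustration): every order references
# exactly one customer through cust_id.
customers = [{"id": i, "region": "EU" if i % 2 == 0 else "US"} for i in range(10)]
orders = [{"id": i, "cust_id": i % 10, "amount": i} for i in range(100)]

def big_order(o):
    return o["amount"] >= 50

def in_eu(c):
    return c["region"] == "EU"

s_orders, s_customers = build_correlated_sample(orders, customers, n=20)
estimate = estimate_join_size(s_orders, s_customers, len(orders), big_order, in_eu)
print(f"estimated result size: {estimate}")
```

Because each order joins with exactly one customer in this toy schema, the scaled hit count is an unbiased estimate of the join result size under simple random sampling of `orders`. Joins that are not key/foreign-key, or samples that are not uniform, would need a different estimator.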


      Published in

        SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
        June 2013
        1322 pages
        ISBN: 9781450320375
        DOI: 10.1145/2463676

        Copyright © 2013 ACM


        Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

        SIGMOD '13 paper acceptance rate: 76 of 372 submissions, 20%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
