Abstract
Cost and cardinality estimation is vital to the query optimizer, because it guides query plan selection. However, traditional empirical cost and cardinality estimation techniques cannot provide high-quality estimates, because they may not effectively capture the correlation between multiple tables. Recently, the database community has shown that learning-based cardinality estimation outperforms the empirical methods. However, existing learning-based methods have several limitations. First, they focus on estimating cardinality and cannot estimate cost. Second, they are either too heavyweight or struggle to represent complicated structures, e.g., complex predicates.
To address these challenges, we propose an effective end-to-end learning-based cost estimation framework based on a tree-structured model, which estimates cost and cardinality simultaneously. We propose effective feature extraction and encoding techniques that consider both queries and physical operations, and we embed these features into our tree-structured model. We also propose an effective method to encode string values, which improves the generalization ability for predicate matching. As it is prohibitively expensive to enumerate all string values, we design a pattern-based method, which selects patterns to cover string values and uses the patterns to embed string values. We conducted experiments on real-world datasets, and the experimental results showed that our method outperformed the baselines.
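As an illustration of the pattern-based idea, the following is a minimal sketch of selecting a small set of patterns that cover a collection of string values. It is not the method from the paper: it assumes that candidate patterns are simply short substrings, that selection is a greedy set-cover heuristic, and that a string is encoded as a multi-hot vector over the chosen patterns rather than a learned embedding. The names `candidate_patterns`, `select_patterns`, `encode`, and the `budget` and `max_len` parameters are hypothetical.

```python
from __future__ import annotations


def candidate_patterns(value: str, max_len: int = 3) -> set[str]:
    """Hypothetical pattern generator: every substring of length 1..max_len."""
    return {
        value[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(value) - n + 1)
    }


def select_patterns(values: list[str], budget: int, max_len: int = 3) -> list[str]:
    """Greedy set cover (an assumed heuristic): repeatedly pick the pattern
    that covers the most still-uncovered values, up to `budget` patterns."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order

    # Map each candidate pattern to the indices of the values it occurs in.
    covers: dict[str, set[int]] = {}
    for idx, value in enumerate(unique):
        for pattern in candidate_patterns(value, max_len):
            covers.setdefault(pattern, set()).add(idx)
    if not covers:
        return []

    uncovered = set(range(len(unique)))
    chosen: list[str] = []
    while uncovered and len(chosen) < budget:
        # Ties are broken by preferring longer, then lexicographically larger
        # patterns, so the result is deterministic.
        best = max(covers, key=lambda p: (len(covers[p] & uncovered), len(p), p))
        gain = covers[best] & uncovered
        if not gain:  # remaining values (e.g. empty strings) cannot be covered
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


def encode(value: str, patterns: list[str]) -> list[int]:
    """Multi-hot encoding (an assumption): 1 where the pattern occurs in value."""
    return [1 if p in value else 0 for p in patterns]


if __name__ == "__main__":
    strings = ["Din", "Dino", "Dinos", "Sch", "Schla"]
    patterns = select_patterns(strings, budget=2)
    print(patterns)
    print(encode("Dinosaur", patterns))
```

Because the encoding depends only on which patterns occur in a string, a value that never appeared among the covered strings (such as `"Dinosaur"` above) still receives a meaningful vector, which is the kind of generalization for predicate matching that the abstract describes.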