ABSTRACT
MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. Some vendors have enhanced their data warehouse systems by integrating MapReduce into them. However, existing MapReduce-based query processing systems, such as Hive, lack the query optimization capabilities of conventional database systems. Given an SQL query, Hive translates it into a set of MapReduce jobs sentence by sentence. This design assumes that users can optimize their queries before submitting them to the system. Unfortunately, manual query optimization is time-consuming and difficult, even for an experienced database user or administrator. In this paper, we propose a query optimization scheme for MapReduce-based processing systems. Specifically, we embed into Hive a query optimizer that is designed to generate an efficient query plan based on our proposed cost model. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
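To make the idea of cost-based plan selection concrete, the following is a minimal Python sketch of how an optimizer might pick a join order. It is not the cost model proposed in the paper, which the abstract does not specify. The sketch assumes a toy cost (the sum of estimated intermediate result sizes in a left-deep plan), and the names `plan_cost`, `best_left_deep_plan`, `sizes`, and `selectivity` are hypothetical, introduced only for illustration.

```python
from itertools import permutations


def plan_cost(order, sizes, selectivity):
    """Toy cost of a left-deep join plan: the sum of estimated
    intermediate result sizes. This is an illustrative assumption,
    not the cost model from the paper."""
    joined = [order[0]]
    rows = sizes[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivities of all join predicates that link the
        # new table to tables already in the plan. A missing predicate
        # is treated as a cross product (selectivity 1.0).
        sel = 1.0
        for t in joined:
            sel *= selectivity.get(frozenset((t, table)), 1.0)
        rows = rows * sizes[table] * sel
        cost += rows
        joined.append(table)
    return cost


def best_left_deep_plan(sizes, selectivity):
    """Exhaustively enumerate join orders and return the cheapest one.
    Only practical for a handful of tables."""
    return min(permutations(sizes),
               key=lambda order: plan_cost(order, sizes, selectivity))


# Hypothetical table sizes (rows) and join selectivities.
sizes = {"A": 1000, "B": 100, "C": 10}
selectivity = {
    frozenset(("A", "B")): 0.01,
    frozenset(("B", "C")): 0.1,
}

best = best_left_deep_plan(sizes, selectivity)
print(best, plan_cost(best, sizes, selectivity))
```

With these made-up numbers, joining the two small tables first is cheaper than the order A, B, C in which the query might have been written, which is the kind of rewrite a user would otherwise have to find by hand.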
Index Terms
- Query optimization for massively parallel data processing