research-article

Query optimization for massively parallel data processing

Authors:
Sai Wu, National University of Singapore, Singapore
Feng Li, National University of Singapore, Singapore
Sharad Mehrotra, University of California at Irvine
Beng Chin Ooi, National University of Singapore, Singapore

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing, October 2011, Article No.: 12, Pages 1–13
https://doi.org/10.1145/2038916.2038928

Published: 26 October 2011

ABSTRACT

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. Some vendors have enhanced their data warehouse systems by integrating MapReduce into them. However, existing MapReduce-based query processing systems, such as Hive, fall short of conventional database systems in query optimization and overall competency. Given an SQL query, Hive translates it into a set of MapReduce jobs sentence by sentence. This design assumes that users can optimize their queries before submitting them to the system. Unfortunately, manual query optimization is time-consuming and difficult, even for an experienced database user or administrator. In this paper, we propose a query optimization scheme for MapReduce-based processing systems. Specifically, we embed into Hive a query optimizer that generates an efficient query plan based on our proposed cost model. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
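
To make the idea of cost-based plan selection concrete, the following is a minimal sketch of how an optimizer might compare join orders under a cost model. It is not the cost model or the algorithm proposed in the paper: the table names, cardinalities, selectivities, and the functions plan_cost and best_plan are all hypothetical, and the cost is simply the total number of intermediate rows produced.

```python
from itertools import permutations

# Hypothetical table cardinalities (rows); not taken from the paper.
CARD = {"orders": 1_000, "lineitem": 4_000, "customer": 100}

# Hypothetical join selectivities between table pairs. Pairs that are
# missing have no join predicate and are treated as cross products.
SEL = {
    frozenset(("orders", "lineitem")): 0.001,
    frozenset(("orders", "customer")): 0.01,
}


def plan_cost(order):
    """Toy cost of a left-deep join order: total intermediate rows."""
    joined = [order[0]]
    rows = CARD[order[0]]
    cost = 0
    for table in order[1:]:
        # Combine the selectivities of all predicates that link the new
        # table to the tables already joined.
        sel = 1.0
        for prev in joined:
            sel *= SEL.get(frozenset((prev, table)), 1.0)
        rows = rows * CARD[table] * sel
        cost += rows
        joined.append(table)
    return cost


def best_plan(tables):
    """Exhaustively pick the cheapest left-deep join order."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    plan = best_plan(["orders", "lineitem", "customer"])
    print(plan, plan_cost(plan))
```

Under these assumed statistics, the order in which the query is written (orders, lineitem, customer) is not the cheapest one, which illustrates why translating a query in the order it is written can leave performance on the table. Exhaustive enumeration is used only to keep the sketch short; it does not scale beyond a handful of tables.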

Index Terms

Query optimization for massively parallel data processing
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published in
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
October 2011
377 pages
ISBN:9781450309769
DOI:10.1145/2038916
Program Chairs:
Jeffrey S. Chase, Duke University
Amr El Abbadi, Univ of California, Santa Barbara
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Hive
MapReduce
multi-way join
query optimization
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%