Spark SQL: Relational Data Processing in Spark

Authors:
Michael Armbrust (Databricks, San Francisco, USA)
Reynold S. Xin (Databricks, San Francisco, USA)
Cheng Lian (Databricks, San Francisco, USA)
Yin Huai (Databricks, San Francisco, USA)
Davies Liu (Databricks, San Francisco, USA)
Joseph K. Bradley (Databricks, San Francisco, USA)
Xiangrui Meng (Databricks, San Francisco, USA)
Tomer Kaftan (UC Berkeley, Berkeley, USA)
Michael J. Franklin (UC Berkeley, Berkeley, USA)
Ali Ghodsi (Databricks, Berkeley, USA)
Matei Zaharia (Databricks, San Francisco, USA)

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, Pages 1383–1394
https://doi.org/10.1145/2723372.2742797

Published: 27 May 2015

ABSTRACT

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
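
The idea of an optimizer built from small, composable rules can be illustrated with a toy example. The sketch below is not Catalyst's API (Catalyst is written in Scala and relies on pattern matching over its own tree classes); it is a minimal Python analogue in which every name (`Literal`, `Attribute`, `Add`, `fold_constants`, `drop_zero`, `transform`, `optimize`) is hypothetical and chosen only for illustration. Each rule is a function that either rewrites an expression node or returns `None`, and the driver applies the rules bottom-up until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# Hypothetical expression-tree node types (illustrative only, not Catalyst classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Literal(a), Literal(b)) into Literal(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 and 0 + x into x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform(e: Expr, rules: List[Rule]) -> Expr:
    """Apply every rule once to each node, visiting children before parents."""
    if isinstance(e, Add):
        e = Add(transform(e.left, rules), transform(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            e = rewritten
    return e


def optimize(e: Expr, rules: List[Rule], max_iters: int = 100) -> Expr:
    """Repeat the transformation pass until the tree no longer changes."""
    for _ in range(max_iters):
        new = transform(e, rules)
        if new == e:
            return e
        e = new
    return e


if __name__ == "__main__":
    # (x + 0) + (1 + 2)
    tree = Add(Add(Attribute("x"), Literal(0)), Add(Literal(1), Literal(2)))
    print(optimize(tree, [fold_constants, drop_zero]))
```

Because each rule inspects only one node shape, new rules can be added to the list without touching the existing ones, which is the composability property the abstract attributes to Catalyst.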

Index Terms

Spark SQL: Relational Data Processing in Spark
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372
General Chair: Timos Sellis (RMIT University, Australia)
Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data warehouse
databases
hadoop
machine learning
spark
Qualifiers
- research-article
Acceptance Rates
SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
