ABSTRACT
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
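
The idea of an optimizer built from composable rules can be illustrated with a toy sketch. The code below is not Catalyst and does not use any Spark API; it is a minimal Python analogue, assuming a tiny expression tree with hypothetical node types (`Literal`, `Attribute`, `Add`) and two hypothetical rewrite rules (`fold_constants`, `drop_zero`). Each rule rewrites only the pattern it recognizes, and a small driver applies the rules repeatedly until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical expression nodes for illustration only (not Spark classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Apply `rule` to every node of the tree, children before parents."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rewrite Add(Literal(a), Literal(b)) to Literal(a + b)."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Rewrite Add(e, Literal(0)) or Add(Literal(0), e) to e."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Run all rules over the tree until no rule changes it (a fixed point)."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = transform_up(new_expr, rule)
        if new_expr == expr:
            return expr
        expr = new_expr
    return expr


# x + (1 + (2 + -3)): the constants fold to 0, then the zero is dropped.
tree = Add(Attribute("x"), Add(Literal(1), Add(Literal(2), Literal(-3))))
optimized = optimize(tree, [fold_constants, drop_zero])
print(optimized)
```

Neither rule knows about the other, yet together they reduce the tree to the bare attribute `x`. That is the sense in which small rules compose: a new optimization is added by appending one more function to the list.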