DOI: 10.1145/2723372.2742797
research-article

Spark SQL: Relational Data Processing in Spark

Published: 27 May 2015

ABSTRACT

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
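
The notion of composable optimizer rules mentioned in the abstract can be illustrated with a toy sketch. The Python below is not Catalyst's API (the abstract states that Catalyst is built with Scala language features); the names `Literal`, `Attribute`, `Add`, `fold_constants`, `drop_zero`, `transform`, and `optimize` are all hypothetical, chosen only to show how small, independent rewrite rules might be applied to an expression tree until it stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# A tiny, hypothetical expression tree (not Catalyst's real node types).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Literal(a), Literal(b)) into Literal(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 and 0 + x into x."""
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform(e: Expr, rules: List[Rule]) -> Expr:
    """One bottom-up pass: rewrite children first, then try each rule on the node."""
    if isinstance(e, Add):
        e = Add(transform(e.left, rules), transform(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            return rewritten  # first matching rule wins for this pass
    return e


def optimize(e: Expr, rules: List[Rule], max_iters: int = 100) -> Expr:
    """Repeat passes until the tree reaches a fixed point (or max_iters is hit)."""
    for _ in range(max_iters):
        rewritten = transform(e, rules)
        if rewritten == e:
            return e
        e = rewritten
    return e


expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(optimize(expr, [fold_constants, drop_zero]))
# prints: Add(left=Attribute(name='x'), right=Literal(value=3))
```

Each rule here is an ordinary function that knows nothing about the others, so adding a new optimization in this sketch means appending one more function to the list passed to `optimize`.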

Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



Acceptance Rates

SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%
Overall acceptance rate: 785 of 4,003 submissions, 20%
