DOI: 10.1145/2723372.2723713
research-article

Learning Generalized Linear Models Over Normalized Data

Published: 27 May 2015

ABSTRACT

Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
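The idea of pushing the gradient computation through a join can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: a two-table key-foreign key join, an entity table `S` holding the label, some features, and a foreign key, an attribute table `R` keyed by that foreign key, logistic loss with labels in {-1, +1}, and one batch gradient computation. The function and variable names are hypothetical. The first function computes the gradient over the joined tuples; the second computes the same gradient without materializing the join, by computing each `R` tuple's partial inner product once and by summing the per-tuple loss slopes per foreign key before touching `R`'s features.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss_slope(z, y):
    # Derivative of the logistic loss log(1 + exp(-y*z)) with respect to z,
    # for a label y in {-1, +1}.
    return -y / (1.0 + math.exp(y * z))

def gradient_over_join(S, R, w_s, w_r):
    # Baseline: visit every joined tuple, so each R tuple's features are
    # re-read and re-multiplied once per matching S tuple.
    grad_s = [0.0] * len(w_s)
    grad_r = [0.0] * len(w_r)
    for y, x_s, fk in S:
        x_r = R[fk]
        g = loss_slope(dot(w_s, x_s) + dot(w_r, x_r), y)
        for j, v in enumerate(x_s):
            grad_s[j] += g * v
        for j, v in enumerate(x_r):
            grad_r[j] += g * v
    return grad_s, grad_r

def gradient_factorized(S, R, w_s, w_r):
    # Step 1: one pass over R to get each tuple's partial inner product.
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}
    # Step 2: one pass over S; reuse the partial inner products and
    # accumulate the scalar loss slopes per foreign key.
    grad_s = [0.0] * len(w_s)
    slope_sum = {rid: 0.0 for rid in R}
    for y, x_s, fk in S:
        g = loss_slope(dot(w_s, x_s) + partial[fk], y)
        for j, v in enumerate(x_s):
            grad_s[j] += g * v
        slope_sum[fk] += g
    # Step 3: one more pass over R to scale each tuple's features by its
    # accumulated slope.
    grad_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, v in enumerate(x_r):
            grad_r[j] += slope_sum[rid] * v
    return grad_s, grad_r

# Toy data (made up for illustration): R maps key -> features,
# S holds (label, features, foreign key).
R = {1: [1.0, 2.0], 2: [0.5, -1.0]}
S = [(1, [0.2], 1), (-1, [1.5], 1), (1, [-0.7], 2), (-1, [0.3], 2), (1, [2.0], 1)]
w_s, w_r = [0.1], [0.3, -0.2]

joined = gradient_over_join(S, R, w_s, w_r)
factorized = gradient_factorized(S, R, w_s, w_r)
```

Both functions return the same gradient up to floating-point rounding. In this sketch the factorized version does work on `R`'s features proportional to the size of `R` rather than to the size of the join output, which is where the redundancy described above is avoided; it also needs the extra per-key state (`partial` and `slope_sum`), which hints at why such an approach need not always be the fastest.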


Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
