Learning Generalized Linear Models Over Normalized Data

Authors:
Arun Kumar, University of Wisconsin-Madison, Madison, WI, USA
Jeffrey Naughton, University of Wisconsin-Madison, Madison, WI, USA
Jignesh M. Patel, University of Wisconsin-Madison, Madison, WI, USA
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, Pages 1969–1984
https://doi.org/10.1145/2723372.2723713

Published: 27 May 2015

ABSTRACT

Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
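
The idea of pushing gradient computations through a key-foreign key join can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes a logistic loss with labels in {-1, +1}, a fact table `S` of `(label, features, foreign_key)` rows, and a dimension table `R` mapping each key to its feature vector. All names (`grad_materialized`, `grad_factorized`, `S`, `R`, `W_S`, `W_R`) and the toy data are hypothetical. The first function computes one batch gradient over the join output, where each `R` tuple's features are revisited once per matching `S` row. The second computes the same gradient in three passes (over `R`, then `S`, then `R`) and touches each `R` tuple's features only twice, however many `S` rows reference it.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def grad_materialized(S, R, w_s, w_r):
    """Logistic-loss gradient computed row by row over the join output."""
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for y, x_s, fk in S:
        x_r = R[fk]  # the join repeats x_r for every S row that references fk
        scale = -y / (1.0 + math.exp(y * (dot(w_s, x_s) + dot(w_r, x_r))))
        for j, v in enumerate(x_s):
            g_s[j] += scale * v
        for j, v in enumerate(x_r):
            g_r[j] += scale * v
    return g_s, g_r


def grad_factorized(S, R, w_s, w_r):
    """Same gradient, without materializing the join."""
    # Pass 1 over R: one partial inner product per dimension tuple.
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}

    # Pass over S: finish each inner product with a lookup, accumulate the
    # S-side gradient, and sum the scalar factors per foreign key.
    g_s = [0.0] * len(w_s)
    scale_sum = {rid: 0.0 for rid in R}
    for y, x_s, fk in S:
        scale = -y / (1.0 + math.exp(y * (dot(w_s, x_s) + partial[fk])))
        for j, v in enumerate(x_s):
            g_s[j] += scale * v
        scale_sum[fk] += scale

    # Pass 2 over R: each tuple contributes once, weighted by its summed factor.
    g_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, v in enumerate(x_r):
            g_r[j] += scale_sum[rid] * v
    return g_s, g_r


# Hypothetical toy data: two dimension tuples, five fact rows.
R = {1: [1.0, 0.5], 2: [0.0, 2.0]}
S = [(1, [0.2], 1), (-1, [1.5], 1), (1, [0.7], 2), (-1, [0.1], 2), (1, [0.9], 1)]
W_S, W_R = [0.3], [-0.2, 0.4]

if __name__ == "__main__":
    print(grad_materialized(S, R, W_S, W_R))
    print(grad_factorized(S, R, W_S, W_R))
```

Both functions should agree up to floating-point rounding. The factorized version trades the repeated work on `R` features for two per-key dictionaries, which hints at why, as the abstract notes, it is not always the fastest choice and a cost-based decision is needed.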

Index Terms

1. Information systems
  1. Data management systems

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015, 2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372
General Chair: Timos Sellis (RMIT University, Australia)
Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery, New York, NY, United States
Author Tags
analytics
feature engineering
joins
machine learning
Qualifiers
- research-article
Acceptance Rates
SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%
Overall acceptance rate: 785 of 4,003 submissions, 20%