ABSTRACT
Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
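The idea of pushing gradient computations through a key-foreign key join can be illustrated with a small sketch. It is not the paper's implementation: it assumes logistic regression with labels in {0, 1}, an entity table S of (label, features, foreign key) rows, and an attribute table R keyed by the foreign key. All names (`grad_materialized`, `grad_factorized`, `S`, `R`, `w_s`, `w_r`) are hypothetical. The first function computes one batch gradient over the joined rows; the second computes the same gradient while touching each R tuple's features only once per pass.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def grad_materialized(S, R, w_s, w_r):
    """Logistic-loss gradient computed row by row over the join output."""
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for y, x_s, fk in S:
        x_r = R[fk]  # join: R's features are re-read for every matching S row
        err = sigmoid(dot(w_s, x_s) + dot(w_r, x_r)) - y
        for j, v in enumerate(x_s):
            g_s[j] += err * v
        for j, v in enumerate(x_r):
            g_r[j] += err * v
    return g_s, g_r


def grad_factorized(S, R, w_s, w_r):
    """Same gradient, with the work on R's features done once per R tuple."""
    # Pass over R: one partial inner product per R tuple.
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}

    # Pass over S: finish the inner product, update S's part of the gradient,
    # and accumulate the scalar error per foreign-key value.
    g_s = [0.0] * len(w_s)
    err_sum = {rid: 0.0 for rid in R}
    for y, x_s, fk in S:
        err = sigmoid(dot(w_s, x_s) + partial[fk]) - y
        for j, v in enumerate(x_s):
            g_s[j] += err * v
        err_sum[fk] += err

    # Second pass over R: scale each R tuple's features by its summed error.
    g_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, v in enumerate(x_r):
            g_r[j] += err_sum[rid] * v
    return g_s, g_r


# Toy data (made up for illustration): two R tuples shared by five S rows.
R = {1: [1.0, 2.0], 2: [0.5, -1.0]}
S = [(1, [0.2], 1), (0, [1.5], 1), (1, [-0.3], 2), (0, [0.7], 2), (1, [1.0], 1)]
w_s, w_r = [0.1], [0.3, -0.2]

gm = grad_materialized(S, R, w_s, w_r)
gf = grad_factorized(S, R, w_s, w_r)
```

On this toy input the two functions agree up to floating-point rounding. The factorized version trades the repeated reads of R's features for an extra pass over R and per-key bookkeeping (`partial`, `err_sum`), which is one intuition for why it is not always the fastest option and why a cost-based choice is needed.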