research-article

Sparkler: supporting large-scale matrix factorization

Authors:

Boduo Li, University of Massachusetts Amherst, Amherst, MA

Sandeep Tata, IBM Almaden Research Center, San Jose, CA

Yannis Sismanis, IBM Almaden Research Center, San Jose, CA

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology, March 2013, Pages 625–636, https://doi.org/10.1145/2452376.2452449

Published: 18 March 2013

ABSTRACT

Low-rank matrix factorization has recently been applied with great success to matrix completion problems in applications such as recommendation systems, link prediction for social networks, and click prediction for web search. However, as this approach is applied to increasingly large datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorization. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent, an iterative technique that has been shown to perform very well in practice, as an approach to solving the factorization problem. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and address them by introducing a novel abstraction called "Carousel Maps" (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally importantly, we show that this can be done without making programs any harder to write. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
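As an illustration of the kind of computation the abstract refers to, the following is a minimal single-machine sketch of stochastic gradient descent for low-rank matrix factorization. It is not Sparkler's implementation and does not model Carousel Maps or any distribution across a cluster. The function names (`sgd_factorize`, `rmse`), the squared-error loss with L2 regularization, and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are all assumptions chosen for illustration.

```python
import random


def sgd_factorize(ratings, n_rows, n_cols, rank=2, lr=0.01, reg=0.02,
                  epochs=200, seed=0):
    """Approximate a sparse matrix V by W * H^T using plain SGD.

    ratings: list of (row, col, value) triples for the observed entries only.
    Returns (W, H): W has n_rows rows, H has n_cols rows, each of length rank.
    Hypothetical helper for illustration; not the paper's code.
    """
    rng = random.Random(seed)
    # Small positive random initialization of both factor matrices.
    W = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(ratings)
    for _ in range(epochs):
        rng.shuffle(entries)  # visit observed entries in random order
        for i, j, v in entries:
            pred = sum(W[i][k] * H[j][k] for k in range(rank))
            err = v - pred
            for k in range(rank):
                w, h = W[i][k], H[j][k]
                # Gradient step on the regularized squared error of one entry.
                W[i][k] += lr * (err * h - reg * w)
                H[j][k] += lr * (err * w - reg * h)
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error of the factorization on the given entries."""
    total = 0.0
    for i, j, v in ratings:
        pred = sum(a * b for a, b in zip(W[i], H[j]))
        total += (v - pred) ** 2
    return (total / len(ratings)) ** 0.5


if __name__ == "__main__":
    # Toy data: observed entries of a 3x3 rank-1 matrix (made up for this sketch).
    data = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0),
            (1, 2, 2.0), (2, 1, 6.0), (2, 2, 3.0)]
    W, H = sgd_factorize(data, n_rows=3, n_cols=3)
    print("training RMSE:", rmse(data, W, H))
```

Each update touches only one row of `W` and one row of `H`, which is what makes it plausible to process blocks of the matrix that share no rows or columns in parallel. How Sparkler partitions the data and moves the factors between nodes is described in the paper itself, not in this sketch.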

Index Terms

Sparkler: supporting large-scale matrix factorization
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
March 2013
793 pages
ISBN: 9781450315975
DOI: 10.1145/2452376
General Chair: Giovanna Guerrini, Università di Genova, Italy
Program Chair: Norman W. Paton, University of Manchester, UK
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
iterative data processing
matrix factorization
recommendation
scalability
spark
Acceptance Rates
Overall Acceptance Rate: 7 of 10 submissions, 70%