ABSTRACT
Low-rank matrix factorization has recently been applied with great success to matrix completion problems in applications such as recommendation systems, link prediction for social networks, and click prediction for web search. However, as this approach is applied to increasingly large datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorization. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. To solve the factorization problem, Sparkler supports distributed stochastic gradient descent, an iterative technique that has been shown to perform very well in practice. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and address them by introducing a novel abstraction called "Carousel Maps" (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally important, we show that this can be done without any loss in ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
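To make the underlying technique concrete, the following is a minimal single-machine sketch of stochastic gradient descent for low-rank factorization of a partially observed matrix. It is not Sparkler's code and it does not model Carousel Maps or any distribution across a cluster. The function names (`sgd_factorize`, `rmse`), the sample data `OBSERVED`, the squared-error loss with L2 regularization, and all hyperparameter values are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical observed cells (row, col, value) of a 3x3 rank-1 matrix;
# two cells are left out to mimic a matrix completion setting.
OBSERVED = [
    (0, 0, 1.0), (0, 1, 2.0), (0, 2, 1.5),
    (1, 0, 2.0), (1, 2, 3.0),
    (2, 1, 6.0), (2, 2, 4.5),
]


def sgd_factorize(ratings, n_rows, n_cols, rank=2, epochs=300,
                  lr=0.02, reg=0.01, seed=0):
    """Approximate V by W * H^T using only the observed entries.

    Assumed loss: squared error plus L2 regularization on both factors.
    Returns W (n_rows x rank) and H (n_cols x rank) as lists of lists.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(ratings)
    for _ in range(epochs):
        rng.shuffle(entries)  # visit observed cells in random order
        for i, j, v in entries:
            pred = sum(W[i][k] * H[j][k] for k in range(rank))
            err = v - pred
            for k in range(rank):
                w, h = W[i][k], H[j][k]
                # Each update touches only row i of W and row j of H.
                W[i][k] += lr * (err * h - reg * w)
                H[j][k] += lr * (err * w - reg * h)
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error of W * H^T on the given observed cells."""
    total = 0.0
    for i, j, v in ratings:
        pred = sum(wk * hk for wk, hk in zip(W[i], H[j]))
        total += (v - pred) ** 2
    return math.sqrt(total / len(ratings))
```

Calling `sgd_factorize(OBSERVED, 3, 3)` and then `rmse(OBSERVED, W, H)` gives the training error of the fitted factors. Because each update reads and writes only one row of each factor, updates for cells that share no row or column do not interfere, which is the property that makes a distributed variant plausible.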
Index Terms
- Sparkler: supporting large-scale matrix factorization