DOI: 10.1145/2452376.2452449
research-article

Sparkler: supporting large-scale matrix factorization

Published: 18 March 2013

ABSTRACT

Low-rank matrix factorization has recently been applied with great success to matrix completion problems in applications such as recommendation systems, link prediction for social networks, and click prediction for web search. However, as this approach is applied to increasingly large datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorization. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent, an iterative technique that has been shown to perform very well in practice, as an approach to solving the factorization problem. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and address them by introducing a novel abstraction called "Carousel Maps" (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally importantly, we show that this can be done without compromising ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
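To make the technique named in the abstract concrete, here is a minimal single-process sketch of stratified stochastic gradient descent for low-rank factorization, in the spirit of the distributed SGD the abstract refers to. It is not Sparkler's API and does not model Carousel Maps; the function names (`sgd_on_block`, `rmse`, `factorize`), the parameters (`rank`, `d`, `lr`, `reg`), and the block-assignment rule are all assumptions made for illustration. The matrix is split into a `d` x `d` grid of blocks, and each stratum groups `d` blocks that share no rows or columns, so a distributed system could process them on different workers at once; here they run one after another.

```python
import random

def sgd_on_block(block, W, H, lr, reg):
    """Apply one SGD update per observed (row, col, value) entry in a block."""
    for i, j, v in block:
        err = v - sum(w * h for w, h in zip(W[i], H[j]))
        for k in range(len(W[i])):
            w, h = W[i][k], H[j][k]
            W[i][k] += lr * (err * h - reg * w)
            H[j][k] += lr * (err * w - reg * h)

def rmse(entries, W, H):
    """Root-mean-square error of W * H^T on the observed entries."""
    se = sum((v - sum(w * h for w, h in zip(W[i], H[j]))) ** 2
             for i, j, v in entries)
    return (se / len(entries)) ** 0.5

def factorize(entries, n_rows, n_cols, rank=2, d=2,
              epochs=200, lr=0.05, reg=0.0, seed=0):
    """Stratified SGD, run sequentially (illustrative, not Sparkler's code)."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_cols)]

    # Assign every observed entry to one cell of a d x d grid of blocks.
    blocks = {}
    for i, j, v in entries:
        key = (i * d // n_rows, j * d // n_cols)
        blocks.setdefault(key, []).append((i, j, v))

    for _ in range(epochs):
        for s in range(d):
            # Stratum s holds blocks (b, (b + s) % d): no two of them share
            # a row block or a column block, so they touch disjoint parts of
            # W and H and could be updated in parallel.
            for b in range(d):
                block = blocks.get((b, (b + s) % d), [])
                rng.shuffle(block)
                sgd_on_block(block, W, H, lr, reg)
    return W, H
```

Calling `factorize(entries, n_rows, n_cols)` on a list of `(row, col, value)` triples returns the two factor matrices as lists of lists; `rmse` can then be used to check how well their product fits the observed entries.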

    • Published in

      EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
      March 2013
      793 pages
      ISBN: 9781450315975
      DOI: 10.1145/2452376

      Copyright © 2013 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 7 of 10 submissions, 70%
