DOI: 10.1145/3297858.3304056
Research Article · Public Access

Phoenix: A Substrate for Resilient Distributed Graph Analytics

Published: 04 April 2019

ABSTRACT

This paper presents Phoenix, a communication and synchronization substrate that implements a novel protocol for recovering from fail-stop faults when executing graph analytics applications on distributed-memory machines. The standard recovery technique in this space is checkpointing, which rolls back the entire computation to a state that existed before the fault occurred. The insight behind Phoenix is that this rollback is unnecessary: it suffices to continue the computation from any state that will ultimately produce the correct result. We show that for graph analytics applications, the necessary state adjustment can be specified easily by the programmer using a thin API supported by Phoenix. Phoenix has no observable overhead during fault-free execution, and it is resilient to any number of faults while guaranteeing that the correct answer is produced at the end of the computation. This is in contrast to other systems in this space, which may either incur overhead even during fault-free execution or produce only approximate answers when faults occur during execution. We incorporated Phoenix into D-Galois, the state-of-the-art distributed graph analytics system, and evaluated it on two production clusters. Our evaluation shows that in the absence of faults, Phoenix is ~24x faster than GraphX, which provides fault tolerance using the Spark system. Phoenix also outperforms the traditional checkpoint-restart technique implemented in D-Galois: in fault-free execution, Phoenix has no observable overhead, while the checkpointing technique incurs 31% overhead. Furthermore, Phoenix generally outperforms checkpointing when faults occur, particularly in the common case when only a small number of hosts fail simultaneously.
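To make the recovery idea concrete, below is a minimal sketch, written in C++ (the implementation language of the Galois family of systems), of the kind of programmer-supplied state-adjustment hook such a thin API might expose, illustrated for single-source shortest path. Every name in this sketch (SsspState, reinit_lost_nodes, kInfinity) is hypothetical; it is not the paper's actual interface.

#include <cstdint>
#include <limits>
#include <vector>

constexpr uint32_t kInfinity = std::numeric_limits<uint32_t>::max();

struct SsspState {
  std::vector<uint32_t> dist; // tentative shortest distance per node
  uint32_t source;            // id of the source node
};

// Hypothetical hook called by the substrate after a fail-stop fault,
// with the ids of the nodes whose state was lost on the failed hosts.
// Resetting lost distances to "infinity" (and the source back to 0)
// restores a state from which relaxation-based fixed-point iteration
// still converges to the exact shortest distances, so the surviving
// hosts never roll back.
void reinit_lost_nodes(SsspState& s, const std::vector<uint32_t>& lost) {
  for (uint32_t n : lost) {
    s.dist[n] = (n == s.source) ? 0u : kInfinity;
  }
}

The property this sketch relies on is that the surviving tentative distances remain valid upper bounds on the true distances, so only the lost state needs re-initialization. Nothing is done during fault-free execution, which is consistent with the paper's claim of no observable overhead in the absence of faults.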


Published in

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019, 1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858
Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ASPLOS '19 paper acceptance rate: 74 of 351 submissions (21%)
Overall acceptance rate: 535 of 2,713 submissions (20%)
