research-article

IronFleet: proving practical distributed systems correct

Published: 04 October 2015

ABSTRACT

Distributed systems are notorious for harboring subtle bugs. Verification can, in principle, eliminate these bugs a priori, but verification has historically been difficult to apply at full-program scale, much less distributed-system scale.

We describe a methodology for building practical and provably correct distributed systems based on a unique blend of TLA-style state-machine refinement and Hoare-logic verification. We demonstrate the methodology on a complex implementation of a Paxos-based replicated state machine library and a lease-based sharded key-value store. We prove that each obeys a concise safety specification, as well as desirable liveness requirements. Each implementation achieves performance competitive with a reference system. With our methodology and lessons learned, we aim to raise the standard for distributed systems from "tested" to "correct."


Published in

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN: 9781450338349
DOI: 10.1145/2815400

Copyright © 2015 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SOSP '15 paper acceptance rate: 30 of 181 submissions, 17%. Overall acceptance rate: 131 of 716 submissions, 18%.
