ZZ and the art of practical BFT execution

Authors:
Timothy Wood, Rahul Singh, Arun Venkataramani, Prashant Shenoy, Emmanuel Cecchet

University of Massachusetts Amherst, Amherst, MA, USA

EuroSys '11: Proceedings of the sixth conference on Computer systems, April 2011, Pages 123–138. https://doi.org/10.1145/1966445.1966457

Published: 10 April 2011

ABSTRACT

The high replication cost of Byzantine fault-tolerance (BFT) methods has been a major barrier to their widespread adoption in commercial distributed applications. We present ZZ, a new approach that reduces the replication cost of BFT services from 2f+1 to practically f+1. The key insight in ZZ is to use f+1 execution replicas in the normal case and to activate additional replicas only upon failures. In data centers where multiple applications share a physical server, ZZ reduces the aggregate number of execution replicas running in the data center, improving throughput and response times. ZZ relies on virtualization, a technology already employed in modern data centers, for fast replica activation upon failures, and enables newly activated replicas to immediately begin processing requests by fetching state on-demand. A prototype implementation of ZZ using the BASE library and Xen shows that, when compared to a system with 2f+1 replicas, our approach yields lower response times and up to 33% higher throughput in a prototype data center with four BFT web applications. We also show that ZZ can handle simultaneous failures and achieve sub-second recovery.
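
As a rough illustration of the replica counts in the abstract, the following Python sketch compares the number of active execution replicas under the 2f+1 baseline and the f+1 normal case. The function names and the four-application, f = 1 scenario are illustrative assumptions, as is the reading that the replicas activated upon failures make up the difference of f per application. The sketch counts replicas only and does not model throughput or response time.

```python
def execution_replicas(f: int, apps: int = 1, reduced: bool = False) -> int:
    """Execution replicas running in the fault-free case.

    reduced=False models the 2f+1 baseline from the abstract;
    reduced=True models the f+1 normal-case count.
    Both names are hypothetical, chosen for this sketch.
    """
    if f < 0 or apps < 0:
        raise ValueError("f and apps must be non-negative")
    per_app = (f + 1) if reduced else (2 * f + 1)
    return apps * per_app


def standby_replicas(f: int, apps: int = 1) -> int:
    """Replicas left inactive until a failure (assumed): (2f+1) - (f+1) = f per app."""
    return execution_replicas(f, apps) - execution_replicas(f, apps, reduced=True)


if __name__ == "__main__":
    f, apps = 1, 4  # assumed scenario: four applications, one tolerated fault each
    print("baseline (2f+1):", execution_replicas(f, apps))
    print("normal case (f+1):", execution_replicas(f, apps, reduced=True))
    print("activated only upon failures:", standby_replicas(f, apps))
```

Under these assumptions, the saving per application grows linearly with f, which is why the aggregate reduction matters most when many applications share the same physical servers.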

Index Terms

Software and its engineering > Software organization and properties > Extra-functional properties > Software fault tolerance

Published in
EuroSys '11: Proceedings of the sixth conference on Computer systems
April 2011
370 pages
ISBN: 9781450306348
DOI: 10.1145/1966445
General Chair: Christoph Kirsch, University of Salzburg, Austria
Program Chair: Gernot Heiser, University of New South Wales, Australia
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
byzantine fault tolerance
data centers
virtualization
Qualifiers
- research-article
Acceptance Rates
EuroSys '11 Paper Acceptance Rate: 24 of 161 submissions, 15%. Overall Acceptance Rate: 241 of 1,308 submissions, 18%.