Software Upgrades in Distributed Systems



Progress Report: July 1, 2002‹December 31, 2002


Barbara Liskov




Project Overview


With the help of support from NTT, we have been working on two projects: support for automatic software upgrades in distributed systems, and work on Byzantine fault tolerance.


Software Upgrades:

Our work on software upgrades is aimed at solving the problem of how to upgrade large scale distributed systems that are intended to provide continuous service over a very long lifetime. Upgrades are needed to correct software errors, to improve performance, or to change system behavior, e.g., to support new features.

Upgrades must propagate automatically: for the systems of interest, it is impractical (or impossible) for a human to control upgrades, e.g., by doing a remote login to a node from some centralized management console and installing the upgrade, taking care of any problems as they arise. Furthermore, upgrades must not interrupt service: upgrades must be installed without bringing the system down.

However, upgrades cannot happen instantaneously. It isn't possible to freeze the system and then cause all nodes to upgrade while the system is frozen. For one thing, this would take too long, since at any moment some nodes may be powered down or not communicating. Furthermore, the system designer may not be willing to have an upgrade happen all at once. Instead, providing continuous service may require that just a few nodes upgrade at a time. Also, the designer may want to try out the upgrade on a few nodes to see that the new code is working satisfactorily before upgrading other nodes.

Upgrades might correct errors or extend interfaces. However, they might also change interfaces in incompatible ways: a node might no longer support some behavior or store some state that it used to. Incompatible changes are the most difficult to deal with because they can lead to long periods when nodes that need to communicate assume different interfaces. Yet because of the requirement for continuous service, the system must continue to run (possibly in a somewhat degraded mode) even when such incompatibilities exist.


Byzantine Fault Tolerance:

This project is aimed at developing algorithms and implementation techniques to build practical Byzantine-fault-tolerant systems, that is, systems that work correctly even when some components are faulty and exhibit arbitrary behavior. We believe that these systems will be increasingly important in the future because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbitrary behavior.

We developed a new replication algorithm, BFT, that can be used to build highly available systems that tolerate Byzantine faults. This work shows, for the first time, how to build Byzantine-fault-tolerant systems that can be used in practice to implement real services because they do not rely on unrealistic assumptions and they perform well. BFT works in asynchronous environments like the Internet incorporates mechanisms to defend against Byzantine-faulty clients, and recovers replicas proactively. The recovery mechanism allows the algorithm to tolerate any number of faults over the lifetime of the system provided fewer than 1/3 of the replicas become faulty within a small window of vulnerability. The window may increase under a denial-of-service attack but the algorithm can detect and respond to such attacks and it can also detect when the state of a replica is corrupted by an attacker.

BFT has been implemented as a generic program library with a simple interface. However, this library requires all replicas to run the same service implementation and to update their state in a deterministic way. Therefore, it cannot tolerate deterministic software errors that cause all replicas to fail concurrently and it complicates reuse of existing service implementations because it requires extensive modifications to ensure identical values for the state of each replica.

To overcome this, we developed a replication technique, BASE (BFT with Abstract Specification Encapsulation), that corrects these problems. This technique is based on the concepts of abstract specification and abstraction function from work on data abstraction. We start by defining a common abstract specification for the service, which specifies an abstract state and describes how each operation manipulates the state. Then we implement a conformance wrapper for each distinct implementation to make it behave according to the common specification. The last step is to implement an abstraction function (and one of its inverses) to map from the concrete state of each implementation to the common abstract state (and vice versa).

BASE reduces the cost of Byzantine fault tolerance because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or non-deterministic service implementations, which reduces the probability of correlated failures.

BFT and BASE assume that each replica holds a complete copy of the service state. We are also looking at the problem of applying Byzantine fault tolerance to larger services, where the entire service state cannot be stored at each replica. In such a setting it would be desirable to add more nodes to the system as increasing load or storage needs require it, and it is likely that nodes will fail and need to be evicted from the system. This is similar to what happens in a peer-to-peer system, and like these systems we would like to be able to self-organize our system in the presence of a dynamic membership with essentially zero manual intervention. This motivates our new system called Rosebud.


Progress Through December 2002


Software Upgrades:

We have defined an architecture to support automated upgrades. Our approach:

·       provides an automatic way to control the scheduling of node upgrades

·       enables the system to provide service when nodes are running at different versions

·       provides a way to upgrade nodes from one version to the next

To support upgrade scheduling, we provide scheduling functions (SFs): procedures that run on nodes and let them coordinate the timing of their upgrades. SFs are defined by the upgrader and support a wide range of behaviors, from upgrading nodes in rapid succession to fix a critical bug to upgrading just a few nodes a day to minimize service disruption. SFs can also communicate with a centralized upgrade database to enable an administrator to monitor and control upgrade progress.

To support communication between nodes running at different versions, we provide simulation objects (SOs): adapters that convert calls from one version to the next higher or lower version. SOs let nodes support not only past versions' behavior (i.e., backward compatibility) but also future versions' behavior (i.e., forward compatibility). This is vital for asynchronous upgrades, since upgraded nodes may make calls on non-upgraded ones, and vice versa. And unlike the adapters used in previous work, SOs can contain their own state and so can implement behaviors that are outside the scope of a node's current version.

To support the actual upgrades of nodes, we provide transform functions: procedures that convert a node's state from one version to the next.

Our upgrade architecture has the following components. A logically centralized upgrade server publishes upgrades for download. An upgrade database lets human operators monitor and control upgrade progress and also lets nodes coordinate global upgrade schedules. Per-node upgrade managers subscribe to the upgrade server to learn about new upgrades and install them on nodes. Per-node upgrade layers gossip with their counterparts on other nodes to learn about new upgrades and handle cross-version calls using simulation objects. The upgrade layer is transparent to applications, so that developers can write their software as if every node in the system were running the same version.


Byzantine Fault Tolerance:

We have presented a preliminary design for Rosebud. This design resembles existing peer-to-peer systems: nodes are assigned random node IDs, and data is partitioned among the participants based on those IDs. Rosebud also provides a simple read/write interface that is common to the existing distributed hash tables that are built on top of current peer-to-peer overlays.

We differ from peer-to-peer distributed hash tables in that our algorithms to determine the current membership of the system and to store and retrieve data are resilient to arbitrary failures of a subset of the nodes.

Our system has a hybrid architecture, consisting of a set of servers ­ not unreliable client machines that participate in the system intermittently ­ acting as the peer-to-peer nodes, and a configuration service (CS). The CS is responsible for computing the current configuration (including removing faulty peer-to-peer nodes from the system), and informing the peer-to-peer nodes about the current configuration.


Research Plan for the Next Six Months


Software Upgrades:

We plan to start work on an implementation of our architecture. In addition we plan to work on correctness conditions for simulation objects and transform functions.


Byzantine Fault Tolerance:

We have started to work on the implementation of Rosebud, which should be finished in the next six months. We intend to deploy and test the system on a wide-area testbed (such as PlanetLab).