Cooperative Computing in Dynamic Environments

Cooperative Computing in Dynamic Environments

MIT9904-12

Progress Report: July 1, 2002ãDecember 31, 2002

Nancy Lynch and Idit Keidar

Project Overview

The Theory of Distributed Systems group at MIT pursues research on the development of models, analysis and verification methods, and algorithms for distributed systems. Much of its recent work focuses on systems that are highly dynamic, where participants may join and leave the system and may change location. Moreover, the network topology may change, and components may fail and recover. We address the problems brought about by such difficult environments

(1) by developing formal modeling and analysis techniques based on interacting state machines; and

(2) by developing useful ``building blocks'' for dynamic systems---definitions of global services and efficient algorithms to implement them.

The work on formal modeling and analysis involves extending the basic I/O automaton model to support new features such as dynamic process creation, mobility, timing, and continuous behavior. It includes the development of new methods for analyzing performance and fault-tolerance properties. We are also developing the high-level distributed programming language IOA and a toolkit to facilitate the design, analysis, and verification of systems using the techniques developed as a result of our theoretical work. The work on services and algorithms focuses on high-level communication and data-management services, including services that support dynamic reconfiguration.

Progress Through December 2002

(1) Formal modeling and analysis:

During the past six months, we have continued our work on implementing static checks related to composition in ioaCheck, the front end for the IOA toolset. We have spelled out in detail, and are now implementing, both these

checks and the syntactic transformations involved in expanding the definition of a composite I/O automaton into the definition of an equivalent primitive I/O automaton.

A major part of our recent effort has been devoted to reducing the amount of human interaction required to discover and prove interesting properties of distributed systems. To this end, we have been investigating ways to combine executions (e.g., as performed by the IOA simulator), dynamic program analysis (e.g., as performed by the Daikon invariant detector [Ernst]), and automated deduction (e.g., as performed by the Larch and Isabelle theorem provers) to increase what the proof tools in the IOA toolset can accomplish automatically.

To use Daikon, we enhanced the IOA simulator to generate traces of state variables during simulation, and we extended Daikon to generalize over the values observed in these IOA traces in the same way it formerly generalized over traces of C, C++, Java, and Perl programs. Daikon's generalization mechanism uses an efficient generate-and-test algorithm to winnow a set of possible properties (e.g., of the form $x \le y$ or $x \in S$) and to report those it tests to a sufficient degree without falsifying them.

Current proof tools often require significant human input, particularly in the form of detailed lemmas, to establish their goals. Several experiments indicate that the IOA simulator and Daikon can discover facts that theorem provers can prove automatically and use as lemmas. These experiments involved establishing invariants and/or implementation correctness of four distributed algorithms: the Dijkstra and Peterson mutual exclusion algorithms, an algorithm for ensuring memory atomicity in the presence of distributed caches, and the Paxos algorithm for achieving distributed consensus. Although the correctness results are not new, the experiments show that a combination of dynamic invariant detection and automated deduction can be just as automatic as model checking. Furthermore, application of these techniques is not limited to finite-state systems, as is model checking.

Dilsun Kirli Kaynar presented a first paper [KCDGLNR0] on this topic, coauthored with Chefter, Dean, Garland, Lynch, Ne Win, and Ramirez, at a special Tools Day held in conjunction with the CONCUR '02 conference. This paper describes how to use the IOA simulator to test a purported implementation relation: the simulator executes a low-level implementation automaton and, given a proposed correspondence between its steps and those of a higher-level specification automaton, generates and checks an execution of the higher-level automaton. The paper also describes a preliminary experiment in which a checked implementation relation and step correspondence was used to construct a formal proof, using the Larch Prover, of the correctness of an implementation of the Dijkstra mutual exclusion algorithm.

Toh Ne Win presented a second paper [NeWinEGKL03:VMCAI] on this topic, co-authored with Ernst, Garland, Kirli, and Lynch, at VMCAI '03. This paper described recent extensions to Daikon and the remaining three experiments, in which Daikon discovered the mutual exclusion property for the Peterson algorithm and all the lemmas required by either the Larch Prover or Isabelle to prove that property. In these experiments, Daikon also discovered the key lemma required to prove that a caching algorithm implemented an atomic shared memory, and it discovered four of the six lemmas needed to prove that the high level design of the Paxos algorithm satisfied the specification for consensus.

For these experiments, we have developed an interface between IOA and the Isabelle theorem prover. This interface is based on the earlier interface between IOA and the Larch Prover, about which Andrej Bogdanov presented a paper at FORTE '02, and on work done by Chris Luhrs last summer [Luhrs2002]. The attraction in using Isabelle is that it supports programmable proof tactics, which we intend to use to further reduce the amount of guidance humans must supply during semi-automated proofs.

(2) Algorithms for dynamic distributed systems:

Bar-Joseph, Keidar, and Lynch presented their work on dynamic atomic broadcast [BKL02a] [BKL02b] at DISC'02. In this work, they introduce a new problem of atomic broadcast in a dynamic setting where processes may join, leave voluntarily, or fail (by stopping) during the course of computation. They also present and analyze a new algorithm for its solution. The algorithm exhibits constant message delivery latency in the absence of failures, even during periods when participants join or leave. When failures occur, the latency bound is linear in the number of actual failures. These bounds improve upon previously suggested algorithms solving similar problems in the context of view-oriented group communication. Their algorithm uses a solution to a variation on the standard distributed consensus problem, in which participants do not know a priori who the other participants are. They define the new problem, which they call Consensus with Unknown Participants, and give an early-stopping algorithm to solve it.

Lynch, Shvartsman, and several students and co-workers, have continued their work on algorithms for emulating atomic read/write shared objects in dynamic network settings. Lynch and Shvartsman completed their analysis of their original RAMBO algorithm (this stands for Reconfigurable Atomic Memory for Basic Objects), finished a technical report on the results [LS02b], and wrote and presented a DISC paper at DISC'02 [LS02a]. The algorithm replicates each object at several network locations. To ensure atomicity, it performs reads and writes using "quorum configurations", each of which consists of a set of members plus sets of read-quorums and write-quorums. The algorithm is reconfigurable: the quorum configuration is allowed to change during computation, and such changes do not cause violations of atomicity. The algorithm tolerates processor and link failures.

RAMBO performs three major activities, all concurrently:

(1) reading and writing the objects,

(2) choosing new configurations and notifying members, and

(3) identifying and removing ("garbage-collecting") obsolete configurations.

The algorithm is composed of two sub-algorithms: a main algorithm, which handles reading, writing, and garbage-collection, and a reconfiguration algorithm, which handles the selection and dissemination of new configurations.

Atomicity, holds for arbitrary patterns of asynchrony. Performance properties depend on particular failure and timing assumptions. In particular, if participants gossip periodically in the background, if garbage-collection is scheduled periodically, if reconfiguration is not requested too frequently, and if quorums of active configurations do not fail, then read and write operations complete within time proportional to the maximum message latency.

Seth Gilbert has recently improved RAMBO significantly by introducing a new technique for garbage-collecting old configurations concurrently; when several old configurations pile up in the original RAMBO algorithm, it garbage-collects the old configurations sequentially. The improved algorithm improves both the time for garbage-collection

and the fault-tolerance of the entire system. He has proved that the new algorithm preserves atomicity, and has

proved conditional performance results, including results about situations in which the timing and failure behavior of the underlying system stabilize from some point onward. These results have been submitted to DSN'03.

Two LAN implementations of RAMBO are in progress; Peter Musial, a PhD student working with Shvartsman at U. Conn., has completed one implementation and is currently conducting experiments. Matt Bachmann, an MEng students at MIT, is working on an alternative implementation. Both students are using this implementation as a case study leading to strategies for generating real distributed code from IOA programs.

Lynch and Stoica have recently designed an algorithm to implement a fault-tolerant overlay network for a dynamically-changing wide-area network. Our algorithm, called "MultiChord", is based on the Chord algorithm of

Karger et al., but improves upon it by

(1) making the joining protocol heavier-weight, essentially bringing a participant up-to-date before releasing any information about that participant to the rest of the system, and

(2) adding extra redundancy features for fault-tolerance. We are currently both simulating and analyzing the resulting design.

Rui Fan has just completed his MS thesis [Fa02], which contains a network implementation of atomic objects that separates the handling of the data from the quorum management. The result is a substantial improvement in communication cost, over typical algorithms that do not make this separation. He has also managed to prove two lower bound results that say that some of the remaining costs of his algorithm (the need for readers to write, the need to keep many copies of files) are inherent.

We have begun studying algorithms for new environments involving networks of sensors. Problems we have begun considering include topology control algorithms based on limiting power consumption, clock synchronization, and tracking and routing. One paper, on topology control [Hajiaghayi-etal], has so far appeared.

During a visit to MIT during this reporting period, Dr. Tadashi Araragi began discussion with Nancy Lynch on two problems: global snapshots in a dynamic setting and leader election. We hope that this will lead to future collaborations.

Keidar and Rajsbaum presented a tutorial at PODC'02, on the performance of fault-tolerant consensus algorithms in synchronous failure-free runs [KR01]. This also includes a new lower bound proof. Also at PODC'02, Keidar and Bakr presented their work on performance evaluation of distributed algorithms deployed in a widely distributed setting over the Internet using TCP [BK02]. They considered four popular algorithms for implementing a simple communication primitive corresponding to an all-to-all communication round. Their main observation is that message loss has a large impact on algorithm running times, which causes leader-based algorithms to outperform decentralized algorithms in most cases.

Recently, Keidar and Bakr have carried out a similar study for a second communication primitive, which propagates information from a quorum of hosts to a quorum of hosts. The results of this study are described in Bakr's Master thesis [Bakr03].

Livadas continued his work on analyzing and comparing reliable multicast protocols. During this reporting period, he completed his analysis of the standard Scalable Reliable Multicast (SRM) protocol, by removing an earlier constraint that hosts neither leave the multicast group nor crash. He also analyzed his new CESRM protocol,

which is a caching-enhanced version of SRM. He proved that CESRM is a correct implementation of his specification for a reliable multicast service. He also proved a bound on the time for CESRM to recover a lost message, conditioned on certain reasonable timeliness assumptions. This analysis shows that, in cases when the expedited recovery occurs, the latency is only about one fourth of that of un-enhanced SRM. By analyzing real IP multicast traces, he has shown that expedited recoveries occur about one third of the time. Livadas also presented his earlier results on SRM at FORTE'02 [LL02].

Research Plan for the Next Six Months

(1) Formal modeling and analysis:

In the next six months, we expect to finish the implementation of static checks related to composition in the IOA front end and to finish a prototype implementation of the IOA code generator tool.

Other plans for the next six months include finishing a preliminary interface between IOA and the Isabelle theorem prover, enhancing that interface to generate proof tactics appropriate for proofs of invariants and implementation relations, enhancing the interface between IOA and Daikon to suggest additional lemmas for use in these proofs, and evaluating the effectiveness of these techniques in reducing the amount of human interaction required to discover and prove interesting properties of distributed algorithms.

We also plan to begin designing extensions to the IOA language and toolset for specifying and reasoning about timing behavior.

(2) Algorithms for dynamic distributed systems:

In the next six months, we will continue our work on RAMBO and its extensions. We plan to analyze the performance of RAMBO in more cases, especially situations in which the timing and failure behavior stabilize from some point onward. We will work on more algorithmic improvements and optimizations, including limiting the amount of communication, avoiding the second phase of read operations and the first phase of write operations in some cases, choosing good configurations, pre-releasing values of reads, and providing backup strategies for when quorums fail. We will continue our work on implementations and experiments. We will consider new implementations targeted to mobile settings (such as Oxygen) and peer-to-peer settings (such as Chord).

We plan to complete our simulation and analysis of MultiChord, carry out experiments, and compare the experimental and theoretical results. We will also explore the possibility of building RAMBO and similar data-management services on top of MultiChord and similar overlay networks, essentially using the overlay networks to suggest appropriate configurations.

We will expand our work on networks of sensors, focusing on problems of time synchronization, and tracking/routing. We will attempt to understand, from a theoretical point of view, the stack of layers that are needed to build effective systems for such platforms.

Livadas will experiment with CESRM using simulation techniques. He will also model another protocol, the LMS-based reliable multicast protocol of Papadopoulos et al., and will analyze its correctness and performance.

Other problems we are planning to consider, which may be of particular interest to our NTT collaborators, include problems of resource allocation, global snapshots, and leader election in highly dynamic networks. We hope to develop closer collaborations, for example, with Dr. Manabe on resource allocation and with Dr. Araragi on the other two problems.

(2) Algorithms for dynamic distributed systems:

including limiting the amount of communication, avoiding the second phase of read operations and the first

phase of write operations in some cases, choosing good configurations, pre-releasing values of reads, and providing backup strategies for when quorums fail. We will continue our work on implementations and experiments. We will consider new implementations targeted to mobile settings (such as Oxygen) and peer-to-peer settings (such as Chord).

data-management services on top of MultiChord and similar overlay networks, essentially using the overlay networks to suggest appropriate configurations.

References

[Bakr03] Omar Bakr. Performance Evaluation of Distributed Algorithms over the Internet. Master of Engineering in Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, February 2003.

[BGL02] Andrej Bogdanov, Stephen Garland, and Nancy A. Lynch. Mechanical Translation of I/O Automaton Specifications into First-Order Logic. In Doron Peled, Moshe Y. Vardi, editors, Formal Techniques for Networked and Distributed Systems - FORTE 2002 (Proceedings of the 22nd IFIP WG 6.1 International Conference,

Houston, Texas, USA, November 11-14, 2002), volume 2529 of Lecture Notes in Computer Science, pages 364-368, Springer 2002.

[BK02] Omar Bakr and Idit Keidar. Evaluating the Running Time of a Communication Round over the Internet. Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC '02), Monterey, CA, USA, July 2002.

[BKL02b] Ziv Bar-Joseph and Idit Keidar and and Nancy Lynch. Real-Time Dynamic Atomic Broadcast. In D. Malkhi, editor, Distributed Computing (Proceedings of the 16th international Symposium on DIStributed Computing (DISC) October 2002, Toulouse, France), volume 2508 of Lecture Notes in Computer Science, pages 1-16, 2002. Springer-Verlag.

[BKL02a] Ziv Bar-Joseph and Idit Keidar and and Nancy Lynch. Real-Time Dynamic Atomic Broadcast. Technical Report MIT-LCS-TR-840, MIT Laboratory for Computer Science, Cambridge, MA, April 2002.

[Ernst] Michael Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically Discovering Likely Program Invariants to Support Program Evolution. IEEE Transactions on Software Engineering, 27(2):1-25, 2001.

[Fa02] Rui Fan. Efficient Replication of Large Data Objects. Masters Thesis, MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, February 2003.

[Hajiaghayi-etal] M. T. Hajiaghayi, M. Bahramgiri, and V. S. Mirrokni. Fault-tolerant and 3-Dimensional distributed topology control algorithms in wireless multi-hop networks. Proceedings of the 11th IEEE International Conference on Computer Communications and Networks (IC3N), pages 392-398, October 14-16, 2002, Miami,

Floria. Also, MIT Technical Report MIT-LCS-TR-862, Cambridge, MA 02139, 2002.

[KCDGLNR02] Dilsun Kirli, Anna Chefter, Laura Dean, Stephen Garland, Nancy Lynch, Toh Ne Win, and Antonio Ramirez. Simulating Nondeterministic Systems at Multiple Levels of Abstraction, Tools Day held in conjunction with CONCUR '02, Brno, Czech Republic, August 2002.

[KR01] Idit Keidar and Sergio Rajsbaum. On the Cost of Fault-Tolerant Consensus When There Are No Faults -- A Tutorial. MIT Technical Report MIT-LCS-TR-821, May 24 2001. Preliminary version in SIGACT News 32(2), Distributed Computing column, pages 45-63, June 2001 (published in May 15th).

[LL02] Carolos Livadas and Nancy A. Lynch. A Formal Venture into Reliable Multicast Territory. In Doron Peled, Moshe Y. Vardi, editors, Formal Techniques for Networked and Distributed Systems -

FORTE 2002 (Proceedings of the 22nd IFIP WG 6.1 International Conference, Houston, Texas, USA, November 11-14, 2002), volume 2529 of Lecture Notes in Computer Science, pages 146-161, Springer 2002.

[Luhrs2002] Chris Luhrs. Technical Memo (available at http://www.theory.lcs.mit.edu/tds/ioa), 2002.

[LS02a] Nancy Lynch and Alex Shvartsman, RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks. In D. Malkhi, editor, Distributed Computing (Proceedings of the 16th International Symposium on DIStributed Computing (DISC) October 2002, Toulouse, France), volume 2508 of Lecture Notes in Computer

Science, pages 173-190, 2002. Springer-Verlag.

[LS02b] Nancy Lynch and Alex Shvartsman. RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks. Technical Report MIT-LCS-TR-856, MIT Laboratory for Computer Science, Cambridge, MA, 2002.

[LSV02] Nancy Lynch and Roberto Segala and Frits Vaandraager. Hybrid I/O Automata. To appear in Information and Computation. Also, Technical Report MIT-LCS-TR-827d, MIT Laboratory for Computer Science, Cambridge, MA 02139, January 13, 2003.

[NeWinEGKL03:VMCAI] Toh Ne Win, Michael D. Ernst, Stephen J. Garland, Dilsun K. Kaynar, and Nancy Lynch. Using simulated execution in verifying distributed algorithms. Proceedings of Fourth International Conference on Verification, Model Checking and Abstract Interpretation (VMCAI'03), pages 283-297, Courant Institute of Mathematical Sciences, New York University, New York, January 2003. Also, to appear in Lecture Notes in Computer Science, Springer-Verlag.