Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

Johnson, David B.; Zwaenepoel, Willy

Active / Technical Report | Accession Number: ADA222056

Abstract:

In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino effect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work unifies existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.
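The idea of a unique maximum recoverable system state can be illustrated with a small sketch. This is not the algorithm from the paper; it is a naive fixed-point computation under simplifying assumptions that the abstract does not state: each process's execution is split into numbered state intervals, interval 0 depends on nothing, every interval up to a per-process index `max_stable` can be restored from logged messages and checkpoints, and `dep[i][k][j]` records the highest interval of process `j` that interval `k` of process `i` depends on. All names here are hypothetical.

```python
def max_recoverable_state(dep, max_stable):
    """Naive illustration only, not the paper's efficient algorithm.

    dep[i][k][j]  -- highest state interval of process j that interval k
                     of process i depends on (0 means "initial state only").
    max_stable[i] -- highest interval of process i assumed restorable
                     from its message log and checkpoints.
    Returns one interval index per process.
    """
    n = len(dep)
    # Start from the most optimistic candidate: every process at its
    # highest stable interval.
    state = list(max_stable)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Process i may not depend on an interval of process j
                # that lies beyond what j is restored to; roll i back
                # until the dependency is satisfied.
                while dep[i][state[i]][j] > state[j]:
                    state[i] -= 1
                    changed = True
    return state


# Hypothetical two-process history:
#   P0 interval 1 starts on a message sent from P1 interval 1.
#   P1 interval 2 starts on a message sent from P0 interval 1.
dep = [
    [[0, 0], [1, 1], [2, 1]],  # dependency vectors of P0 intervals 0..2
    [[0, 0], [0, 1], [1, 2]],  # dependency vectors of P1 intervals 0..2
]

print(max_recoverable_state(dep, [2, 1]))
print(max_recoverable_state(dep, [2, 0]))
```

In the second call nothing from P1 beyond its initial state is assumed stable, so P0 must be rolled back past every interval that depends on P1. Because the search starts from the highest stable intervals and only moves down when a dependency forces it, the result is the greatest state that satisfies all dependencies, which mirrors the uniqueness claim in the abstract.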

Author(s):

Johnson, David B. ; Zwaenepoel, Willy

Author Organization(s):

RICE UNIV HOUSTON TX

Supplementary Note:

Published in Proceedings of the 7th Symposium on Principles of Distributed Computing, pp. 171-181, August 1988.

Pagination:

0010

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Contract/Grant Number(s):

N00014-88-K-0140, NSF-DCR85-11436

Subject Terms

Joint Capability Areas:

JCA_6_Net Centric; JCA_6.1_Information Transport; JCA_6.2.1_Information Sharing; JCA_6.2.2_Computing Services; JCA_6.2_Enterprise Services

Communities of Interest:

No COI(s) Identified

Descriptor(s):

*MESSAGE PROCESSING, *INFORMATION RETRIEVAL, *DISTRIBUTED DATA PROCESSING, MODELS, FAULTS, REASONING, EFFICIENCY, TOLERANCE, OPTIMIZATION, RECOVERY, ALGORITHMS, DISTRIBUTION

Field(s)/Group(s):

Computer Systems

Report Date:

1988 Aug 01