Accession Number:

ADA222075

Title:

Distributed System Fault Tolerance Using Message Logging and Checkpointing

Descriptive Note:

Doctoral thesis

Corporate Author:

RICE UNIV HOUSTON TX DEPT OF COMPUTER SCIENCE

Personal Author(s):

Report Date:

1989-12-01

Pagination or Media Count:

134

Abstract:

Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state.
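
The dependency-vector and dependency-matrix representation mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not taken from the abstract: the names `is_consistent` and `join`, the use of 0 to mean "no dependency", and the consistency rule itself (no process may depend on a state interval of another process later than the interval that process occupies in the same system state). The thesis's own definitions may differ in detail.

```python
from typing import List

# Assumed representation: row i is the dependency vector of process i.
# state[i][i] is process i's own current state interval index;
# state[i][j] is the latest interval of process j that process i depends on
# (0 is used here to mean "no dependency").
Matrix = List[List[int]]


def is_consistent(state: Matrix) -> bool:
    """Assumed rule: no process depends on an interval of process i
    that is later than process i's own interval in this system state."""
    n = len(state)
    for i in range(n):
        for j in range(n):
            if state[j][i] > state[i][i]:
                return False
    return True


def join(a: Matrix, b: Matrix) -> Matrix:
    """Least upper bound of two system states from the same execution:
    for each process, keep the row with the later own interval."""
    return [
        list(a[i]) if a[i][i] >= b[i][i] else list(b[i])
        for i in range(len(a))
    ]


# Process 1 (interval 2) depends on interval 1 of process 0.
ok_state = [[1, 0], [1, 2]]
# Same row for process 1, but process 0 is only at interval 0.
bad_state = [[0, 0], [1, 2]]
```

Under these assumptions, `is_consistent(ok_state)` holds while `is_consistent(bad_state)` does not, and `join` picks, per process, the more advanced of two rows, which is the kind of operation a lattice of system states needs.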

Subject Categories:

  • Computer Programming and Software

Distribution Statement:

APPROVED FOR PUBLIC RELEASE