Checkpointing and Error Recovery in Distributed Systems

Active / Technical Report | Accession Number: ADA093463

Abstract:

This paper discusses some of the problems of producing fault-tolerant distributed computer systems, in particular those of software error recovery. It shows how checkpoints may be used in error recovery, defines the information that checkpoints must contain, and discusses alternative strategies for checkpointing. It describes models of error recovery and extends an existing recovery protocol to cater for certain types of checkpoint inconsistencies. The paper defines protocols for systematically generating checkpoints so that they can be used by the recovery protocols. It also defines a protocol for discarding checkpoints when they are no longer of use, which prevents the set of checkpoints from growing indefinitely. The paper concludes by considering some of the problems of implementing the protocols. (Author)

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:
Approved For Public Release

RECORD

Collection: TR