Checkpointing and Rollback Recovery in Distributed Shared Memory Systems
ILLINOIS UNIV AT URBANA CENTER FOR RELIABLE AND HIGH-PERFORMANCE COMPUTING
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of the high frequency of communication. In this paper we show that, because of information redundancy, not all message-passing dependences need to be considered to roll back to a consistent state in DSM systems, which reduces both dependency tracking overhead and the potential for rollback propagation. We develop a model of execution in which client processes running an application interact atomically with a set of shared-memory server processes on every access to shared data. We show that under this model, dependences are significantly reduced relative to the message-passing model. We use results from simulation with multiprocessor address traces to demonstrate the reduction in dependences.
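To illustrate why tracking fewer dependences limits rollback propagation, here is a minimal Python sketch. It is not the paper's model or algorithm: it assumes a simplified setting in which each recorded dependence is a pair (source, dependent) since the last checkpoints, and a failure forces every transitively dependent process to roll back. The function name `rollback_set`, the process names, and both dependence lists are hypothetical; in particular, which dependences may safely be dropped is determined by the paper's analysis and is not derived here.

```python
from collections import defaultdict, deque


def rollback_set(dependences, failed):
    """Return every process that must roll back when `failed` rolls back.

    `dependences` is an iterable of (source, dependent) pairs, meaning that
    `dependent` has observed state produced by `source` since the last
    checkpoints (a simplifying assumption for this sketch).
    """
    dependents = defaultdict(set)
    for source, dependent in dependences:
        dependents[source].add(dependent)

    # Breadth-first search over the dependence graph from the failed process.
    must_roll_back = {failed}
    queue = deque([failed])
    while queue:
        proc = queue.popleft()
        for nxt in dependents[proc]:
            if nxt not in must_roll_back:
                must_roll_back.add(nxt)
                queue.append(nxt)
    return must_roll_back


# Hypothetical dependence sets: the second omits one dependence, purely to
# show the effect of tracking fewer dependences on rollback propagation.
tracked_all = [("A", "B"), ("B", "C"), ("D", "A")]
tracked_fewer = [("A", "B"), ("D", "A")]

print(sorted(rollback_set(tracked_all, "A")))    # ['A', 'B', 'C']
print(sorted(rollback_set(tracked_fewer, "A")))  # ['A', 'B']
```

In this toy example, removing a single tracked dependence shrinks the set of processes dragged into the rollback, which is the qualitative effect the abstract attributes to the reduced dependence set.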
- Computer Hardware