Accession Number:

ADA279664

Title:

System Level Fault Tolerance in Parallel and Distributed Computing Systems

Descriptive Note:

Final rept. 1 Jul 1988-31 Dec 1993

Corporate Author:

TEXAS UNIV AT AUSTIN

Personal Author(s):

Report Date:

1993-12-31

Pagination or Media Count:

15.0

Abstract:

Our effort focused on the theory and practice of responsive, fault-tolerant, real-time computing in parallel and distributed processing environments. New, efficient methods of system testing have been developed that shorten multiprocessor testing time by orders of magnitude and can therefore be used at system boot, where previous techniques were prohibitively long. A new design framework for responsive computing was designed and is being implemented for validation. This framework is based on consensus, which can be used to provide synchronization, reliable communication, fault diagnosis, checkpointing, and even scheduling in multiprocessor environments. We have formalized and quantified the space-time tradeoff for efficient fault recovery. The system model is a graph, and we were especially successful in the analysis of meshes and hypercubes. We developed a new method called naturally redundant algorithms, which allows efficient implementation of application-specific techniques

Subject Categories:

  • Computer Programming and Software
  • Computer Hardware
  • Computer Systems

Distribution Statement:

APPROVED FOR PUBLIC RELEASE