Application-Level Correctness and its Impact on Fault Tolerance
University of Maryland, College Park, United States
Fundamental to any fault tolerance research is the definition of correct program execution. Traditionally, correct program execution requires architectural state to be numerically perfect. However, in many cases, even if program execution is not 100% numerically correct, it may be completely acceptable as long as the answers satisfy the user's requirements. Faults that cause such numerically imperfect execution are therefore no longer intolerable.

The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find that all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations (e.g., SPECInt CPU2000).

This report investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when correctness is defined at the application level rather than at the architecture level. Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level.
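The distinction between architecture-level and application-level correctness can be illustrated with a minimal Python sketch. It is not the report's methodology: the choice of PSNR as the fidelity metric, the 30 dB acceptance threshold, the run-record layout, and the names `psnr`, `classify`, and `app_level_masking_rate` are all assumptions made for illustration. Each fault-injection run is assumed to be recorded as a pair: whether architectural state matched the fault-free run, and the program output (or `None` for a crash or hang).

```python
import math

def psnr(reference, observed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length sample lists."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, observed)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(peak * peak / mse)

def classify(reference, arch_match, output, threshold_db=30.0):
    """Label one fault-injection run.

    arch_match: True if architectural state was numerically perfect.
    output:     the program's output samples, or None on crash/hang.
    threshold_db is a hypothetical user-acceptability threshold.
    """
    if arch_match:
        return "arch_correct"
    if output is None or len(output) != len(reference):
        return "incorrect"
    if psnr(reference, output) >= threshold_db:
        return "app_correct"  # numerically wrong, but acceptable to the user
    return "incorrect"

def app_level_masking_rate(reference, runs, threshold_db=30.0):
    """Fraction of architecturally incorrect runs that are acceptable
    at the application level."""
    labels = [classify(reference, m, out, threshold_db) for m, out in runs]
    arch_incorrect = [lab for lab in labels if lab != "arch_correct"]
    if not arch_incorrect:
        return 0.0
    return arch_incorrect.count("app_correct") / len(arch_incorrect)

if __name__ == "__main__":
    golden = [100, 120, 140, 160]           # fault-free output (made-up data)
    runs = [
        (True,  [100, 120, 140, 160]),      # fault fully masked
        (False, [100, 121, 140, 160]),      # small deviation in output
        (False, [0, 120, 140, 160]),        # large deviation in output
        (False, None),                      # crash
    ]
    print(app_level_masking_rate(golden, runs))
```

In this sketch, the percentages reported above would correspond to the value returned by `app_level_masking_rate` over a full fault-injection campaign, with the fidelity metric chosen per application.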
- Computer Programming and Software