Fault Tolerance

  • Definition Fault:

Defect of a system that may cause an error.

  • Definition Error:

Illegal system state.

  • Definition Failure:

Occurs when an error reaches the service interface of a system, resulting in system behavior that is inconsistent with its specification.
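
The chain fault → error → failure can be illustrated with a small sketch. `BoundedCounter` and its specification are hypothetical and exist only for this illustration: the missing bounds check is the fault, the out-of-range internal value is the error, and returning that value to a caller is the failure.

```python
class BoundedCounter:
    """Hypothetical component. Specification: 0 <= value <= limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.value = 0

    def increment(self) -> None:
        # FAULT: the bounds check is missing (a defect in the code).
        # It stays dormant as long as value < limit.
        self.value += 1

    def has_error(self) -> bool:
        # ERROR: the internal state is illegal with respect to the spec.
        return not (0 <= self.value <= self.limit)

    def read(self) -> int:
        # Service interface: if an erroneous value is returned here,
        # the error has become a FAILURE.
        return self.value


counter = BoundedCounter(limit=2)
counter.increment()
counter.increment()
fault_dormant = not counter.has_error()    # fault present, state still legal
counter.increment()
error_present = counter.has_error()        # fault activated: illegal state
failure = counter.read() > counter.limit   # error reaches the caller
```

Note that the fault exists from the start, but the system only fails once the erroneous state is observed through `read()`.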

Faults #

Internal vs External Fault #

internal fault
fault of a component
external fault
failure propagated from another component or from outside the system

Fault categories #

Physical
Malfunction of hardware
Design
Flaw introduced at design time; can be in hardware or software
Interaction
Occurs during operation, caused by the environment (operator error, radiation, …)

Failures #

Classifications of failures #

Domain
value / timing
Level of control
signalled / unsignalled
Consequences
minor / catastrophic

Tolerance #

Fault tolerant system #

  • Never enters a failure state
  • Errors may occur but never reach the service boundary
  • Requires some kind of redundancy
  • Needs 3 steps:
    1. Error detection
    2. Error analysis
    3. Recovery
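
One common way to combine redundancy with the three steps above is majority voting over replicated components. The following is a minimal sketch under that assumption, not the only possible design; `vote`, `square`, and `faulty_square` are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on x and return the majority result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A crashed replica is a signalled failure of that component;
            # it simply does not get a vote.
            continue

    if not results:
        raise RuntimeError("no replica produced a result")

    # 1. Error detection: count how often each result occurs;
    #    disagreement among replicas reveals an error.
    value, votes = Counter(results).most_common(1)[0]

    # 2. Error analysis: is there still a strict majority to trust?
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority; the error cannot be masked")

    # 3. Recovery: mask the error by returning the majority value.
    return value


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # hypothetical design fault in one replica


result = vote([square, square, faulty_square], 4)  # the faulty replica is outvoted
```

With three replicas this masks a single faulty one. If no majority remains, the sketch raises an exception, which turns the situation into a signalled failure instead of returning a wrong value silently.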

Fail-controlled system #

  • Allows recovery from failure using special protocols
  • e.g. it goes into a state from which corrective actions can be taken by an external controller
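
A minimal sketch of this idea, assuming a hypothetical service that halts in a safe state on failure and waits for an external controller to call `recover()`. The class name, the states, and the division used as a workload are all illustrative assumptions.

```python
from enum import Enum, auto


class State(Enum):
    OPERATIONAL = auto()
    SAFE = auto()  # halted, waiting for the external controller


class FailControlledService:
    """Hypothetical service that fails in a controlled way."""

    def __init__(self) -> None:
        self.state = State.OPERATIONAL

    def handle(self, request: int) -> int:
        if self.state is State.SAFE:
            raise RuntimeError("service halted; awaiting external recovery")
        try:
            return 100 // request
        except ZeroDivisionError:
            # Do not continue in an unknown state: enter the safe state
            # and signal the failure to the caller.
            self.state = State.SAFE
            raise RuntimeError("failure signalled; entered safe state") from None

    def recover(self) -> None:
        """Called by the external controller after corrective action."""
        self.state = State.OPERATIONAL
```

The service does fail, unlike a fault-tolerant one, but the failure is signalled and leaves it in a known state from which the controller can restore operation.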

Fault prevention #

  • Keep faults from being introduced or occurring in the first place
  • e.g. coding standards, firewalls, radiation shielding, …

Fault removal #

  • Eliminate fault possibilities:
    • Validation (check the spec)
    • Verification (prove the system adheres to the spec)
    • Debugging (find and eliminate faults)
October 22, 2023