Saturday, 1 November 2014

Fault tolerance

Fault tolerance is a design concept that recognizes that all computer-based systems will fail eventually. The question is whether a system as a whole can be designed to “fail gracefully.” This means that even if one or more components fail, the system will continue to operate according to its design specifications, even if its speed or throughput must decrease.

Methods and Implementations

There are a number of ways to make a system more fault tolerant. Individual components such as hard drives can be composed of multiple units so that the remaining units can take over if one fails (see also RAID). If each key component has at least one backup, then there should be time to replace a failed primary before the backup also fails.
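
The backup idea can be illustrated with a minimal Python sketch. The `Replica` class, the `ReplicaFailed` exception, and `read_with_failover` are hypothetical names invented for this example; they do not describe how any particular RAID implementation works, only the general pattern of falling back to a redundant unit.

```python
class ReplicaFailed(Exception):
    """Raised by a (hypothetical) replica that can no longer serve reads."""


class Replica:
    """Toy stand-in for one redundant unit, such as one drive in a mirror."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = dict(data)
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise ReplicaFailed(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Return (value, replica_name) from the first replica that answers."""
    failed = []
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except ReplicaFailed:
            # Remember the failure and move on to the next redundant unit.
            failed.append(replica.name)
    raise RuntimeError(f"all replicas failed: {failed}")


# The primary has failed, so the read is served by the backup.
primary = Replica("primary", {"block0": b"data"}, healthy=False)
backup = Replica("backup", {"block0": b"data"})
value, served_by = read_with_failover([primary, backup], "block0")
```

In this sketch the caller still gets its data while the primary is down, which is the window in which the failed unit would be replaced.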


Another way to achieve fault tolerance is to provide multiple paths to successful completion of the task. In fact, this is how packet-switched networks like the Internet work (see TCP/IP). If one communications link is down or too congested, packets are given an alternative routing.
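
As a rough illustration of alternative routing, the sketch below searches a tiny four-node network for a path that avoids links marked as down. It is a toy breadth-first search over an assumed topology, not the routing protocols that real Internet routers run; `find_route`, the node names, and the `down` parameter are all made up for the example.

```python
from collections import deque


def find_route(links, source, destination, down=frozenset()):
    """Breadth-first search for a path that avoids the links listed in `down`."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbor in links.get(node, ()):
            link = frozenset((node, neighbor))  # links are undirected
            if neighbor not in visited and link not in down:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no usable path remains


# Hypothetical network: A reaches D through either B or C.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
normal = find_route(links, "A", "D")
rerouted = find_route(links, "A", "D", down={frozenset(("A", "B"))})
```

With the A to B link marked down, the search returns the path through C instead. If every path is cut, the function returns `None`, which is the point at which redundancy has run out.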


Fault diagnosis software can also play an important role, both in determining how to respond to a problem (beyond any automatic response) and in providing data that will be useful later to system administrators or technicians. Some fault diagnosis systems use elaborate rules (see expert systems) to pinpoint the cause of a fault and recommend a solution.
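
A rule-based diagnosis step might look something like the following sketch. The symptom names, thresholds, causes, and recommendations are all hypothetical and chosen only to show the shape of the approach; a real expert system would use a far richer rule base and inference method.

```python
# Each rule is (condition, probable cause, recommendation).
# All symptom names and thresholds here are invented for illustration.
RULES = [
    (lambda s: s.get("disk_read_errors", 0) > 100,
     "failing disk", "replace the drive and rebuild the array"),
    (lambda s: s.get("packet_loss_pct", 0) > 20,
     "degraded network link", "reroute traffic and inspect the link"),
    (lambda s: s.get("cpu_temp_c", 0) > 90,
     "overheating", "check fans and airflow"),
]


def diagnose(symptoms):
    """Return (cause, recommendation) for every rule the symptoms satisfy."""
    return [(cause, advice) for condition, cause, advice in RULES
            if condition(symptoms)]


findings = diagnose({"disk_read_errors": 250, "cpu_temp_c": 60})
```

The returned list doubles as a record for administrators: it can be logged as is, even when the system has already responded automatically.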

The amount of fault tolerance to be provided for a system depends on a number of factors:

•  How important is it that the system not fail?
•  How critical is a given component to the operation of the system?
•  How likely is it that a given component will fail? (Mean time between failures, or MTBF)
•  How expensive is it to make the component or system fault tolerant?
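
To make the trade-off between likelihood of failure and cost more concrete, the sketch below uses the standard steady-state approximation availability = MTBF / (MTBF + MTTR). Mean time to repair (MTTR) is not mentioned above and is introduced here as an assumption, as is the simplification that redundant copies fail independently. The figures are illustrative, not measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state fraction of time a single component is working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single, copies):
    """Availability of `copies` redundant components, assuming independent
    failures and that any one working copy is enough."""
    return 1.0 - (1.0 - single) ** copies


# Illustrative numbers only: 10,000 hours between failures, 10 hours to repair.
single = availability(mtbf_hours=10_000, mttr_hours=10)
pair = redundant_availability(single, copies=2)
```

Under these assumptions, adding one backup cuts expected downtime by roughly three orders of magnitude, which can then be weighed against the cost of the extra component.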

A related concept is fail-safe. While fault tolerance emphasizes continued operation despite one or more failures, fail-safe emphasizes the ability to shut down safely in case of an unrecoverable failure. With computer-based systems, fail-safe design can use redundant systems (as in avionics) to perform calculations, with a failing system “outvoted” if necessary by the good ones. In most cases there should also be a provision to alert the pilot or operator in time to take over operations from the automatic system. Another common example of fail-safe design is found in modern operating systems that keep a “journal” of pending file operations, which can be used to restore the integrity of the file system after a power failure or other abrupt shutdown (see file system).
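
The “outvoting” idea can be sketched as a simple majority vote over the outputs of redundant units. The `majority_vote` function and the sample values are hypothetical; real avionics voters are considerably more involved (for example, they must compare values that differ slightly because of sensor noise).

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of redundant units.

    Raises RuntimeError when no strict majority exists, which is where a
    fail-safe design would stop trusting the automatic system.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fail safe and alert the operator")
    return value


# Three redundant units; the third has drifted and is outvoted.
agreed = majority_vote([42, 42, 41])
```

When the units disagree with no clear majority, the sketch raises an error rather than guessing, mirroring the provision to hand control back to the pilot or operator.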
