Single Point of Failure

The Single Point of Failure: SPOF

A SPOF is not something that should be designed into any high-availability distributed system. Sometimes it turns out to be in there anyway. For example, someone may rent multiple redundant data lines out of a data centre, only to find that they all share the same piece of fibre under some distant field; when a farmer cuts that cable, all the "redundant" links go down.
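One way to catch the shared-fibre problem early is to record which physical segments each rented line traverses and look for overlap. The following is a minimal sketch, assuming such an inventory can be obtained from the carriers; the line names, segment names and the `shared_segments` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: each rented line mapped to the physical
# segments it traverses (illustrative names, not real data).
LINK_SEGMENTS = {
    "line-a": ["dc-exit-north", "field-fibre-7", "carrier-pop-1"],
    "line-b": ["dc-exit-south", "field-fibre-7", "carrier-pop-2"],
}


def shared_segments(link_segments):
    """Return {segment: sorted links} for every segment used by ALL links."""
    users = defaultdict(set)
    for link, segments in link_segments.items():
        for segment in segments:
            users[segment].add(link)
    all_links = set(link_segments)
    return {seg: sorted(links) for seg, links in users.items() if links == all_links}


for segment, links in shared_segments(LINK_SEGMENTS).items():
    print(f"{segment} is shared by {', '.join(links)}")
```

Any segment this reports is a hidden SPOF: the lines are only as redundant as the paths they do not share.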

Every piece of software and hardware in a server farm could be a point of failure, so each needs to be examined to see what the consequences of its failure would be. Think also about devices that fail in different ways. An Ethernet cable could be bent such that its error rate goes up, even though it is still connected. The network would get slower, but the underlying cause would be hard to track down, especially remotely. A printer in the same subnet could suddenly flood the network with ARP requests (it has been known to happen!). A hardware RAID controller could start (silently) corrupting the data it writes back.

Features

  • A single part of the system whose failure makes the entire system unavailable.
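The definition above can be tested mechanically on a model of the system: remove each component in turn and check whether a client can still reach the service. This is only a sketch under stated assumptions; the `TOPOLOGY` map and the function names are hypothetical, and a real deployment would also have to model power, cooling and other shared dependencies.

```python
from collections import deque

# Hypothetical topology: undirected connections between components.
TOPOLOGY = {
    "client": ["lb"],
    "lb": ["client", "web1", "web2"],
    "web1": ["lb", "db"],
    "web2": ["lb", "db"],
    "db": ["web1", "web2"],
}


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, ignoring the removed node."""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(graph, start, goal):
    """Components (other than the endpoints) whose loss cuts start off from goal."""
    return sorted(
        node
        for node in graph
        if node not in (start, goal) and not reachable(graph, start, goal, node)
    )


print(single_points_of_failure(TOPOLOGY, "client", "db"))
```

In this toy model only the load balancer is reported, because either web server can fail without cutting the path. The endpoints are excluded from the check, so the lone database would need to be examined separately.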

Advantages

  • Cheaper to roll out (less redundancy).
  • Easier to manage (no need to keep redundant parts synchronised).
  • When a SPOF fails, you can usually track it down.

Disadvantages

  • Reduced system availability.
  • Possibly reduced system performance, even while available.
  • Some failures, especially networking and hardware problems, are impossible to track down remotely.
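The availability cost can be estimated with the standard series and parallel formulas. The sketch below assumes component failures are independent, which the shared-fibre example shows is not always true, and the 0.99 figure is purely illustrative.

```python
from math import prod


def series_availability(parts):
    """All parts must work: availabilities multiply."""
    return prod(parts)


def parallel_availability(parts):
    """Any one part suffices: the system fails only if every part fails."""
    return 1.0 - prod(1.0 - a for a in parts)


router = 0.99  # illustrative availability of a single router

print(f"one router:         {router:.4f}")
print(f"two in parallel:    {parallel_availability([router, router]):.4f}")
print(f"two chained SPOFs:  {series_availability([router, router]):.4f}")
```

Under these assumptions a redundant pair fails only when both units are down at once, while every additional SPOF placed in series lowers the availability of the whole system.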

SmartFrog support

We don't like SPOFs in our systems, except in low-cost home installations. Even there, we encourage everyone to keep a spare WiFi/firewall/router to hand for emergencies, or to know the location of their nearest cafe with free WiFi. We use Anubis for fault tolerance. Even then we have to worry about the networking, NTP clock synchronisation, power, aircon failures...
