The Single Point of Failure: SPOF
This is not something that should be designed into any high-availability distributed system. Sometimes it turns out to be in there anyway. For example, someone may rent multiple redundant data lines out of a data centre, only for them to end up sharing the same piece of fibre under some distant field; when a farmer cuts that cable, all the "redundant" links go down. Every piece of software and hardware in a server farm could be a point of failure, so each needs to be examined to see what the consequences of its failure will be.
Think also about devices that fail in different ways. An Ethernet cable could be bent such that its error rate goes up, even though it is still connected. The network would get slower, but the underlying cause of the problem would be hard to track down, especially remotely. A printer in the same subnet could suddenly flood the network with ARP requests (it has been known to happen!). A hardware RAID controller could start (silently) corrupting data it writes back.
Features
Advantages
Disadvantages
SmartFrog support
We don't like SPOFs in our systems, except in low-cost home installations. Even there, we encourage everyone to keep a spare WiFi/firewall/router to hand for emergencies, or to know the location of their nearest cafe with free WiFi. We use Anubis for fault tolerance. Even then we have to worry about the networking, NTP clock synchronisation, power, aircon failures...
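The shared-fibre example can be made concrete with a small audit script. What follows is only a minimal sketch in Python, not SmartFrog or Anubis code: it assumes each "redundant" link can be described as a list of the physical components it depends on, and it flags any component that every link has in common. The function name `shared_components` and all component names are hypothetical, chosen for illustration.

```python
def shared_components(paths):
    """Return the components that every path depends on.

    `paths` maps a link name to the components it passes through.
    Anything in the result is a candidate single point of failure.
    """
    component_sets = [set(components) for components in paths.values()]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


# Hypothetical inventory: two rented lines that share one stretch of fibre.
links = {
    "carrier_a": ["router_a", "line_a", "field_fibre"],
    "carrier_b": ["router_b", "line_b", "field_fibre"],
}

print(sorted(shared_components(links)))  # prints ['field_fibre']
```

A check like this only finds failures that take a component away completely. It would not notice the degraded cable or the quietly corrupting RAID controller, which still need monitoring of error rates and data integrity.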