Pattern - Bulk purchase of disks


When you are building a server farm, you contact your usual HDD supplier and ask for 500 high-capacity SATA drives, usually with identical specifications. This gives predictable behaviour across all nodes in the farm: latency, power budget, heat dissipation, seek time and boot time are all similar. Unfortunately, so is something else: disk failure modes. If all disks come from the same factory, on the same production line, from the same batches of raw materials and parts, their lifespans are going to be similar. When used in the same server farm, with similar workloads, heating and power cycles, they are not independent disks that fail at random; their failures are correlated.

Features

  • A server farm is built from a batch order of hard disks, all with the same product number

Advantages

  • One single purchase order to fill in/get signed off.
  • You get a good discount for buying in volume.
  • All the disks are ready for farm assembly.
  • No need to wait/manage purchases from different suppliers.
  • A single batch of disks has similar latency patterns, capacity, costs.
  • Disk-intensive applications will behave identically on all machines (assuming the motherboards, memory and CPU are also all consistent).
  • Filesystem build/check times will be similar, so time to boot is relatively consistent.

Disadvantages

  • Disks from the same batch tend to fail at around the same time. A percentage will fail early, in the first three months. After that, the rest should last for a few years. When one or two start failing, it is a warning that the entire batch is nearing end of life.
  • If you build your RAID arrays from the same batch, there is an increased risk of multiple drives in an array failing together, and hence loss of data, especially in the first few months.
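
One way to reduce the RAID risk is to mix batches within each array. The following is a minimal sketch, not a description of any particular RAID tool: it assumes each disk is known by a (serial, batch_id) pair, and the names assign_to_arrays and max_same_batch are hypothetical helpers invented for this illustration. Disks are grouped by batch and then dealt out round-robin, so same-batch disks are spread across as many arrays as possible.

```python
from collections import Counter


def assign_to_arrays(disks, n_arrays):
    """Deal disks out round-robin after grouping them by batch.

    disks: list of (serial, batch_id) tuples (hypothetical inventory format).
    Returns a list of n_arrays lists of (serial, batch_id).
    """
    if n_arrays < 1:
        raise ValueError("n_arrays must be at least 1")
    # Sorting by batch makes same-batch disks adjacent, so the round-robin
    # deal below sends consecutive same-batch disks to different arrays.
    ordered = sorted(disks, key=lambda d: d[1])
    arrays = [[] for _ in range(n_arrays)]
    for i, disk in enumerate(ordered):
        arrays[i % n_arrays].append(disk)
    return arrays


def max_same_batch(array):
    """Largest number of disks in one array that share a batch."""
    counts = Counter(batch for _, batch in array)
    return max(counts.values(), default=0)


# Example: two batches of four disks spread over four two-disk arrays.
disks = [(f"A{i}", "batch-A") for i in range(4)]
disks += [(f"B{i}", "batch-B") for i in range(4)]
arrays = assign_to_arrays(disks, n_arrays=4)
```

With this layout a batch of k disks contributes at most ceil(k / n_arrays) disks to any one array. Whether that is enough depends on how many simultaneous failures the chosen RAID level tolerates.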

Disks need to be tracked so that all HDDs from a specific batch run can be located. Plan for their EOL, and treat the first failures a year or two in as an early warning: it is time to think about replacing the whole batch.
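
The tracking described here can be sketched as a small inventory keyed by batch. This is an illustration under stated assumptions, not an existing tool: the BatchTracker class, its method names, and the two thresholds (90 days for early failures, two later failures as the warning level) are all hypothetical values chosen to match the figures mentioned in the text.

```python
from collections import defaultdict
from datetime import date

INFANT_DAYS = 90    # assumption: the "first three months" of early failures
WARN_FAILURES = 2   # assumption: "one or two" later failures trigger a warning


class BatchTracker:
    """Records which disks belong to which batch and counts late failures."""

    def __init__(self):
        self.disks = {}                         # serial -> (batch_id, host, installed)
        self.failed = set()                     # serials that have failed
        self.late_failures = defaultdict(int)   # batch_id -> failures after infancy

    def add_disk(self, serial, batch_id, host, installed):
        self.disks[serial] = (batch_id, host, installed)

    def record_failure(self, serial, failed_on):
        batch_id, _, installed = self.disks[serial]
        self.failed.add(serial)
        # Early failures are expected; only later ones hint at batch end of life.
        if (failed_on - installed).days > INFANT_DAYS:
            self.late_failures[batch_id] += 1

    def locate(self, batch_id):
        """Return (host, serial) for every disk of the batch still in service."""
        return sorted(
            (host, serial)
            for serial, (batch, host, _) in self.disks.items()
            if batch == batch_id and serial not in self.failed
        )

    def batches_to_replace(self):
        """Batches whose late-failure count has reached the warning level."""
        return sorted(b for b, n in self.late_failures.items() if n >= WARN_FAILURES)
```

Once a batch appears in batches_to_replace(), locate() gives the hosts that still hold disks from it, which is the list needed to plan the replacement.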

Amazon S3 apparently treats new disks with caution, using them for non-critical data until they are bedded in. If they are only used for temporary data storage for the first three months of their life, the cost of failure is reduced.
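
A bedding-in rule of this kind could look like the sketch below. It is not a description of how Amazon S3 works; the 90-day period is taken from the "first three months" above, and the eligible_disks function and its dictionary of install dates are hypothetical names made up for the example.

```python
from datetime import date, timedelta

BED_IN = timedelta(days=90)  # assumption: roughly the first three months


def eligible_disks(disks, critical, today):
    """Return the serials that may hold a piece of data.

    disks: dict mapping serial -> install date (hypothetical inventory format).
    critical: True if the data must not be lost, False for temporary data.
    """
    if not critical:
        # Temporary data can go anywhere, including disks still bedding in.
        return sorted(disks)
    # Critical data only goes to disks that have survived the bed-in period.
    return sorted(s for s, installed in disks.items() if today - installed >= BED_IN)
```

A disk that fails during its bed-in period then only ever held temporary data, which is the reduction in failure cost described above.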

References

  • Pinheiro07 Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Failure Trends in a Large Disk Drive Population, Google Research, Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007)
  • Schroeder07 Bianca Schroeder and Garth A. Gibson, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, Computer Science Department, Carnegie Mellon University, Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007)