Pattern - Hadoop HDFS Filestore
Features
- A distributed filestore optimised for very large, multi-terabyte files.
- A central name node to manage indexing.
- Some support for redundancy in the name node; otherwise log replay can be used to bring a rebooted node back up to speed.
- Some security.
- A shell in Hadoop for access to the filesystem.
- Rack-aware, when configured with a correct model of the datacentre.
- Scatters replicated blocks across multiple disks and nodes for redundancy.
- Runs on top of existing filesystems, such as Linux ext3.
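As an illustration of the shell access listed above, here is a minimal sketch that builds `hadoop fs` command lines from Python. It assumes the `hadoop` launcher is on the PATH of the machine that eventually runs the commands. The helper names (`hadoop_fs`, `run`) and every path shown are hypothetical, chosen only for this example.

```python
import subprocess


def hadoop_fs(*args, executable="hadoop"):
    """Build the argument list for a `hadoop fs` shell command."""
    return [executable, "fs", *args]


def run(argv):
    """Run a prepared command and return its standard output.

    Only call this on a machine with a configured Hadoop client.
    """
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


# Hypothetical paths, for illustration only.
ls_cmd = hadoop_fs("-ls", "/data")                            # list a directory
put_cmd = hadoop_fs("-put", "local.log", "/data/local.log")   # copy a file in
get_cmd = hadoop_fs("-get", "/data/results", "results")       # copy results out
```

Building the argument list separately from running it keeps the commands easy to log or inspect before anything touches the cluster.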
Advantages
- Designed to scale to very large files.
- Eliminates the need for RAID file storage (except for the name node and its logs).
- Java-based code can be modified by experienced Java developers.
- Tested in farms of hundreds of nodes.
Disadvantages
- Performance impact/overhead from Java code.
- Optimised for large, bulk data storage, not small files.
- Security still very minimal.
- Still evolving.
- No OS integration or native file APIs.
- The name node is a single point of failure (SPOF).
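A rough worked example shows why small files are a poor fit. The name node has to index every file and every block, so the same volume of data costs far more index entries when it is split into many small files. The sketch below assumes a 64 MiB block size (the value is configurable, so treat it as an assumption) and counts one entry per file plus one per block. The function name `namenode_objects` is hypothetical.

```python
BLOCK_SIZE = 64 * 1024**2  # assumed block size of 64 MiB; configurable in practice
MIB = 1024**2
TIB = 1024**4


def namenode_objects(file_size, file_count, block_size=BLOCK_SIZE):
    """Count the file entries plus block entries the name node must index."""
    # Ceiling division: a file always occupies at least one block entry.
    blocks_per_file = max(1, -(-file_size // block_size))
    return file_count * (1 + blocks_per_file)


# One terabyte stored as a single file versus as one-megabyte files.
one_big_file = namenode_objects(TIB, 1)
many_small_files = namenode_objects(MIB, TIB // MIB)

print(one_big_file)       # 16385
print(many_small_files)   # 2097152
```

Under these assumptions, the small-file layout needs 128 times as many index entries for the same amount of data, and all of them are held by the single name node.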
SmartFrog support
The [sf-hadoop] components can be used to deploy an HDFS cluster. They also include a set of workflow components that can be used to manipulate the filesystem, and to copy files in and out before and after submitting work to a Hadoop Job Tracker.