Pattern - Hadoop running against non-HDFS filesystems
There is no requirement that Hadoop work with HDFS filesystems only. As well as importing/exporting from different filesystems, Hadoop can run with different filestores. As of Hadoop 0.18, that list included kfs and Amazon S3.
Features
- Hadoop is started with the appropriate filesystem driver JARs and any dependencies on the classpath.
- The Hadoop site configuration files are set up with the relevant URI prefix to provider class binding(s).
<property>
<name>fs.example.impl</name>
<value>org.apache.hadoop.fs.example.ExampleFileSystem</value>
<description>Bind the example:</property>
- No HDFS nodes are deployed; no NameNode, DataNode or Secondary NameNode.
- JobTracker and TaskTracker nodes are deployed to use URIs of the relevant prefix as thei
Advantages
- Integrates with existing file systems
- Can access data stored in existing filestores
- Integrating the data with existing applications is easier.
- The file systems may offer different services (such as append, locking, rollback, backups)
- The file systems' security models may use authentication, rather than trust. This will allow Hadoop to be deployed on less secure clusters, or work with more secure data.
- Some file systems offer stronger availability guarantees than HDFS.
- Some file systems are better at storing small files than HDFS, which likes files of 64MB or greater.
Disadvantages
- Requires a new filesystem provide for the specific filestore. Some are bundled with Hadoop, others would have to be written -and tested.
- If the filesystem does not provide locality information, then it is impossible to schedule work close to the data. This will increase network traffic and slow down processing.
- The filesystem may not support new features in HDFS (such as append)
- Hadoop will have been tested less on some of these clusters. You may have to be more self-sufficient.
- HDFS is tested on clusters with many thousands of machines. Other filesystems may top out earlier.