Pattern - Hadoop running against non-HDFS distributed filestores

Pattern - Hadoop running against non-HDFS filesystems

Hadoop is not required to work with HDFS only. As well as importing data from and exporting it to other filesystems, Hadoop can run directly against other filestores. As of Hadoop 0.18, that list included KFS (the Kosmos filesystem) and Amazon S3.

Features

  • Hadoop is started with the appropriate filesystem driver JARs and any dependencies on the classpath.
  • The Hadoop site configuration files bind each relevant URI prefix to its provider class:
    <property>
      <name>fs.example.impl</name>
      <value>org.apache.hadoop.fs.example.ExampleFileSystem</value>
      <description>Bind the example:// prefix to the ExampleFileSystem implementation</description>
    </property>
    
  • No HDFS nodes are deployed; no NameNode, DataNode or Secondary NameNode.
  • JobTracker and TaskTracker nodes are deployed to use URIs of the relevant prefix as thei
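
As a sketch of the settings described above, a site configuration that runs against Amazon S3 instead of HDFS might look like the following. The bucket name and the credential placeholders are hypothetical. The property names are assumed to be those of the Hadoop 0.18-era S3 block filesystem, so check them against the defaults file shipped with the release in use.

    <property>
      <name>fs.default.name</name>
      <value>s3://example-bucket</value>
      <description>Hypothetical bucket used as the default filesystem in place of an hdfs:// URI</description>
    </property>
    <property>
      <name>fs.s3.impl</name>
      <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
      <description>Bind the s3:// prefix to the S3 block filesystem implementation</description>
    </property>
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>

With a configuration of this shape, the JobTracker and TaskTrackers resolve unqualified paths against the S3 bucket, and no NameNode or DataNode has to be running.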

Advantages

  • Integrates with existing file systems
  • Can access data stored in existing filestores
  • Integrating the data with existing applications is easier.
  • The file systems may offer different services (such as append, locking, rollback, backups)
  • The file systems' security models may use authentication, rather than trust. This will allow Hadoop to be deployed on less secure clusters, or work with more secure data.
  • Some file systems offer stronger availability guarantees than HDFS.
  • Some file systems are better at storing small files than HDFS, which likes files of 64MB or greater.

Disadvantages

  • Requires a filesystem provider for the specific filestore. Some are bundled with Hadoop; others would have to be written and tested.
  • If the filesystem does not provide locality information, then it is impossible to schedule work close to the data. This will increase network traffic and slow down processing.
  • The filesystem may not support new features in HDFS (such as append)
  • Hadoop will have been tested less on some of these clusters. You may have to be more self-sufficient.
  • HDFS is tested on clusters with many thousands of machines. Other filesystems may top out earlier.