Pattern - Hadoop HDFS Filestore

Features

  • A distributed filestore optimised for very large, multi-terabyte files.
  • A central name node to manage indexing.
  • Some support for redundancy in the name node; otherwise log replay can be used to bring a rebooted node back up to speed.
  • Some security.
  • A shell in Hadoop for access to the filesystem.
  • Rack-aware, when configured with a correct model of the datacentre.
  • Scatters replicated blocks across multiple disks and machines for redundancy.
  • Runs on top of existing filesystems, such as Linux ext3.
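
The rack-aware scattering of blocks can be illustrated with a small simulation. This is only a sketch of the general idea (first copy on the writing node, further copies on a different rack), not the code HDFS itself runs; the `place_replicas` function and the dictionary-based datacentre model are hypothetical names invented for this example.

```python
import random


def place_replicas(topology, writer, replication=3, rng=None):
    """Choose the nodes that hold the replicas of one block.

    topology: dict mapping rack name -> list of node names
              (a hypothetical model of the datacentre).
    writer:   name of the node writing the block.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError("writer %r is not in the topology" % writer)

    # First replica: the writing node itself.
    chosen = [writer]

    # Second replica: a node on a different rack, so that losing a
    # whole rack does not lose the block.
    remote_racks = [r for r in topology if r != rack_of[writer] and topology[r]]
    if len(chosen) < replication and remote_racks:
        remote_rack = rng.choice(remote_racks)
        chosen.append(rng.choice(topology[remote_rack]))

        # Third replica: a different node on that same remote rack.
        if len(chosen) < replication:
            peers = [n for n in topology[remote_rack] if n not in chosen]
            if peers:
                chosen.append(rng.choice(peers))

    # Any remaining replicas: random nodes that are not yet used.
    spare = [n for n in rack_of if n not in chosen]
    rng.shuffle(spare)
    while len(chosen) < replication and spare:
        chosen.append(spare.pop())
    return chosen


if __name__ == "__main__":
    datacentre = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(datacentre, "n1"))
```

If the datacentre model is wrong (for example, every node reported as being in one rack), the sketch falls through to the random choice at the end and the rack-level protection is lost, which is why the model has to be correct.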

Advantages

  • Designed to scale to very large files.
  • Eliminates the need for RAID file storage (except for the name node and its logs).
  • Java-based, so the code can be modified by skilled Java developers.
  • Tested in farms of hundreds of nodes.

Disadvantages

  • Performance impact/overhead from Java code.
  • Optimised for large, bulk data storage, not small files.
  • Security still very minimal.
  • Still evolving.
  • No OS integration/native file APIs.
  • The name node is a single point of failure (SPOF).
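
The cost of small files comes from the central name node, which has to index every file and every block. The sketch below estimates that load for the same data stored as one large file and as many small ones. The 64 MB block size and the figure of roughly 150 bytes of name node memory per file or block are assumptions chosen for illustration, not measured values; check them against the cluster's configuration.

```python
# Assumed values for illustration only.
BLOCK_SIZE = 64 * 1024 * 1024   # assumed HDFS block size in bytes
BYTES_PER_OBJECT = 150          # assumed name node memory per file/block entry


def namenode_objects(file_sizes):
    """Count the entries the name node must track: one per file, one per block."""
    total = 0
    for size in file_sizes:
        blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE  # integer ceiling
        total += 1 + blocks
    return total


def namenode_bytes(file_sizes):
    """Rough name node memory estimate under the assumptions above."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_big = [1024 ** 3]              # 1 GiB in a single file
    many_small = [1024 ** 2] * 1024    # the same 1 GiB as 1024 files of 1 MiB
    print(namenode_objects(one_big), namenode_bytes(one_big))
    print(namenode_objects(many_small), namenode_bytes(many_small))
```

Under these assumptions the single 1 GiB file needs 17 entries, while the same data as 1 MiB files needs 2048, about 120 times the indexing load for no extra data.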

SmartFrog support

The [sf-hadoop] components can be used to deploy an HDFS cluster. They also include a set of workflow components that can manipulate the filesystem and copy files in and out before and after submitting work to a Hadoop Job Tracker.
