Pattern - Hadoop HDFS data stored in a distributed filesystem
This option is discussed intermittently on the hadoop-core-user mailing list: can Hadoop be set up so that all HDFS files are stored in an existing distributed filesystem? This is not the same as running the Hadoop MapReduce layer against a filesystem other than HDFS; that is a different pattern. What is being discussed is bringing up the HDFS filesystem, not on a set of local disks, but on an existing clustered filesystem.
Features
- The namenode is set up to keep its data in a location in the distributed filesystem.
- The datanodes are set to keep their data in a location in the local filesystem.
- On every datanode, that local location is a symbolic link to a different location in the distributed filesystem.
- Similarly, all nodes must be set up (using symbolic links) to write their logs to different locations in the same distributed filesystem.
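The symlink games are there to avoid having a separate hadoop-site.xml file for every single node: every node uses the same configured path, and the symbolic link resolves it to a per-node location. If each node instead mounts a different part of the shared filesystem at that path, the symbolic links are not needed.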
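The per-node symlink setup described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name link_node_dir, the use of the hostname as the per-node directory name, and the example paths are all hypothetical and not taken from any Hadoop tooling. The idea is that every node runs the same script with the same arguments, so the local path named in the shared hadoop-site.xml (for example, whatever the datanode data directory is set to) ends up pointing at a different directory in the distributed filesystem on each node.

```python
import os
import socket
from pathlib import Path


def link_node_dir(local_path, shared_root, hostname=None):
    """Make local_path a symlink to a per-host directory under shared_root.

    local_path  -- the path that every node's shared config refers to
    shared_root -- a directory on the mounted distributed filesystem
    hostname    -- per-node directory name; defaults to this machine's hostname
    Returns the per-host target directory as a Path.
    """
    hostname = hostname or socket.gethostname()

    # Each node gets its own directory on the shared filesystem.
    target = Path(shared_root) / hostname
    target.mkdir(parents=True, exist_ok=True)

    link = Path(local_path)
    if link.is_symlink():
        # Re-running the script is harmless if the link is already correct.
        if Path(os.readlink(link)) == target:
            return target
        raise FileExistsError(f"{link} already links somewhere else")
    if link.exists():
        # Refuse to replace a real directory that may already hold data.
        raise FileExistsError(f"{link} exists and is not a symlink")

    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target, target_is_directory=True)
    return target


if __name__ == "__main__":
    # Hypothetical paths: one call for block data, one for logs.
    link_node_dir("/var/hadoop/dfs/data", "/mnt/shared/hadoop/data")
    link_node_dir("/var/hadoop/logs", "/mnt/shared/hadoop/logs")
```

Keying the target directory on the hostname is just one way to guarantee that no two nodes share a location; any identifier that is unique per node would do.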
Advantages
- Integrates with existing distributed filesystems.
- Some filesystems (e.g. Lustre) have better availability guarantees than HDFS.
- All log data is readable from any machine, even if the original machine has now crashed.
- Diskless machines and machines with small SSD filesystems can be brought into the cluster.
Disadvantages
- All data locality is lost: work can no longer be scheduled next to the disks that hold its data, so network traffic may be much higher. This does not scale to large clusters.
- The scripts to estimate available disk space may not work reliably.
- There is still a namenode, which is a point of failure for the cluster.
- Security is downgraded to that of HDFS; the security features of the underlying filesystem are lost.