Distributed File Systems: HDFS, GFS, and Modern Alternatives

Distributed file systems provide the storage substrate for large-scale data infrastructure, enabling thousands of nodes to read and write files as though operating against a single coherent namespace. This page covers the architecture of the two canonical designs — the Google File System (GFS) and the Hadoop Distributed File System (HDFS) — alongside modern alternatives including Ceph, GlusterFS, and cloud-native object storage layers. It addresses how these systems are classified, how their internal mechanics differ, which workload scenarios drive selection, and the decision boundaries that separate one architecture from another.


Definition and scope

A distributed file system abstracts physical storage spread across multiple machines into a unified namespace accessible from any node in a cluster. Unlike local file systems such as ext4 or NTFS, distributed file systems must tolerate node failure, handle concurrent access from hundreds or thousands of clients, and scale capacity and throughput independently. The Apache Software Foundation formally documents HDFS as a fault-tolerant file system designed for commodity hardware, targeting sequential read throughput over low-latency random access.

Four primary categories structure the landscape:

  1. Master-coordinated block file systems — GFS and HDFS use a single master/namenode to manage metadata and direct clients to data nodes holding file blocks.
  2. Distributed object stores — Amazon S3-compatible systems and OpenStack Swift expose a flat object namespace rather than a hierarchical directory tree.
  3. Peer-to-peer parallel file systems — Ceph and Lustre distribute both metadata and data responsibilities across the cluster, eliminating the single-master bottleneck.
  4. Network-attached distributed file systems — NFS and CIFS derivatives provide POSIX-compliant access but do not inherently provide replication or failure recovery.

The boundary between a distributed file system and distributed data storage more broadly depends on whether the system presents a file-oriented interface (hierarchical namespace, file handles, byte-range reads) versus a key-value or record-oriented interface.


How it works

GFS, described in the 2003 SOSP paper by Ghemawat, Gobioff, and Leung at Google, introduced the architecture that HDFS later reimplemented as open source. The core mechanics operate as follows:

  1. Namespace management via a single master — the master holds all file-to-chunk mappings in memory. Clients contact the master once to resolve a file location, then communicate directly with chunk servers for data transfer.
  2. Fixed-size block storage — GFS uses 64 MB chunks; HDFS defaults to 128 MB blocks. Large block sizes reduce metadata overhead and favor sequential scan workloads.
  3. Replication for fault tolerance — each block is replicated across 3 nodes by default in HDFS (Apache HDFS Architecture Guide). Rack-aware placement stores 2 replicas on one rack and 1 replica on a separate rack to survive rack-level failures.
  4. Write-once, append-optimized semantics — GFS and HDFS are optimized for write-once, read-many workloads. Concurrent random writes are not a design target, which constrains applicable workloads.
  5. Heartbeat-based failure detection — data nodes send periodic heartbeats to the namenode (every 3 seconds by default). If no heartbeat arrives within the configured dead-node timeout (roughly ten minutes with default settings), the namenode marks the node dead and triggers block re-replication to restore the target replica count.
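
The heartbeat mechanics above can be sketched in a few lines. This is a simplified illustration, not the real namenode code; the `NamenodeMonitor` class is invented for the example, and the interval and timeout values mirror common HDFS defaults (a 3-second heartbeat interval and a 5-minute recheck interval, giving a dead-node timeout of 2 × recheck + 10 × heartbeat = 630 seconds):

```python
import time

# Simplified sketch of HDFS-style dead-node detection; not the real
# namenode code. Values mirror common HDFS defaults: a 3-second
# dfs.heartbeat.interval and a 5-minute recheck interval, giving a
# dead-node timeout of 2 * recheck + 10 * heartbeat = 630 seconds.
HEARTBEAT_INTERVAL = 3.0
RECHECK_INTERVAL = 300.0
DEAD_TIMEOUT = 2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL  # 630.0

class NamenodeMonitor:
    """Tracks when each datanode last reported in (invented for this example)."""

    def __init__(self):
        self.last_heartbeat = {}  # datanode id -> timestamp (seconds)

    def heartbeat(self, node_id, now=None):
        self.last_heartbeat[node_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        """Nodes whose blocks the namenode would schedule for re-replication."""
        now = time.time() if now is None else now
        return [n for n, t in self.last_heartbeat.items() if now - t > DEAD_TIMEOUT]

monitor = NamenodeMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn2", now=600.0)   # dn2 keeps reporting; dn1 goes silent
print(monitor.dead_nodes(now=700.0))  # ['dn1']: 700 s of silence > 630 s timeout
```

The long timeout is deliberate: re-replicating every block on a node is expensive, so the namenode waits well past a single missed heartbeat before declaring it dead.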

Ceph departs from this design through its CRUSH (Controlled Replication Under Scalable Hashing) algorithm, which computes data placement deterministically without consulting a central metadata server. The Ceph documentation describes how CRUSH eliminates the single point of failure inherent in the HDFS namenode architecture, at the cost of increased operational complexity during cluster rebalancing.
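
CRUSH itself is considerably more involved (hierarchical bucket types, failure-domain rules, weighted devices), but its central property, that any client computes placement from the object name and cluster map alone, can be illustrated with rendezvous (highest-random-weight) hashing, used here as a simplified stand-in:

```python
import hashlib

# Rendezvous (highest-random-weight) hashing: a simplified stand-in for
# CRUSH, sharing its key property: placement is a pure function of the
# object name and the node list, so no metadata-server lookup is needed
# and every client independently computes the same replica set.
def placement(obj_name, nodes, replicas=3):
    def score(node):
        digest = hashlib.sha256(f"{obj_name}:{node}".encode()).hexdigest()
        return int(digest, 16)
    # Rank nodes by their per-object score; the top-k hold the replicas.
    return sorted(nodes, key=score, reverse=True)[:replicas]

nodes = [f"osd{i}" for i in range(8)]
replica_set = placement("pool/object-42", nodes)
print(replica_set)  # the same 3 nodes from any client, with no central lookup
```

Removing or adding a node only remaps objects whose top-k set changes, which is why rebalancing (and CRUSH's real-world analogue of it) moves a bounded fraction of the data rather than reshuffling everything.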

The interaction between distributed file systems and replication strategies governs durability guarantees. Systems that use synchronous replication sacrifice write latency; those using asynchronous replication risk data loss during failure windows, a tension formalized in CAP theorem analysis.
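
That tension can be made concrete with a minimal sketch; the functions below are invented for illustration and do not correspond to any real system's API:

```python
# Illustrative only: invented functions, not any real system's API.
# A record is "durable" on a replica once it lands in that replica's list.

def write_sync(replicas, record):
    """Ack only after every replica stores the record: no loss window,
    but client-visible latency is bounded by the slowest replica."""
    for r in replicas:
        r.append(record)
    return "acked"

def write_async(replicas, record):
    """Ack after the primary alone stores the record: low latency, but
    the record is lost if the primary dies before followers catch up."""
    primary = replicas[0]
    primary.append(record)
    return "acked"  # followers are still stale at this point

primary, follower = [], []
write_async([primary, follower], "record-1")
print(primary, follower)  # ['record-1'] []: the failure window in miniature
```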


Common scenarios

Distributed file systems are deployed across four recurring workload profiles:

Batch analytics pipelines — HDFS remains the dominant substrate for Hadoop MapReduce and Apache Spark jobs processing petabyte-scale datasets. The sequential scan pattern aligns with HDFS's 128 MB block design; a 1 TB dataset requires approximately 8,000 blocks at that size.
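
The arithmetic behind that figure is worth spelling out, together with its metadata implication (a commonly cited rule of thumb is roughly 150 bytes of namenode heap per block object):

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB HDFS default block size, in bytes
DATASET = 1024**4            # 1 TB dataset, binary units

blocks = DATASET // BLOCK_SIZE
print(blocks)  # 8192

# A commonly cited rule of thumb is ~150 bytes of namenode heap per
# block object, so large blocks keep in-memory metadata tractable:
approx_heap_bytes = blocks * 150
print(approx_heap_bytes)  # 1228800, roughly 1.2 MB for the whole terabyte
```

With 4 MB blocks the same terabyte would need 262,144 block objects, which is why small-file-heavy workloads strain the namenode long before they strain disk capacity.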

Machine learning training data storage — large model training runs read multi-terabyte datasets repeatedly. GCS (Google Cloud Storage) and S3-compatible systems are used here because they decouple compute from storage, allowing GPU clusters to scale independently of data capacity.

Shared storage for containerized workloads — Kubernetes environments require persistent volumes accessible from any pod. CephFS and GlusterFS provide ReadWriteMany semantics that local-disk solutions cannot match.

Archival and compliance storage — distributed object stores with erasure coding (a 10+4 erasure code tolerates 4 simultaneous node failures at lower storage overhead than 3× replication) serve long-retention workloads where durability outweighs access latency.
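
The overhead comparison in that parenthetical is straightforward to quantify:

```python
# Raw-storage overhead for the same logical data under the two schemes
# mentioned above. A k+m erasure code stores k data shards plus m parity
# shards and tolerates the loss of any m shards.
k, m = 10, 4
ec_overhead = (k + m) / k      # 1.4x raw bytes per logical byte, survives 4 losses
repl_overhead = 3.0            # 3x replication survives only 2 replica losses

logical_tb = 100.0             # e.g. a 100 TB archive
print(logical_tb * ec_overhead)    # 140.0 TB raw with 10+4 erasure coding
print(logical_tb * repl_overhead)  # 300.0 TB raw with 3x replication
```

The price of the savings is reconstruction cost: rebuilding a lost shard requires reading k surviving shards, which is why erasure coding suits cold data better than hot data.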


Decision boundaries

Selecting among HDFS, Ceph, GlusterFS, and object storage requires matching system characteristics to the scalability, latency, and throughput profile of the target workload across five dimensions:

| Criterion | HDFS | Ceph/CephFS | GlusterFS | Object store (S3-compatible) |
|---|---|---|---|---|
| Metadata scalability | Bounded by namenode RAM | Distributed via MDS cluster | Distributed hashing | Flat namespace, unlimited keys |
| Random write support | Weak | Strong (RADOS layer) | Moderate | Not applicable (PUT/GET only) |
| POSIX compliance | Partial | Full (CephFS) | Full | None |
| Fault tolerance model | Block replication (3×) | Replication or erasure coding | Replication | Erasure coding or replication |
| Primary workload fit | Sequential batch reads | Mixed workloads, block/object | NAS replacement | Object storage, archival |

The namenode single point of failure in pre-HA HDFS configurations was a well-documented operational risk. HDFS High Availability, introduced in Hadoop 2.x, uses two namenodes in an active-standby configuration coordinated via ZooKeeper, reducing unplanned downtime at the cost of additional coordination overhead.

For workloads requiring strong consistency guarantees across concurrent writers, neither HDFS nor GlusterFS is appropriate without additional locking infrastructure. Ceph provides stronger consistency semantics through its RADOS object layer. Cloud object stores vary: Amazon S3, for example, offered only eventual consistency for overwrites until it introduced strong read-after-write consistency in late 2020, a distinction engineers must account for in system design.


