Types of Failures in Distributed Systems: Byzantine, Crash, and Network

Distributed systems fail in fundamentally different ways depending on whether a node stops responding, responds incorrectly, or the network connecting nodes becomes unreliable. The taxonomy of these failure modes — crash failures, Byzantine failures, and network failures — determines which fault-tolerance strategies are applicable and how much overhead a system must bear to remain correct. Classifying failures precisely is a prerequisite for selecting the right consensus algorithms, replication strategies, and recovery mechanisms. This page defines each failure class, describes its mechanism, maps common scenarios, and establishes the decision boundaries that separate one class from another.


Definition and scope

Failure classification in distributed systems traces its formal structure to research published in the 1980s, most influentially Leslie Lamport, Robert Shostak, and Marshall Pease's 1982 paper "The Byzantine Generals Problem" in ACM Transactions on Programming Languages and Systems. The NIST Computer Security Resource Center (NIST CSRC) and the broader systems research community have since converged on a consistent three-tier taxonomy:

  1. Crash failures (fail-stop failures) — A node halts and ceases all communication. It does not send incorrect data; it simply stops.
  2. Byzantine failures — A node behaves arbitrarily: it may send conflicting, corrupted, or malicious messages to different peers simultaneously.
  3. Network (omission/partition) failures — Messages between nodes are delayed, dropped, reordered, or the network is partitioned such that two subsets of nodes cannot reach each other.

A fourth sub-class, performance failures (also called timing failures), occurs when a node responds correctly but outside its agreed timing bounds. The CAP theorem frames the impossibility result that arises specifically from network partition failures, distinguishing them from node-level failures in terms of system design tradeoffs.

The scope of these definitions extends across cloud-native clusters, peer-to-peer networks, database replication rings, and any architecture where computation spans more than one process boundary. The key dimensions and scopes of distributed systems reference covers how scale, geography, and trust boundaries interact with these failure classes.


How it works

Each failure class propagates through a distributed system by a distinct mechanism, and each demands a different detection and recovery path.

Crash failures are the most tractable. A crashed node sends no further messages. Failure detectors — components described formally in Chandra and Toueg's 1996 work "Unreliable Failure Detectors for Reliable Distributed Systems" (Journal of the ACM) — monitor heartbeat signals. When a node misses a configurable number of consecutive heartbeats, the detector marks it as suspected. Raft and Paxos, the two dominant consensus protocols, both assume crash-failure models and require a majority quorum of ⌊N/2⌋ + 1 nodes to make progress, meaning a 5-node cluster tolerates exactly 2 crash failures before losing availability (quorum-based systems covers quorum math in full).
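The heartbeat-and-quorum mechanism described above can be sketched as follows. This is an illustrative model, not a production detector; the `HeartbeatDetector` class and `quorum` helper are hypothetical names introduced here:

```python
import time

class HeartbeatDetector:
    """Minimal crash-failure detector sketch: a peer that has gone
    `threshold` heartbeat intervals without being heard from is
    marked as suspected."""

    def __init__(self, interval_s=1.0, threshold=3):
        self.interval_s = interval_s
        self.threshold = threshold
        self.last_seen = {}  # peer -> timestamp of last heartbeat

    def heartbeat(self, peer, now=None):
        # Record a heartbeat; `now` is injectable for testing.
        self.last_seen[peer] = now if now is not None else time.monotonic()

    def suspected(self, peer, now=None):
        now = now if now is not None else time.monotonic()
        last = self.last_seen.get(peer)
        if last is None:
            return True  # never heard from: suspect it
        return (now - last) > self.threshold * self.interval_s

def quorum(n):
    """Majority quorum: floor(N/2) + 1."""
    return n // 2 + 1
```

With `quorum(5) == 3`, a 5-node cluster keeps making progress with up to 2 crashed nodes, matching the tolerance stated above.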

Byzantine failures are fundamentally harder. A Byzantine node may selectively respond to some peers and not others, send different values to different nodes, or replay stale messages. Detecting Byzantine behavior requires that nodes cross-check responses from peers: the canonical result, from the 1982 Lamport et al. paper, is that tolerating f Byzantine faults requires at least 3f + 1 total nodes. A cluster tolerating 1 Byzantine node therefore requires a minimum of 4 nodes. Byzantine Fault Tolerant (BFT) consensus protocols such as PBFT (Practical Byzantine Fault Tolerance, Castro and Liskov, 1999, USENIX OSDI) carry significantly higher message complexity — O(n²) per consensus round versus O(n) for crash-tolerant protocols.
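The cross-checking idea admits a minimal sketch: in a cluster sized 3f + 1, a value reported identically by at least f + 1 peers must have come from at least one correct node, so a reader can accept it. The `accept_value` function below is a hypothetical helper that ignores authentication, retries, and message ordering:

```python
from collections import Counter

def accept_value(replies, f):
    """Accept a value only if at least f + 1 replicas report it.
    With at most f Byzantine nodes, any set of f + 1 matching
    replies includes at least one correct node.

    `replies` maps replica id -> reported value."""
    if not replies:
        return None
    value, count = Counter(replies.values()).most_common(1)[0]
    return value if count >= f + 1 else None
```

If no value reaches the f + 1 threshold, the reader cannot distinguish honest disagreement from Byzantine equivocation and must retry or escalate.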

Network failures include two distinct sub-modes. Omission failures drop individual messages without either endpoint knowing. Partition failures split the network into two or more isolated components that cannot exchange any messages. Split-brain conditions emerge from partitions when the isolated components each believe themselves to be the authoritative partition and continue accepting writes. The consequence — divergent state — is the central problem that eventual consistency models and CRDTs (conflict-free replicated data types) are designed to resolve.
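A minimal example of how CRDTs resolve partition-induced divergence is the grow-only counter: each node tracks its own increments, and merge takes the per-node maximum, so replicas converge to the same state regardless of merge order. This is an illustrative sketch, not a production CRDT library:

```python
def merge(a, b):
    """G-counter merge: per-node maximum. Merging is commutative,
    associative, and idempotent, so divergent replicas converge
    no matter the order in which states are exchanged."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

def value(counter):
    """Observed count: sum of all per-node contributions."""
    return sum(counter.values())
```

Two replicas that accepted increments on opposite sides of a partition can exchange states after the partition heals; both arrive at the same total.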

Clock synchronization and time in distributed systems is directly implicated in performance failures: nodes using unsynchronized clocks may time out prematurely or accept stale lease renewals, triggering false crash suspects or stale-leader scenarios.
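One common defense against the stale-lease problem is to evaluate leases conservatively: a node treats its own lease as expired early, by the maximum clock skew it assumes, so a slow local clock cannot keep it acting as leader after peers consider the lease gone. A hedged sketch, where `max_skew_s` is an assumed bound rather than a measured value:

```python
def lease_valid(now, lease_expiry, max_skew_s):
    """Conservative lease check: shrink the lease window by the
    assumed worst-case clock skew. A node whose clock runs slow
    stops acting on the lease before peers see it expire."""
    return now < lease_expiry - max_skew_s
```

Choosing `max_skew_s` too small reintroduces the stale-leader risk; choosing it too large sacrifices lease utilization, which is the same tradeoff as timeout calibration elsewhere on this page.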


Common scenarios

The following failure scenarios are representative of production environments documented in public post-mortems from major cloud providers and academic literature:

  1. Leader crash during log replication — In a 5-node Raft cluster, the elected leader crashes mid-replication after sending new log entries to 2 of its 4 followers. Raft's election restriction ensures that only one of the 2 up-to-date followers can win the next election (the out-of-date followers grant their votes to the candidate with the more complete log), and the new leader then replicates the missing entries to the remaining followers. This is a clean crash-failure scenario that Raft handles by design.

  2. Corrupted NIC producing Byzantine messages — A network interface card on a storage node begins emitting packets with silently flipped bits. The node does not detect its own corruption. Peers receive different checksums depending on network path and cache state. Without BFT-capable validation, the system stores inconsistent data across replicas. Replication strategies documentation covers how checksum-based and quorum-read strategies mitigate this.

  3. AWS us-east-1 partition events — Public incident reports from AWS (published on the AWS Service Health Dashboard) have documented partition-style events where internal DNS or routing failures isolated subsets of Availability Zones. Systems relying on strong consistency halted writes; systems designed for eventual consistency continued operating with reconciliation occurring post-partition.

  4. GC pause triggering false crash detection — A JVM node undergoes a 30-second stop-the-world garbage collection pause. Peers time it out and elect a replacement leader. When the paused node resumes, it believes itself still leader — a split-brain scenario arising from a timing failure misclassified as a crash failure. Observability and monitoring practices exist specifically to surface GC pause metrics before they trigger false failure detection.

  5. BGP route leak causing network partition — A misconfigured BGP announcement redirects inter-datacenter traffic, partitioning two data centers. This is a real-world network failure class documented in public BGP incident records maintained by RIPE NCC, the European regional internet registry.
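Scenario 4 is commonly mitigated with fencing tokens: each newly elected leader obtains a strictly larger token, and downstream resources reject requests bearing a smaller one, so a leader resuming from a GC pause cannot overwrite newer state. A minimal sketch, where the `FencedStore` class is an illustrative stand-in for any fenced resource:

```python
class FencedStore:
    """Storage that rejects requests bearing a stale fencing token.
    A leader paused by GC resumes with its old token; the store has
    already seen the new leader's higher token and refuses the write."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, val):
        if token < self.highest_token:
            return False  # stale leader fenced off
        self.highest_token = token
        self.data[key] = val
        return True
```

The token source must itself be a linearizable service (for example, the consensus layer that elected the leader), otherwise two leaders could hold the same token.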


Decision boundaries

Choosing the correct failure model determines the architecture of the entire fault tolerance and resilience stack. The following boundaries are operationally decisive:

Crash vs. Byzantine — If the threat model is hardware failure, software crashes, or operator error, crash-fault-tolerant (CFT) protocols (Raft, Multi-Paxos, Viewstamped Replication) are sufficient and substantially cheaper. If the threat model includes malicious actors, compromised nodes, or untrusted third parties, Byzantine Fault Tolerant (BFT) protocols are required. BFT needs 3f + 1 nodes to tolerate f faults, versus 2f + 1 for crash tolerance, and carries O(n²) message complexity; deploying it in a fully trusted internal cluster wastes resources. Security in distributed systems defines the threat surface that triggers the BFT decision.
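The node-count overhead behind this boundary reduces to two formulas, sketched here for side-by-side comparison (function names are illustrative):

```python
def cft_nodes(f):
    # Crash fault tolerance (Raft, Multi-Paxos): majority quorums
    # require 2f + 1 nodes to tolerate f crashed nodes.
    return 2 * f + 1

def bft_nodes(f):
    # Byzantine fault tolerance (PBFT family): the Lamport/Shostak/
    # Pease bound requires 3f + 1 nodes to tolerate f faulty nodes.
    return 3 * f + 1
```

For f = 1 the gap is 3 versus 4 nodes; for f = 2 it is 5 versus 7, and the BFT side also pays the quadratic messaging cost per consensus round.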

Network partition vs. crash — A partitioned node is not a crashed node. A partitioned node is still running, still accepting local requests, and potentially still writing to its local state. Treating partition as crash causes data loss or split-brain. The correct response to a suspected partition is to fence the node (STONITH — Shoot The Other Node In The Head, a term formalized in Linux-HA documentation) or to require a quorum before accepting writes, so the minority side of the partition stops writing, not merely to re-elect a leader.
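The quorum-before-writes response can be sketched as a simple gate: a node keeps serving writes only while it, counted together with the peers it can still reach, forms a majority. An illustrative check, assuming a static cluster size:

```python
def may_accept_writes(reachable_peers, cluster_size):
    """Quorum gate: a node (counting itself) accepts writes only
    while it can reach a majority of the cluster; the minority
    side of a partition stops writing instead of diverging."""
    return reachable_peers + 1 >= cluster_size // 2 + 1
```

In a 5-node cluster, a node that can still reach 2 peers keeps writing (3 of 5); a node reaching only 1 peer refuses, preventing split-brain divergence.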

Performance failure threshold calibration — Failure detectors are configured with two parameters: the suspected-failure timeout and the failure-detection accuracy window. Setting timeouts too short converts performance failures (GC pauses, network congestion) into false crash detections. Setting them too long extends the recovery window after true crashes. The distributed systems benchmarks and performance reference covers latency percentile distributions that inform timeout calibration.
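One common calibration approach is to derive the timeout from observed latency percentiles, for example a multiple of the p99 round-trip time, so routine GC pauses and congestion rarely trip a false crash detection. An illustrative sketch; the percentile and multiplier values are assumptions, not recommendations:

```python
def suggest_timeout(samples_ms, percentile=0.99, multiplier=3.0):
    """Pick a failure-detector timeout from observed round-trip
    latencies: a multiple of a high percentile leaves headroom
    for transient slowness while bounding the recovery window."""
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] * multiplier
```

More sophisticated detectors, such as the phi-accrual family, replace the fixed multiplier with a continuously updated suspicion level derived from the inter-arrival distribution.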

The complete reference architecture for how these failure classes interact with consensus, replication, and recovery is mapped across the distributedsystemauthority.com reference network, giving practitioners the classification depth to make design-time decisions rather than recover from misclassified failures in production.


References