Strategies and Algorithms

Load balancing is a foundational traffic distribution mechanism in distributed systems. It spreads incoming requests or computational work across a pool of servers, nodes, or services to prevent resource saturation and maintain system availability. This page covers the definition and classification of load balancing strategies, the algorithmic mechanics that govern request routing, the operational scenarios where specific approaches excel or fail, and the decision criteria practitioners use to choose among them. The treatment applies to architects, engineers, and researchers working with horizontally scaled infrastructure, microservices environments, and cloud-native deployments.

Definition and scope

Load balancing in distributed systems refers to the process of distributing incoming network traffic, computation tasks, or storage operations across multiple backend resources to optimize resource utilization, maximize throughput, minimize response latency, and avoid overloading any single node. The scope encompasses both network-layer balancing (Layer 4, operating on IP addresses and TCP/UDP headers) and application-layer balancing (Layer 7, operating on HTTP headers, URLs, cookies, and content type).

The Internet Engineering Task Force (IETF) documents the foundational network protocols, including TCP connection behavior (RFC 793, since obsoleted by RFC 9293), that underpin Layer 4 load balancing implementations. At the application layer, stream multiplexing in HTTP/2 and HTTP/3, documented in RFC 9113 and RFC 9114 respectively, alters how persistent connections interact with load balancer affinity logic.

Load balancing operates at three broad scopes within a distributed architecture:

NIST SP 800-145, the NIST definition of cloud computing, lists rapid elasticity, which depends on effective load distribution, as one of five essential characteristics of cloud computing. Load balancing is the mechanism that makes elasticity operationally viable at runtime.

How it works

A load balancer receives an incoming request and applies a routing algorithm to select a backend destination. The choice of algorithm determines how fairly work is distributed and how well the balancer adapts to heterogeneous backend capacity. The algorithm classes referenced throughout this page include Round Robin, Weighted Round Robin, Least Connections, and IP Hash.
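The selection step can be sketched for three of the algorithms named on this page. This is a minimal illustration, not a production balancer: the `Backend` class, its `weight` and `active` fields, and the pool contents are assumptions made for the example, and the weighted variant uses the simplest possible expansion rather than a smoothed schedule.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    weight: int = 1   # relative capacity (hypothetical field)
    active: int = 0   # currently open connections (hypothetical field)


def round_robin(backends):
    """Rotate through backends in fixed order, ignoring load and capacity."""
    return itertools.cycle(backends)


def weighted_round_robin(backends):
    """Repeat each backend `weight` times per rotation (naive expansion)."""
    expanded = [b for b in backends for _ in range(b.weight)]
    return itertools.cycle(expanded)


def least_connections(backends):
    """Pick the backend with the fewest open connections at selection time."""
    return min(backends, key=lambda b: b.active)


pool = [Backend("a", weight=3), Backend("b", weight=1)]

rr = round_robin(pool)
wrr = weighted_round_robin(pool)
rr_order = [next(rr).name for _ in range(4)]
wrr_order = [next(wrr).name for _ in range(4)]

# Pretend backend "a" is busier than "b" right now.
pool[0].active = 5
pool[1].active = 2
lc_choice = least_connections(pool).name
```

Round Robin and its weighted variant decide from position in a rotation alone, while Least Connections reads live state, which is why it adapts better when request durations vary.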

Load balancers also perform health checking: they send periodic probes (typically TCP handshakes or HTTP GET requests to a designated health endpoint) to detect node failures. A failed node is removed from the routing pool once its probes fail, usually after a configured number of consecutive failures, so detection time is roughly the check interval multiplied by that failure threshold. Intervals are typically configured between 5 and 30 seconds depending on failure sensitivity requirements. This health-check mechanism directly supports fault tolerance and resilience goals.
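A rough sketch of the health-check bookkeeping follows. The `HealthTracker` class, the `fail_threshold` parameter, and the node name are hypothetical, and the rule that a single successful probe readmits a node is a simplification; real balancers often require several consecutive successes. The `tcp_probe` helper shows one way to implement a TCP-handshake probe with the standard library, but the demonstration feeds simulated results so it runs without a network.

```python
import socket
from dataclasses import dataclass, field


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP handshake to (host, port) completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class HealthTracker:
    fail_threshold: int = 3                       # assumed default
    failures: dict = field(default_factory=dict)  # node -> consecutive failures
    healthy: set = field(default_factory=set)     # nodes eligible for routing

    def register(self, node):
        self.failures[node] = 0
        self.healthy.add(node)

    def record(self, node, ok):
        """Record one probe result and update pool membership."""
        if ok:
            self.failures[node] = 0
            self.healthy.add(node)
        else:
            self.failures[node] = self.failures.get(node, 0) + 1
            if self.failures[node] >= self.fail_threshold:
                self.healthy.discard(node)


tracker = HealthTracker(fail_threshold=3)
tracker.register("node-1")
for ok in (False, False):
    tracker.record("node-1", ok)
still_in_pool = "node-1" in tracker.healthy    # two failures: not yet removed
tracker.record("node-1", False)
removed = "node-1" not in tracker.healthy      # third failure: removed
```

In a running balancer, a scheduler would call `tracker.record(node, tcp_probe(host, port))` once per check interval for each node.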

The relationship between load balancing and service discovery is structural: before a load balancer can route to a node, it must know the node exists. In dynamic container environments — addressed in detail in the container orchestration reference — service registry updates feed the load balancer's backend pool in near real time.
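One way to picture the registry-to-pool feed is as a reconciliation step: each time the registry publishes a snapshot, the balancer computes which backends to add and which to drop. The sketch below assumes the registry exposes a plain set of addresses; the function name and the addresses are illustrative, not any particular registry's API.

```python
def reconcile(current_pool, registry_snapshot):
    """Return the new pool plus the sets of added and removed backends."""
    added = registry_snapshot - current_pool
    removed = current_pool - registry_snapshot
    return set(registry_snapshot), added, removed


pool = {"10.0.0.1:8080", "10.0.0.2:8080"}
snapshot = {"10.0.0.2:8080", "10.0.0.3:8080"}   # node 1 gone, node 3 new
pool, added, removed = reconcile(pool, snapshot)
```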

Common scenarios

Stateless HTTP microservices represent the simplest load balancing scenario. Because no session state is stored on the backend node, any algorithm can be applied without affinity constraints. Least Connections or Weighted Round Robin are standard choices. This is the dominant pattern in microservices architecture deployments.

Stateful session workloads — such as shopping cart services or financial transaction flows — require either IP Hash affinity or externalized session state in a distributed cache. Without affinity or shared state, a request routed to a different node mid-session will fail to find the expected context. Externalizing session state to a shared cache eliminates the affinity dependency and is the preferred engineering approach in high-scale deployments.
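IP Hash affinity can be sketched as a deterministic mapping from client address to backend. The function below is an illustrative assumption, not a specific product's implementation: it hashes the address with SHA-256 and takes the result modulo the pool size. Backend names and the client address are made up for the example.

```python
import hashlib


def ip_hash(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend; the same IP always maps the same way."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["cart-1", "cart-2", "cart-3"]
first = ip_hash("203.0.113.7", backends)
second = ip_hash("203.0.113.7", backends)   # same client, same backend
```

Because the index depends on `len(backends)`, adding or removing a node remaps many clients under this simple modulo scheme, which is one reason externalized session state is the more robust option described above.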

Long-lived connections (WebSockets, gRPC streaming) create persistent connection imbalance under Round Robin, because Round Robin distributes connection establishment events evenly but cannot account for connections that remain open for minutes or hours, so hotspots form on whichever nodes happen to receive the long-lived ones. Least Connections avoids this by counting open connections at selection time. The gRPC and RPC frameworks reference covers the protocol-specific implications in detail.
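The imbalance can be reproduced with a small deterministic simulation. The workload is an assumption chosen to make the effect visible: arrivals strictly alternate between a long-lived connection that never closes and a short request that finishes before the next arrival. Real traffic is less regular, so the skew is usually less extreme.

```python
def simulate(policy, arrivals, n_backends=2):
    """Return open-connection counts per backend after all arrivals.

    arrivals: list of bools, True = long-lived connection (stays open),
    False = short request that closes before the next arrival.
    """
    active = [0] * n_backends
    for i, long_lived in enumerate(arrivals):
        if policy == "round_robin":
            target = i % n_backends
        else:  # "least_connections"; ties go to the lowest index
            target = min(range(n_backends), key=lambda b: active[b])
        if long_lived:
            active[target] += 1
    return active


arrivals = [True, False] * 10   # 10 long-lived, 10 short, alternating
rr_open = simulate("round_robin", arrivals)          # [10, 0]
lc_open = simulate("least_connections", arrivals)    # [5, 5]
```

Both policies hand out the same number of connection establishments, but under Round Robin every long-lived connection lands on the first backend, while Least Connections splits them evenly.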

Database read replicas use load balancing to distribute read queries across replica nodes, reducing primary node load. This intersects directly with replication strategies and consistency models, since load-balanced reads across asynchronous replicas may return stale data depending on replication lag.
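One common way to bound staleness is to exclude replicas whose replication lag exceeds a threshold and fall back to the primary when none qualify. The sketch below assumes lag is already measured and reported in seconds; the `Replica` class, the `max_lag_seconds` parameter, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float   # measured replication lag (assumed available)


def pick_read_target(replicas, max_lag_seconds, counter, primary="primary"):
    """Round-robin over replicas within the lag bound, else use the primary."""
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        return primary
    return fresh[counter % len(fresh)].name


replicas = [Replica("r1", 0.2), Replica("r2", 5.0), Replica("r3", 0.5)]
targets = [pick_read_target(replicas, 1.0, n) for n in range(3)]
strict = pick_read_target(replicas, 0.1, 0)   # no replica is fresh enough
```

Here `r2` is skipped because its lag exceeds the one-second bound, and the stricter bound sends the read to the primary, trading load reduction for freshness.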

Decision boundaries

Selecting a load balancing strategy requires evaluating at least 4 independent dimensions:

Layer 4 vs. Layer 7 is the primary classification boundary in load balancer architecture. Layer 4 balancers operate on TCP/UDP headers only: they are faster and simpler but cannot make routing decisions based on request content. Layer 7 balancers inspect HTTP headers, paths, and cookies, enabling path-based routing (e.g., routing /api/v2/ to one backend pool and /static/ to another), A/B traffic splitting, and protocol-aware health checks. Layer 7 inspection typically adds between a fraction of a millisecond and a few milliseconds of latency per request, a cost that is acceptable in most applications but material in ultra-low-latency financial or real-time control systems.
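Path-based routing at Layer 7 can be sketched as a longest-prefix match over the request path. The two prefixes come from the example above; the pool names and the default pool are assumptions for illustration.

```python
ROUTES = [
    ("/api/v2/", "api_v2_pool"),   # hypothetical pool names
    ("/static/", "static_pool"),
]
DEFAULT_POOL = "default_pool"


def select_pool(path):
    """Return the pool for the longest matching path prefix."""
    matches = [route for route in ROUTES if path.startswith(route[0])]
    if not matches:
        return DEFAULT_POOL
    return max(matches, key=lambda route: len(route[0]))[1]


api_target = select_pool("/api/v2/orders/42")
static_target = select_pool("/static/logo.png")
other_target = select_pool("/healthz")
```

A Layer 4 balancer cannot make this decision because the path is only visible after the HTTP request has been parsed.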

Latency and throughput targets — documented in system SLAs — are the operational anchors for this tradeoff. Systems with 99th-percentile latency targets below 10 milliseconds typically restrict Layer 7 load balancing to edge entry points and use Layer 4 or client-side balancing internally. The broader structural principles governing how distributed systems handle these tradeoffs are mapped in the Distributed Systems Authority index.
