Distributed Computing Paradigms: MapReduce, Stream Processing, and Batch Processing
Distributed computing paradigms define the structural models through which large-scale data processing workloads are decomposed, scheduled, and executed across multiple nodes. Three paradigms — MapReduce, stream processing, and batch processing — dominate the architectural landscape for data-intensive systems, each with distinct latency profiles, fault models, and throughput characteristics. Selecting the wrong paradigm for a workload imposes real operational costs: unnecessary latency, underutilized infrastructure, or incorrect results due to out-of-order event handling. This page maps the classification boundaries, mechanisms, and decision criteria across all three.
Definition and scope
Batch processing organizes computation as a bounded, time-delineated job: a defined input dataset is loaded, processed in full, and produces output before the next job begins. Hadoop MapReduce is the canonical formalization of the model: the Apache Software Foundation (Apache Hadoop Documentation) specifies job execution as a sequence of Map and Reduce phases operating on HDFS-resident data. Batch jobs are high-latency by design, on the order of minutes to hours, and prioritize throughput over responsiveness.
MapReduce is a specific batch-processing programming model, not a general synonym for batch computing. Google's original 2004 paper, authored by Jeffrey Dean and Sanjay Ghemawat and published in the proceedings of OSDI 2004 (USENIX OSDI 2004), defined the model as two user-supplied functions: Map, which transforms input key-value pairs into intermediate key-value pairs; and Reduce, which merges all intermediate values associated with the same intermediate key. The framework handles parallelization, fault recovery, and data locality automatically.
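The two user-supplied functions can be illustrated with the word-count example from the original paper. The sketch below is a minimal single-process stand-in: the function names and the sequential `run_job` driver are ours, and everything the real framework handles (parallelism, fault recovery, data locality) is elided.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: transform (filename, line) into intermediate (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values for one key into a total."""
    return (word, sum(counts))

def run_job(records):
    # Sequential stand-in for the framework's parallel execution.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # Map phase
            intermediate[k].append(v)     # implicit shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

result = run_job([("doc1", "a b a"), ("doc2", "b c")])
# result == {"a": 2, "b": 2, "c": 1}
```

The division of labor is the point: the programmer writes only the two pure functions, and the framework owns distribution.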
Stream processing operates on unbounded, continuously arriving data sequences. Rather than waiting for a complete dataset, a stream processor applies computation to each event or micro-batch as it arrives, producing incremental output with sub-second to low-second latency. The Apache Software Foundation's Apache Flink and Apache Kafka Streams are the most widely deployed open implementations, and the NIST Big Data Interoperability Framework (NIST SP 1500-6) explicitly classifies streaming as a distinct processing category from batch under its taxonomy of big data characteristics.
The broader context for these paradigms, including the consistency and fault-tolerance constraints that govern their underlying architectures, is covered in depth at Distributed System Authority, which surveys the full landscape of distributed infrastructure design.
How it works
MapReduce execution phases
- Input splitting — The input dataset is divided into fixed-size splits, with each split assigned to a Map task. Hadoop defaults to HDFS block size (128 MB in current configurations) as the split boundary.
- Map phase — Each Map task reads its assigned split, applies the user-defined Map function, and writes intermediate key-value pairs to local disk, partitioned by key hash.
- Shuffle and sort — The framework transfers intermediate data across the network to Reduce tasks, sorted by key. This phase is the primary source of network I/O amplification in MapReduce jobs.
- Reduce phase — Each Reduce task receives all values for its assigned key range, applies the user-defined Reduce function, and writes final output to distributed storage.
- Output commit — Output is atomically committed to HDFS, ensuring that partial task output from failed tasks does not corrupt final results.
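The five phases above can be traced end to end in a single-process sketch. The split size, hash partitioner, and function names below are illustrative stand-ins, not Hadoop's actual APIs:

```python
from collections import defaultdict

NUM_REDUCERS = 2

def split_input(lines, split_size=2):
    # Input splitting: fixed-size splits, one per Map task.
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_task(split):
    # Map phase: emit (word, 1) pairs to local partitions, keyed by hash.
    partitions = defaultdict(list)
    for line in split:
        for word in line.split():
            partitions[hash(word) % NUM_REDUCERS].append((word, 1))
    return partitions

def shuffle(map_outputs):
    # Shuffle and sort: route each partition to its reducer, sorted by key.
    by_reducer = defaultdict(list)
    for partitions in map_outputs:
        for r, pairs in partitions.items():
            by_reducer[r].extend(pairs)
    return {r: sorted(pairs) for r, pairs in by_reducer.items()}

def reduce_task(pairs):
    # Reduce phase: merge all values for each key in the assigned range.
    out = defaultdict(int)
    for word, count in pairs:
        out[word] += count
    return dict(out)

maps = [map_task(s) for s in split_input(["a b", "a c", "b b"])]
final = {}
for r, pairs in shuffle(maps).items():
    final.update(reduce_task(pairs))  # stand-in for the atomic output commit
# final == {"a": 2, "b": 3, "c": 1}
```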
Fault tolerance in MapReduce relies on task re-execution: if a Map or Reduce task fails, the framework reschedules it on another node using the original input split from durable storage. This model assumes idempotent task functions — a requirement shared with idempotency and exactly-once semantics frameworks in streaming systems.
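The re-execution model can be reduced to a retry loop: a failed task is simply re-run against its durable input split, and because the task function is pure, the retry produces identical output. The exception type and function names here are hypothetical:

```python
class TransientFailure(Exception):
    """Stand-in for a lost node or failed task attempt."""

def run_with_retries(task_fn, input_split, max_attempts=3):
    # Re-execute the task from its durable input until it succeeds.
    for attempt in range(1, max_attempts + 1):
        try:
            return task_fn(input_split)   # re-reads the original split
        except TransientFailure:
            if attempt == max_attempts:
                raise                     # out of attempts: fail the job

fail_once = {"remaining": 1}

def flaky_word_count(split):
    # Deterministic work, but the first attempt simulates a node failure.
    if fail_once["remaining"] > 0:
        fail_once["remaining"] -= 1
        raise TransientFailure("node lost")
    words = split.split()
    return {w: words.count(w) for w in set(words)}

counts = run_with_retries(flaky_word_count, "a b a")
# counts == {"a": 2, "b": 1}, identical to a failure-free run
```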
Stream processing mechanics
Stream processing engines maintain stateful operators (windows, joins, aggregations) in memory or in an embedded state store. Apache Flink uses a distributed snapshot mechanism (derived from the Chandy-Lamport algorithm) to checkpoint operator state to durable storage at configurable intervals, enabling exactly-once processing guarantees. Event-time processing, which assigns events to windows by their embedded timestamps rather than their arrival order, addresses out-of-order delivery, a core challenge described in the clock synchronization and time in distributed systems discipline.
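A tumbling event-time window makes the distinction concrete. The sketch below is not Flink's API, just a minimal illustration of bucketing events by embedded timestamp so that out-of-order arrival does not change the result; watermarks and late-data handling are omitted.

```python
from collections import defaultdict

WINDOW_MS = 1000  # tumbling window width (illustrative)

def window_start(event_time_ms):
    # Align each event's embedded timestamp to its window boundary.
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    """events: iterable of (event_time_ms, key), possibly out of order."""
    counts = defaultdict(int)
    for ts, key in events:
        counts[(window_start(ts), key)] += 1
    return dict(counts)

# Events arrive out of order; event-time windows still group them correctly.
out = aggregate([(1500, "click"), (200, "click"), (999, "view"), (100, "click")])
# out == {(1000, "click"): 1, (0, "click"): 2, (0, "view"): 1}
```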
Stream architectures are tightly coupled to message passing and event-driven architecture infrastructure, with Apache Kafka functioning as the primary durable event log in the majority of production deployments.
Batch processing mechanics
Batch systems load input from durable storage, process it without maintaining long-lived state across jobs, and write output back to storage. Modern batch frameworks such as Apache Spark improve on Hadoop MapReduce by keeping intermediate results in memory across stages rather than materializing every intermediate dataset to disk, reducing job latency by a factor of 10 to 100 for iterative workloads (per Apache Spark's published performance comparisons at spark.apache.org).
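The benefit for iterative workloads comes from avoiding repeated storage scans. The toy sketch below counts simulated scans under both strategies; it is a conceptual model only, since real Spark caches partitioned datasets across cluster memory rather than a local list:

```python
disk_reads = {"count": 0}

def load_from_storage():
    disk_reads["count"] += 1          # stand-in for an HDFS/S3 scan
    return list(range(10))

def iterate_materialized(iterations):
    # MapReduce-style: every iteration re-reads input from durable storage.
    total = 0
    for _ in range(iterations):
        total += sum(load_from_storage())
    return total

def iterate_cached(iterations):
    # Spark-style: load once, keep the intermediate dataset in memory.
    data = load_from_storage()
    return sum(sum(data) for _ in range(iterations))

disk_reads["count"] = 0
a = iterate_materialized(5)
scans_materialized = disk_reads["count"]   # 5 storage scans

disk_reads["count"] = 0
b = iterate_cached(5)
scans_cached = disk_reads["count"]         # 1 storage scan, same result
```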
Common scenarios
MapReduce remains appropriate for:
- Large-scale ETL (extract, transform, load) over historical datasets exceeding petabyte scale
- Log aggregation across months of archived data where latency is not a constraint
- Inverted index construction for search engines, where the shuffle-and-sort model aligns naturally with the data transformation
Stream processing is the standard choice for:
- Fraud detection pipelines requiring sub-second event evaluation
- Real-time recommendation engines processing clickstream data
- Telemetry aggregation for observability and monitoring systems, particularly when alert latency is operationally significant
- IoT sensor data integration requiring continuous windowed aggregation
Batch processing (non-MapReduce) is standard for:
- Overnight financial reconciliation jobs
- Machine learning training runs on static datasets
- Periodic report generation over complete historical records
The distributed systems use cases reference covers sector-specific instantiations of these paradigms across industries including financial services, healthcare data pipelines, and logistics networks.
Decision boundaries
Choosing between paradigms requires evaluating three dimensions: latency tolerance, data boundedness, and state complexity.
| Criterion | Batch / MapReduce | Stream Processing |
|---|---|---|
| Acceptable result latency | Minutes to hours | Milliseconds to seconds |
| Input dataset boundary | Bounded (finite) | Unbounded (continuous) |
| State management model | Stateless per job | Stateful operators with checkpointing |
| Fault recovery mechanism | Task re-execution from durable input | Distributed snapshots, offset management |
| Throughput optimization | Maximum (bulk I/O) | Moderate (per-event overhead) |
| Typical infrastructure | HDFS + YARN, S3 + EMR | Kafka + Flink, Kinesis + Lambda |
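The table's three dimensions can be read as a simple selection procedure. The function below is a toy encoding of that reading; the 60-second threshold is illustrative, not a production sizing rule.

```python
def choose_paradigm(latency_budget_s, bounded_input, needs_continuous_state):
    """Toy selector over the three decision dimensions above."""
    if not bounded_input or needs_continuous_state:
        return "stream"               # unbounded data or live state: stream
    if latency_budget_s < 60:
        return "stream"               # sub-minute results on bounded data
    return "batch"                    # bounded data, relaxed latency: batch

choice = choose_paradigm(latency_budget_s=3600, bounded_input=True,
                         needs_continuous_state=False)
# choice == "batch" (e.g. an overnight reconciliation job)
```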
A hybrid design, the Lambda architecture, combines both paradigms: a streaming layer produces low-latency approximate results while a batch layer periodically recomputes exact results over the complete historical data. The Kappa architecture simplifies this by using only a streaming layer with replayable log storage, relying on replication strategies in the log tier to guarantee durability.
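The Lambda architecture's serving layer merges the two views at query time: the batch view holds exact counts up to its last recompute, and the speed (streaming) view covers events since that cutoff. The names and data shapes below are hypothetical:

```python
def merged_count(key, batch_view, speed_view):
    # Serving-layer merge: exact historical total plus recent increments.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"clicks": 1_000_000}  # exact, recomputed nightly by the batch layer
speed_view = {"clicks": 42}         # incremental, events since the last batch run

total = merged_count("clicks", batch_view, speed_view)
# total == 1_000_042
```

The nightly recompute overwrites `batch_view` and resets `speed_view`, which is how the batch layer corrects any approximation drift in the streaming layer.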
Workloads that involve complex multi-partition joins or global sorts are poorly served by stream processors and are better handled in batch. Workloads requiring continuous state updates across an unbounded event stream, particularly those touching backpressure and flow control mechanisms to manage producer-consumer rate mismatches, are poorly served by batch systems.
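Backpressure in its simplest form is a bounded buffer between producer and consumer: when the consumer falls behind, the full buffer blocks the producer, matching the two rates. The sketch below uses Python's standard-library bounded queue; the queue size is arbitrary.

```python
import queue
import threading

def run(produce_n, queue_size=4):
    q = queue.Queue(maxsize=queue_size)  # the bound is the backpressure point
    consumed = []

    def consumer():
        while True:
            item = q.get()               # blocks when the queue is empty
            if item is None:
                break                    # sentinel: end of stream
            consumed.append(item)

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(produce_n):
        q.put(i)                         # blocks when the queue is full
    q.put(None)
    t.join()
    return consumed

items = run(20)
# items == list(range(20)); the producer was throttled, nothing was dropped
```

Real stream processors implement the same idea at a different layer (e.g. credit-based flow control between network channels), but the contract is identical: a slow consumer slows the producer rather than overflowing memory.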
The distributed systems design patterns reference elaborates architectural templates for combining these paradigms in production systems. Fault tolerance and resilience covers the failure modes specific to each paradigm's checkpoint and recovery model.