Skip to main content

MapReduce, Stream Processing, and Batch Processing

Distributed computing paradigms define the structural models through which large-scale data processing workloads are decomposed, scheduled, and executed across multiple nodes. Three paradigms — MapReduce, stream processing, and batch processing — dominate the architectural landscape for data-intensive systems, each with distinct latency profiles, fault models, and throughput characteristics. Selecting the wrong paradigm for a workload imposes real operational costs: unnecessary latency, underutilized infrastructure, or incorrect results due to out-of-order event handling. This page maps the classification boundaries, mechanisms, and decision criteria across all three.

Definition and scope

Batch processing organizes computation as a bounded, time-delineated job: a defined input dataset is loaded, processed in full, and produces output before the next job begins. The model traces formal structure through systems like Hadoop MapReduce, where the Apache Software Foundation (Apache Hadoop Documentation) specifies job execution as a sequence of Map and Reduce phases operating on HDFS-resident data. Batch jobs are inherently high-latency by design — minutes to hours — and prioritize throughput over responsiveness.

MapReduce is a specific batch-processing programming model, not a general synonym for batch computing. Google's original 2004 paper, authored by Jeffrey Dean and Sanjay Ghemawat and published in the proceedings of OSDI 2004 (USENIX OSDI 2004), defined the model as two user-supplied functions: Map, which transforms input key-value pairs into intermediate key-value pairs; and Reduce, which merges all intermediate values associated with the same intermediate key. The framework handles parallelization, fault recovery, and data locality automatically.

Stream processing operates on unbounded, continuously arriving data sequences. Rather than waiting for a complete dataset, a stream processor applies computation to each event or micro-batch as it arrives, producing incremental output with sub-second to low-second latency. The Apache Software Foundation's Apache Flink and Apache Kafka Streams are the most widely deployed open implementations, and the NIST Big Data Interoperability Framework (NIST SP 1500-6) explicitly classifies streaming as a distinct processing category from batch under its taxonomy of big data characteristics.

The broader context for these paradigms — including the consistency and fault-tolerance constraints that govern their underlying architectures — is covered in depth at Distributed System Authority, which structures the full service landscape of distributed infrastructure design.

How it works

MapReduce execution phases

Fault tolerance in MapReduce relies on task re-execution: if a Map or Reduce task fails, the framework reschedules it on another node using the original input split from durable storage. This model assumes idempotent task functions — a requirement shared with idempotency and exactly-once semantics frameworks in streaming systems.

Stream processing mechanics

Stream processing engines maintain stateful operators — windows, joins, aggregations — in memory or in an embedded state store. Apache Flink uses a distributed snapshot mechanism (derived from the Chandy-Lamport algorithm) to checkpoint operator state to durable storage at configurable intervals, enabling exactly-once processing guarantees. Event-time processing, which reorders events by their embedded timestamp rather than arrival time, addresses out-of-order delivery — a core challenge described in the clock synchronization and time in distributed systems discipline.

Stream architectures are tightly coupled to message passing and event-driven architecture infrastructure, with Apache Kafka functioning as the primary durable event log in the majority of production deployments.

Batch processing mechanics

Batch systems load input from durable storage, process it without maintaining long-lived state across jobs, and write output back to storage. Modern batch frameworks such as Apache Spark improve on Hadoop MapReduce by keeping intermediate results in memory across stages rather than materializing every intermediate dataset to disk, reducing job latency by a factor of 10 to 100x for iterative workloads (per Apache Spark's published performance comparisons at spark.apache.org).

Common scenarios

MapReduce remains appropriate for: - Large-scale ETL (extract, transform, load) over historical datasets exceeding petabyte scale - Log aggregation across months of archived data where latency is not a constraint - Inverted index construction for search engines, where the shuffle-and-sort model aligns naturally with the data transformation

Stream processing is the standard choice for: - Fraud detection pipelines requiring sub-second event evaluation - Real-time recommendation engines processing clickstream data - Telemetry aggregation for observability and monitoring systems, particularly when alert latency is operationally significant - IoT sensor data integration requiring continuous windowed aggregation

Batch processing (non-MapReduce) is standard for: - Overnight financial reconciliation jobs - Machine learning training runs on static datasets - Periodic report generation over complete historical records

The distributed systems use cases reference covers sector-specific instantiations of these paradigms across industries including financial services, healthcare data pipelines, and logistics networks.

Decision boundaries

Choosing between paradigms requires evaluating three dimensions: latency tolerance, data boundedness, and state complexity.

Criterion Batch / MapReduce Stream Processing

Acceptable result latency Minutes to hours Milliseconds to seconds

Input dataset boundary Bounded (finite) Unbounded (continuous)

State management model Stateless per job Stateful operators with checkpointing

Fault recovery mechanism Task re-execution from durable input Distributed snapshots, offset management

Throughput optimization Maximum (bulk I/O) Moderate (per-event overhead)

Typical infrastructure HDFS + YARN, S3 + EMR Kafka + Flink, Kinesis + Lambda

A hybrid architecture — commonly called the Lambda architecture or the Kappa architecture — combines both paradigms: a streaming layer handles low-latency approximate results while a batch layer periodically recomputes exact results over complete historical data. The Kappa architecture simplifies this by using only a streaming layer with replayable log storage, relying on replication strategies in the log tier to guarantee durability.

Workloads that involve complex multi-partition joins or global sorts are poorly served by stream processors and are better handled in batch. Workloads requiring continuous state updates across an unbounded event stream, particularly those touching backpressure and flow control mechanisms to manage producer-consumer rate mismatches, are poorly served by batch systems.

The distributed systems design patterns reference elaborates architectural templates for combining these paradigms in production systems. Fault tolerance and resilience covers the failure modes specific to each paradigm's checkpoint and recovery model.

References