Day 20 Hard Data Processing Batch/Stream Lambda/Kappa Exactly-once

Data Processing Systems: The War to Unify Batch and Stream
Batch vs Stream, Lambda/Kappa, Watermark, Exactly-once

Scenario + Requirements

Design an event processing platform for a streaming service: every play, pause, and rebuffer emits an event — 2 trillion events/day (~23M eps average, peaking at 50M eps in prime time). Two radically different consumers:

Real-time: per-second concurrent-viewer counts, rebuffer rate, CTR — fed into recommendations and alerts. SLO: end-to-end p99 < 5s, small errors tolerable.
Offline: exact revenue splits to content partners, BI reports. T+1 settlement, replayable for corrections — not a cent can be wrong.

The core tension: the same dataset must serve one consumer that wants low-latency approximations and another that wants high-throughput exactness. This is precisely the origin of the "batch vs stream" and "Lambda vs Kappa" debates. Key constraints: events arrive out of order (flaky mobile networks; offline-cached events upload hours later), and on failure the platform must neither lose nor duplicate events, because billing depends on them.
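The scenario's headline rates can be sanity-checked with a few lines of arithmetic. This sketch uses only the figures quoted above; the variable names are illustrative.

```python
# Back-of-envelope check of the scenario's event rates.
EVENTS_PER_DAY = 2_000_000_000_000   # 2 trillion events/day (from the scenario)
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400
PEAK_EPS = 50_000_000                # prime-time peak (from the scenario)

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_EPS / avg_eps

print(f"average: {avg_eps / 1e6:.1f}M eps")      # average: 23.1M eps
print(f"peak/average ratio: {peak_to_avg:.2f}")  # peak/average ratio: 2.16
```

A peak of roughly twice the average means capacity should be planned against the 50M eps figure, not the daily mean.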

High-Level Architecture (log-centric)

graph LR
    SRC["Client events<br/>50M eps"]
    LOG["Kafka<br/>replayable log · 7-day retention"]
    subgraph Stream layer
      SP["Flink jobs<br/>event-time windows"]
    end
    subgraph Replay/batch
      BP["Replay job<br/>same code"]
    end
    RT[("Real-time store<br/>Druid/Redis")]
    LAKE[("Data lake<br/>S3 + Iceberg")]
    SERVE["Serving<br/>recs/alerts/BI"]

    SRC --> LOG
    LOG -->|live consume| SP --> RT --> SERVE
    LOG -.replay from 0.-> BP --> LAKE --> SERVE

    classDef src fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef log fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef proc fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class SRC src
    class LOG log
    class SP,BP proc
    class RT,LAKE,SERVE store

Kafka is the source of truth: the live path consumes it online, and a correction replays the same data from the earliest retained offset, so no separate batch pipeline is needed.

Three roles: a replayable log (Kafka/Pulsar) buffers and decouples producers from consumers and enables replay; the stream layer (Flink) does stateful windowed aggregation, emitting real-time metrics; the storage layer splits two ways — real-time metrics into Druid/Redis for low-latency queries, raw detail into a data lake (S3+Iceberg) for offline exact accounting and replay.

Key Technical Points

1. Batch vs Stream: bounded data is a special case of unbounded

Principle: the naive view is "batch = collect a chunk, run once; stream = compute per record." The deeper insight: batch is just a special case of streaming — bounded data = unbounded data where you know the end. Spark's micro-batch chops the stream into tiny batches (e.g. 200ms each); Flink is truly per-record streaming and treats batch as "a stream with a head and a tail." Their latency floors differ accordingly.

	Spark (micro-batch)	Flink (true streaming)	Pure batch (MapReduce/Spark batch)
Latency	100s of ms ~ sec	milliseconds	minutes ~ hours
Throughput	very high	high	very high
State/windows	supported but weaker	first-class	none (full recompute)
Best for	ETL + near-real-time mix	low-latency stateful	large backfills/training sets

Trade-off:

Micro-batch: ✅ reuses the batch engine, simple scheduling, high throughput, fault tolerance falls out of batch boundaries (natural checkpoint); ❌ latency bounded below by the batch interval (scheduling overhead remains), window semantics constrained by batch edges.
True streaming: ✅ millisecond latency, flexible event-time windows, rich state; ❌ fault tolerance is harder (needs Asynchronous Barrier Snapshotting checkpoints, ABS for short), higher ops bar, state backend needs tuning.
Pure batch: ✅ lowest unit cost, ideal for TB-scale backfills and offline training; ❌ latency in hours, can't do real-time.
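The latency floor of micro-batching can be made concrete with a toy model. The sketch below is not Spark or Flink code; it only models how long an event waits before an engine starts working on it, using the 200 ms interval mentioned above and a hypothetical arrival pattern.

```python
# Toy model: how long an event waits before the engine even starts on it.
def micro_batch_wait_ms(arrival_ms: int, interval_ms: int) -> int:
    """Wait until the micro-batch containing this event closes."""
    next_boundary = (arrival_ms // interval_ms + 1) * interval_ms
    return next_boundary - arrival_ms

def per_record_wait_ms(arrival_ms: int) -> int:
    """A per-record engine picks the event up as soon as it arrives."""
    return 0

INTERVAL_MS = 200                      # the example batch interval from the text
arrivals = list(range(0, 1000, 10))    # hypothetical: one event every 10 ms for 1 s
waits = [micro_batch_wait_ms(t, INTERVAL_MS) for t in arrivals]

print(max(waits), sum(waits) / len(waits))   # 200 105.0
```

Under this model the average wait is about half the batch interval and the worst case is a full interval, before any scheduling or compute time is added.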

Real-world:

Netflix: the Keystone platform runs on Kafka + Flink, offering managed stream processing as a service (SPaaS), processing trillions of events daily.
Uber: real-time pricing, fraud detection, and ETA run heavily on Flink; their AthenaX lets users write streaming jobs in SQL.
Alibaba: Singles' Day dashboards and real-time risk control run on Flink (Alibaba maintained Blink, an internal Flink fork, and later contributed it back upstream).
Databricks: champions Spark Structured Streaming (micro-batch + Continuous mode), integrated with Delta Lake for a lakehouse.

2. Lambda vs Kappa: do you maintain two codebases?

Principle: the Lambda architecture (Nathan Marz) runs two paths — a batch layer computes exact results (high latency) and a speed layer computes real-time approximations (low latency); queries merge the two. The problem: the same business logic is written twice (once in Spark, once in Flink), they drift apart, and maintenance doubles. The Kappa architecture (Jay Kreps) insight: since stream engines have matured, drop the batch layer, keep one streaming path, and to correct anything, replay the log from the beginning.

Trade-off:

Lambda: ✅ the batch layer is a safety net, re-running history after a logic change is natural, exactness guaranteed; ❌ two codebases, two engines, consistency maintained by hand, doubled ops and cognitive cost.
Kappa: ✅ single codebase and engine, unified concepts, correction = replay; ❌ full replay is expensive (rebuild large state, re-flush downstream), depends on long log retention, recompute window bounded by retention.
Modern middle ground (lakehouse): Iceberg/Delta + Flink — stream-write, batch-read, one storage two access patterns, blurring the two-layer split.
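Kappa's dependence on log retention, noted in the trade-off above, can be sized roughly. In this sketch the event count comes from the scenario, while the 200-byte event size and replication factor of 3 are assumptions chosen only for illustration; compression is ignored.

```python
EVENTS_PER_DAY = 2_000_000_000_000   # from the scenario
BYTES_PER_EVENT = 200                # assumption: average serialized size
REPLICATION_FACTOR = 3               # assumption: replicas kept by the log

def log_storage_pb(retention_days: int) -> float:
    """Raw log storage in petabytes (10**15 bytes), before compression."""
    total_bytes = (EVENTS_PER_DAY * BYTES_PER_EVENT
                   * REPLICATION_FACTOR * retention_days)
    return total_bytes / 1e15

for days in (3, 7, 30):
    print(days, "days ->", log_storage_pb(days), "PB")
# 3 days -> 3.6 PB
# 7 days -> 8.4 PB
# 30 days -> 36.0 PB
```

Under these assumptions every extra day of replay window costs about 1.2 PB of log storage, which is why long-horizon corrections usually read from the data lake instead of the log.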

# Kappa "correction" is not a patch — it's a replay
# 1. Find a bug in the aggregation logic; fix the Flink job code
# 2. Start a new job instance, re-consume from offset 0 (or a checkpoint)
# 3. Write into a new result table metrics_v2
# 4. After validation, atomically switch downstream to v2, drop v1
# Prerequisites: long-enough Kafka retention + result tables can coexist
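The four steps above can be sketched with an in-memory list standing in for the Kafka topic. Everything here (the `LOG` contents, the two `count_plays` versions, the `serving` pointer) is hypothetical and only illustrates the replay-then-switch flow.

```python
# Kappa-style correction: fix the code, replay the log, switch tables.
LOG = [  # stand-in for a Kafka topic; list index = offset
    {"video": "a", "type": "play"},
    {"video": "a", "type": "pause"},
    {"video": "b", "type": "play"},
    {"video": "a", "type": "play"},
]

def count_plays_v1(event, table):
    # Buggy logic: counts every event type as a play.
    table[event["video"]] = table.get(event["video"], 0) + 1

def count_plays_v2(event, table):
    # Fixed logic: only "play" events count.
    if event["type"] == "play":
        table[event["video"]] = table.get(event["video"], 0) + 1

def run_job(logic, log, start_offset=0):
    """Consume the log from start_offset and build a fresh result table."""
    table = {}
    for event in log[start_offset:]:
        logic(event, table)
    return table

tables = {"metrics_v1": run_job(count_plays_v1, LOG)}
serving = {"current": "metrics_v1"}                  # what downstream reads

tables["metrics_v2"] = run_job(count_plays_v2, LOG)  # steps 2-3: replay into v2
assert tables["metrics_v2"] == {"a": 2, "b": 1}      # step 4: validate,
serving["current"] = "metrics_v2"                    # switch the pointer,
del tables["metrics_v1"]                             # then drop v1

print(tables[serving["current"]])   # {'a': 2, 'b': 1}
```

The old table stays readable until the pointer flips, which is the "result tables can coexist" prerequisite in practice.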

Real-world:

LinkedIn: birthplace of the Kappa idea. Kafka and Samza were both built here, and Jay Kreps (a Kafka co-creator) advocated "everything is a log" and replacing the batch layer with replay.
Uber: evolved a "Kappa+" pattern letting stream jobs efficiently replay historical data for backfills, avoiding a separate Spark batch pipeline.
Many teams still use Lambda: when offline-exact and real-time definitions genuinely differ in the business (final reconciliation vs live dashboard), two paths are actually clearer.

3. Event-time and Watermark: making peace with out-of-order and late data

Principle: distinguish two clocks — event-time (when the event actually happened, embedded in the data) and processing-time (when the system handles it). Network jitter and mobile offline caching create skew between them, and events arrive out of order. You must window by event-time to correctly count "plays of this video in 8:00–8:05," but the system can never be sure "have all the 8:05 events arrived yet?" A watermark is the engine's heuristic assertion that "event-time has advanced to T; data earlier than T has mostly arrived" — once it passes a window's right edge, the window fires and closes.

graph LR
    subgraph arrival["events arriving by processing-time (out of order)"]
      direction LR
      A["e@8:01"] --> B["e@8:04"] --> C["e@8:02"] --> WM["💧Watermark=8:05"] --> D["e@8:03<br/>late!"]
    end
    WM -->|fire| W["close 8:00-8:05 window<br/>emit aggregate"]
    D -.allowed lateness.-> W2["update emitted result<br/>or side output"]
    classDef ev fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef wm fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef win fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class A,B,C,D ev
    class WM wm
    class W,W2 win

Trade-off (completeness vs latency):

Aggressive watermark (small skew): ✅ low latency, results emitted early; ❌ more late data, less accurate.
Conservative watermark (large skew): ✅ more complete and accurate; ❌ windows close slowly, higher latency, state held longer.
Remedies: allowed lateness tolerates a window of late data and updates already-emitted results; anything later goes to a side output for separate handling. Akidau's Dataflow model frames this as What / Where / When / How.

// Flink event-time tumbling window + late-data handling (DataStream API, Flink 1.x).
// PlayEvent and its eventTimeMs field are illustrative names; lateTag is an OutputTag<PlayEvent>.
stream
  .assignTimestampsAndWatermarks(            // declare event-time and watermark strategy
     WatermarkStrategy.<PlayEvent>forBoundedOutOfOrderness(Duration.ofSeconds(10))
       .withTimestampAssigner((e, ts) -> e.eventTimeMs))  // take event-time from the record
  .keyBy(e -> e.videoId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .allowedLateness(Time.minutes(1))          // tolerate 1 more min of lateness after close
  .sideOutputLateData(lateTag)               // even-later events bypass to offline fix
  .aggregate(new PlayCountAgg());
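To see how the watermark, allowed lateness, and side output interact, here is a small pure-Python model of the same mechanics. It is a simplified sketch, not Flink's implementation: times are plain integers (minutes after 8:00), the watermark is "max event-time seen minus a bound", and the boundary conditions are rounded to whole units.

```python
from collections import defaultdict

def run_windows(event_times, size, bound, lateness):
    """Tumbling event-time windows with a bounded-out-of-orderness watermark.

    Returns (emissions, side_output). Each emission is (window_start, count);
    a window emits again whenever late data updates it.
    """
    counts = defaultdict(int)   # window_start -> running count (live state)
    fired = set()               # windows that have emitted at least once
    emissions, side_output = [], []
    watermark = float("-inf")

    for t in event_times:
        start = (t // size) * size
        if watermark >= start + size + lateness:
            side_output.append(t)                 # state already dropped: too late
        else:
            counts[start] += 1
            if start in fired:                    # late, but within allowed lateness
                emissions.append((start, counts[start]))

        watermark = max(watermark, t - bound)
        for s in sorted(counts):
            if s not in fired and watermark >= s + size:
                fired.add(s)
                emissions.append((s, counts[s]))  # first firing of the window
        for s in [s for s in counts if watermark >= s + size + lateness]:
            del counts[s]                         # window state expires
    return emissions, side_output

# 8:01, 8:04, 8:02, then 8:06 pushes the watermark to 8:05 and fires the window;
# 8:03 arrives late and updates it; 8:07 expires the window; 8:02 is too late.
emissions, side_output = run_windows([1, 4, 2, 6, 3, 7, 2], size=5, bound=1, lateness=1)
print(emissions)     # [(0, 3), (0, 4)]
print(side_output)   # [2]
```

The first window emits twice, so the downstream store has to accept updates to already-published results; that is the price of allowed lateness.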

Real-world:

Google: the watermark + trigger model behind Dataflow was systematized by Tyler Akidau et al. (Streaming 101/102) and, through Apache Beam, became the lingua franca of the streaming world.
Flink: natively supports event-time, watermarks, and allowed lateness — the most complete open-source implementation of this model.

4. Exactly-once: processed once, or "effect" once?

Principle: in distributed systems messages get redelivered and nodes crash. At-most-once (may drop) and at-least-once (may duplicate) are easy; exactly-once is the hard one. Key clarification: the network can never truly "deliver once" — engineering aims for effectively-once: state and output are equivalent to processing once. Flink's approach is checkpointing: using a variant of Chandy-Lamport called Asynchronous Barrier Snapshotting (ABS), it injects barriers into the stream; an operator snapshots its state on receiving a barrier; after a crash it restores from the last consistent snapshot and rolls back source offsets with it.

Trade-off:

At-least-once + idempotent sink: ✅ simple, low latency; ❌ pushes dedup onto downstream (unique-key upsert), and not all sinks are idempotent.
Exactly-once (transactional / 2PC sink): ✅ end-to-end exactness, mandatory for billing; ❌ barrier alignment adds latency and lowers throughput; transactional sinks (Kafka transactions, two-phase commit) add complexity and commit latency.
Essence: exactly-once requires both a "consistent internal state snapshot" and an "atomic output commit"; neither alone suffices. Checkpointing without a transactional or idempotent sink still double-writes downstream on restart.
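The "at-least-once + idempotent sink" option can be illustrated by comparing an append sink with a keyed upsert. The record shape and the choice of `(video, window)` as the unique key are assumptions made for this sketch.

```python
def append_sink(table: list, record: dict) -> None:
    table.append(record)                       # duplicates accumulate

def upsert_sink(table: dict, record: dict) -> None:
    key = (record["video"], record["window"])  # unique key for the result row
    table[key] = record["plays"]               # same key -> overwrite, not add

deliveries = [
    {"video": "a", "window": "08:00", "plays": 3},
    {"video": "b", "window": "08:00", "plays": 1},
    {"video": "a", "window": "08:00", "plays": 3},   # redelivered after a restart
]
appended, upserted = [], {}
for r in deliveries:
    append_sink(appended, r)
    upsert_sink(upserted, r)

print(len(appended), len(upserted))   # 3 2
```

The upsert is only idempotent because each record carries the full value for its key; a sink that applied increments ("plays += 3") would double-count on redelivery.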

# End-to-end exactly-once = state snapshot + transactional output (2PC)
# Between checkpoints: the sink writes output into an open transaction (Kafka txn / DB staging)
# During a checkpoint:
#   pre-commit:  barrier reaches the sink -> flush the transaction, snapshot its handle with the state
#   notify:      all operators snapshotted -> coordinator declares the checkpoint complete
#   commit:      sink commits the transaction; output becomes visible downstream
# On crash: transactions not covered by a completed checkpoint abort
# Recovery: source offset and operator state both return to the last consistent point; replay re-emits nothing already committed
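The difference between a plain sink and a checkpoint-aligned transactional sink shows up clearly in a tiny simulation. This is a single-process sketch with hypothetical names, not Flink's two-phase-commit API: a running sum is emitted per record, a checkpoint completes every `checkpoint_every` records, and one crash is injected.

```python
def run(records, checkpoint_every, crash_after, transactional):
    """Process records, crash once after `crash_after` records, recover, finish.

    Returns the output that downstream can see at the end.
    """
    visible = []                             # committed / externally visible output
    pending = []                             # open transaction (transactional sink only)
    checkpoint = {"offset": 0, "total": 0}   # last completed checkpoint
    offset, total, crashed = 0, 0, False

    while offset < len(records):
        total += records[offset]
        out = (offset, total)                # running sum emitted per record
        (pending if transactional else visible).append(out)
        offset += 1

        if offset % checkpoint_every == 0:   # checkpoint completes
            checkpoint = {"offset": offset, "total": total}
            visible.extend(pending)          # commit the pre-committed transaction
            pending = []

        if not crashed and offset == crash_after:
            crashed = True
            pending = []                     # abort the open transaction
            offset, total = checkpoint["offset"], checkpoint["total"]  # roll back

    visible.extend(pending)                  # final commit at end of input
    return visible

data = [1, 2, 3, 4, 5]
plain = run(data, checkpoint_every=2, crash_after=3, transactional=False)
txn = run(data, checkpoint_every=2, crash_after=3, transactional=True)
print(plain)  # [(0, 1), (1, 3), (2, 6), (2, 6), (3, 10), (4, 15)]
print(txn)    # [(0, 1), (1, 3), (2, 6), (3, 10), (4, 15)]
```

In both runs the internal state is correct after recovery; only the transactional run keeps the replayed record `(2, 6)` from appearing twice downstream, at the cost of output becoming visible only at checkpoint boundaries.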

Real-world:

Flink: fault tolerance based on the ABS variant of Chandy-Lamport consistent snapshots, combined with TwoPhaseCommitSinkFunction for end-to-end exactly-once (detailed in the official docs).
Kafka: achieves read-process-write exactly-once via the idempotent producer + transactions (transactional.id).
Netflix / Uber: billing and reconciliation jobs insist on exactly-once; pure monitoring metrics often settle for at-least-once + idempotency to get lower latency.

Scaling & Optimization

State growth: large windows / long TTLs push operator state to TB scale. Switch to the RocksDB state backend (on-disk + incremental checkpoints), set state TTL on keys, pre-aggregate hot keys.
Backpressure: a slow downstream propagates up the pipeline. Monitor operator busy/backpressure ratios, scale parallelism, or buffer ahead of slow sinks — this is Day 8's message-queue backpressure extended along the operator chain.
Lakehouse convergence: Iceberg/Delta/Hudi let streaming writes and batch reads share one storage, dissolving Lambda's two layers and leaning toward Kappa.
Replay cost: pure-Kappa full replay is slow; keep periodic savepoints/snapshots for "batch bootstrap + stream catch-up" instead of replaying from offset 0 each time.
Multi-region & DR: run mirrored jobs across regions, store checkpoints in object storage; on failover, resume from the remote checkpoint.

Pitfalls + Interview Questions

1. Windowing "per-minute metrics" by processing-time. The moment data is out of order or replayed, results are wrong — on replay all historical data floods in "now," and processing-time windows cram them into the same minute. Windowed metrics must use event-time.

2. Assuming exactly-once solves everything. It only guarantees state and output to the sink are exact; side effects (sending email, calling an external API) aren't in the transaction and can still duplicate. External side effects need their own idempotency.

3. Going Kappa without planning Kafka retention. Wanting to replay 30 days but storing only 3 — the data is gone when you need to correct. Kappa's feasibility is directly tied to log retention and recompute cost.

4. Setting the watermark too loose or too tight. Too tight → lots of late data dropped; too loose → windows close late, so latency and state both grow. Tune to the real skew distribution (look at p99 lateness) and use allowed lateness as a backstop.
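The first pitfall above is easy to reproduce. In this sketch the timestamps are made-up seconds, and the replay is modeled by giving every event the same hypothetical processing-time.

```python
from collections import Counter

MINUTE = 60

# Event-times in seconds: three events in minute 0, one in minute 1, two in minute 2.
event_times = [5, 20, 59, 61, 130, 170]

def per_minute_counts(timestamps):
    return dict(Counter(ts // MINUTE for ts in timestamps))

# Event-time windows: the bucket depends only on the data.
by_event_time = per_minute_counts(event_times)

# Replay: all six events are read within the same wall-clock second
# (hypothetical processing-time of 100_000 s), so they share one bucket.
replay_processing_times = [100_000 for _ in event_times]
by_processing_time = per_minute_counts(replay_processing_times)

print(by_event_time)        # {0: 3, 1: 1, 2: 2}
print(by_processing_time)   # {1666: 6}
```

The event-time result is the same on the live run and on any replay; the processing-time result changes every time the job is re-run.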

Common follow-ups: ① Why is "batch a special case of stream"? Where does the latency floor differ between micro-batch and true streaming? ② How exactly do Lambda's two codebases drift, how does Kappa fix it, and what new constraint does it introduce? ③ A slow-advancing watermark keeps windows open and state ballooning — how do you diagnose and mitigate? ④ What must source, operators, and sink each satisfy for end-to-end exactly-once? ⑤ At 50M eps and a 5s SLO, how do you estimate parallelism and state size?

Deep-Dive Resources

Designing Data-Intensive Applications, Ch 10–11 (Kleppmann): batch and stream processing, and unifying the two.
Jay Kreps, "Questioning the Lambda Architecture" (O'Reilly Radar): the original case for the Kappa architecture.
Tyler Akidau, "Streaming 101 / 102: The world beyond batch" (O'Reilly Radar): foundational reading on event-time, watermarks, and triggers; expanded into the book Streaming Systems.
Apache Flink docs · Stateful Stream Processing / Fault Tolerance: the ABS checkpoint and end-to-end exactly-once mechanism.
Netflix TechBlog, "Keystone Real-time Stream Processing Platform": engineering practice for a trillion-scale Kafka+Flink platform.

Going Deeper

1. Why is "batch just a special case of stream"? What concrete engineering wins does this unified view bring?

Bounded data (a file, a day of logs) = unbounded data where you know when it ends. Once you see batch as "a stream with a head and a tail," a single engine with one set of window/state/time semantics can express both: on a live stream the watermark advances with real time; on a historical batch the watermark can "fast-forward," advancing as fast as the data can be read.

Wins: ① business logic written once serves both real-time and backfill, eliminating Lambda's duplicate code and definition drift; ② you can validate a stream job's correctness by replaying historical data; ③ ops maintains one stack. Flink/Beam are designed around this — `Bounded` and `Unbounded` sources flow through the same operators. The cost is that this engine is far more complex than a pure batch one (it must handle event-time, state, fault tolerance).
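The "write the logic once" claim can be illustrated without any engine at all. In the sketch below, a Python generator plays the role of the operator; `live_stream` is a hypothetical endless source.

```python
from typing import Iterable, Iterator

def running_play_counts(events: Iterable[str]) -> Iterator[dict]:
    """One piece of logic; it never asks whether the input will end."""
    counts = {}
    for video_id in events:
        counts[video_id] = counts.get(video_id, 0) + 1
        yield dict(counts)       # emit an updated result per record

def live_stream():
    """Stand-in for an unbounded source (here it just cycles forever)."""
    while True:
        yield from ("a", "b", "a")

# Bounded input: run to the end and keep the final result ("batch").
batch_result = list(running_play_counts(["a", "b", "a"]))[-1]

# Unbounded input: take results as they arrive ("stream").
stream = running_play_counts(live_stream())
first_three = [next(stream) for _ in range(3)]

print(batch_result)       # {'a': 2, 'b': 1}
print(first_three[-1])    # {'a': 2, 'b': 1}
```

The only difference between the two uses is who decides when to stop reading, which is the sense in which batch is the bounded case of stream.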

2. An event-time window won't close because one partition's watermark is stuck. How do you find the root cause? Give at least 3 directions.

Key insight: an operator's watermark is the minimum of the watermarks of all its upstream input partitions. If one partition doesn't advance, the whole operator is held back: the shortest stave of the barrel sets the water level.

Idle partition: a Kafka partition gets no new data, so its watermark stays at an old value and drags the global min. Fix: enable `withIdleness` so idle sources don't participate in the min.
Skewed/lagging partition: one partition produces far behind (e.g. a region reports late), so its event-time genuinely lags. Fix: investigate upstream delay, or relax the watermark strategy.
Clock/timestamp errors: a few events carry future or ancient timestamps, polluting watermark generation. Fix: bound-filter timestamps, route anomalies to a side output.
Backpressure: a slow downstream stalls barriers/data, which indirectly looks like a stuck watermark. Distinguish via backpressure metrics.

Diagnose with Flink Web UI per-operator watermark metrics, drilling down to which subtask's watermark is low.
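The min rule and the effect of marking a partition idle can be shown in a few lines. This is a conceptual sketch, not Flink code; the partition names and watermark values are hypothetical.

```python
def operator_watermark(partition_watermarks, idle=frozenset()):
    """Minimum watermark across input partitions, skipping idle ones.

    Returns None if every partition is idle (nothing to advance on).
    """
    active = [wm for p, wm in partition_watermarks.items() if p not in idle]
    return min(active) if active else None

def slowest_partition(partition_watermarks):
    """The partition currently holding the operator back."""
    return min(partition_watermarks, key=partition_watermarks.get)

# Hypothetical per-partition watermarks, in minutes after 8:00.
wms = {"p0": 12, "p1": 11, "p2": 3}    # p2 stopped receiving data at 8:03

print(operator_watermark(wms))                 # 3
print(slowest_partition(wms))                  # p2
print(operator_watermark(wms, idle={"p2"}))    # 11
```

Excluding the idle partition lets windows up to 8:11 close; if `p2` later wakes up with old events, those arrive behind the watermark and are handled as late data.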

3. What must source, operators, and sink each satisfy for end-to-end exactly-once? Why is checkpointing alone not enough?

Source must be replayable and seekable: able to rewind and re-read by offset (Kafka qualifies; a plain socket does not). On recovery, offsets roll back with the state to the checkpoint point.
Operator state must snapshot consistently: via ABS barrier alignment, guaranteeing the snapshot reflects "exactly the records up to the barrier" — a consistent cut.
Sink must support transactions or idempotency: the most commonly missed piece. Checkpoint only, with a plain-append sink — on replay, data already written after the last checkpoint gets written again, duplicating downstream.

So exactly-once = consistent snapshot (internal) + atomic commit (boundary). Flink ties sink visibility to checkpoints via two-phase commit: commit the transaction only when the checkpoint succeeds, otherwise abort. The cost is "commit latency" — downstream only sees committed checkpoints, so exactly-once inherently sacrifices a bit of latency. That's why monitoring jobs prefer at-least-once + idempotent upsert.

4. At 50M eps and end-to-end p99 < 5s, roughly estimate the job's parallelism and state size. Where's the likely bottleneck?

Order-of-magnitude estimate (what interviews want):

Parallelism: a single Flink slot optimistically handles ~100K–500K eps (depending on logic complexity). At 200K eps per slot, 50M / 200K = 250 slots as a floor; with headroom, start around 400 slots. Kafka partitions must be ≥ source parallelism, or the extra source subtasks sit idle.
State size: 5-minute windows keyed by videoId, say 10M active videos at ~200B each → ~2GB active state; but with a high-cardinality key (user×video) or long windows, it easily hits TB → you must use the RocksDB backend + incremental checkpoints.
Latency budget: of the 5s, you must fit watermark skew (tolerating out-of-order, possibly 2–3s — the big chunk) + window firing + checkpoint interval + sink commit. Watermark tolerance, not computation, often dominates p99 latency.
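The parallelism and state figures above reduce to simple arithmetic. In this sketch the 200K eps per slot is the mid-range value used in the text, and the 160% headroom factor is an assumption picked so the result lands near the ~400 slots mentioned.

```python
import math

PEAK_EPS = 50_000_000            # from the scenario
EPS_PER_SLOT = 200_000           # mid-range of the 100K-500K per-slot estimate
HEADROOM_PCT = 160               # assumption: plan for 1.6x the bare minimum

min_slots = math.ceil(PEAK_EPS / EPS_PER_SLOT)
planned_slots = math.ceil(min_slots * HEADROOM_PCT / 100)

ACTIVE_KEYS = 10_000_000         # active videos in a window
BYTES_PER_KEY = 200              # per-key state size
state_gb = ACTIVE_KEYS * BYTES_PER_KEY / 1e9

print(min_slots, planned_slots, state_gb)   # 250 400 2.0
```

Changing the key from video to user×video multiplies `ACTIVE_KEYS` by orders of magnitude, which is where the jump from gigabytes to terabytes comes from.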

Most likely bottleneck: not CPU, but ① state IO (RocksDB read/write amplification under high key cardinality); ② data skew (a hot video's key overwhelms one subtask — pre-aggregate or salt to spread); ③ checkpoint duration (large state takes long to snapshot, hurting recovery and exactly-once commit latency).
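Salting a hot key, as suggested above, amounts to a two-stage aggregation. The sketch below is engine-agnostic; `NUM_SALTS` and the round-robin salt are illustrative choices.

```python
from collections import Counter

NUM_SALTS = 4   # assumption: fan a hot key out over 4 sub-keys

def stage1_partial_counts(events):
    """Pre-aggregate on (video_id, salt) so one hot video spreads over several keys."""
    partial = Counter()
    for seq, video_id in enumerate(events):
        salt = seq % NUM_SALTS   # round-robin salt; a hash of an event id also works
        partial[(video_id, salt)] += 1
    return partial

def stage2_merge(partial):
    """Drop the salt and sum the partials back into one count per video."""
    total = Counter()
    for (video_id, _salt), n in partial.items():
        total[video_id] += n
    return total

events = ["hot"] * 1000 + ["cold"] * 8
partial = stage1_partial_counts(events)
totals = stage2_merge(partial)

print(max(partial.values()))   # 250
print(dict(totals))            # {'hot': 1000, 'cold': 8}
```

The heaviest key drops from 1000 records to 250 while the final totals are unchanged. This works for aggregations that can be merged (counts, sums, min/max); exact distinct counts need a mergeable sketch or a different approach.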

5. Connecting Days 5/8: how does Kappa's "replay correction" relate to replication lag and delivery semantics?

All three share one root — they rest on the premise that "the log is truth and is replayable."

vs Day 8 (message queues): Kappa works only because of Kafka's durable, replayable log and at-least-once delivery. Replay is just resetting the consumer offset to the past. But at-least-once means replay redelivers duplicates, so downstream processing must be made exactly-once via checkpoint+transactions — otherwise the recomputed result doubles. This is exactly where Point 4 interlocks with Kappa.
vs Day 5 (replication): a stream processor's state backend + checkpoints are themselves a form of "state machine replication" — deterministically replay the input log and any replica rebuilds identical state. This is the same idea as a database leader shipping its WAL to followers to replay: log + deterministic replay = recoverable, replicable state.
Reverse constraint: precisely because replay depends on determinism, a stream job must not contain non-deterministic logic (`now()`, random numbers, dependence on external mutable state), or the replayed result won't match the first run, breaking Kappa's "recompute equals original compute" premise. This is an interviewer's favorite hidden trap.
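The determinism constraint can be demonstrated by injecting a clock. Both functions and the fake clock values below are hypothetical; the point is only that one depends on the record and the other on when the job runs.

```python
import time

def bucket_by_event_time(event):
    # Deterministic: depends only on the record itself.
    return event["ts"] // 60

def bucket_by_wall_clock(event, clock=time.time):
    # Non-deterministic: depends on when the job happens to run.
    return int(clock()) // 60

event = {"video": "a", "ts": 125}

first_run = bucket_by_event_time(event)
replay = bucket_by_event_time(event)
print(first_run == replay)    # True

# Inject fake clocks to stand in for "original run" and "replay a day later".
original = bucket_by_wall_clock(event, clock=lambda: 1_000_000)
replayed = bucket_by_wall_clock(event, clock=lambda: 1_000_000 + 86_400)
print(original == replayed)   # False
```

Passing the clock in as a parameter also makes the non-determinism visible in code review and lets tests pin it down.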