Day 20 · Hard · Data Processing · Batch/Stream · Lambda/Kappa · Exactly-once

Data Processing Systems: The War to Unify Batch and Stream
Batch vs Stream, Lambda/Kappa, Watermark, Exactly-once

Scenario + Requirements

Design an event processing platform for a streaming service: every play, pause, and rebuffer emits an event — 2 trillion events/day (~23M eps average, peaking at 50M eps in prime time). Two radically different consumers:

The core tension: there is one dataset, but one side wants low-latency approximations and the other wants high-throughput exactness. This is precisely the origin of the "batch vs stream" and "Lambda vs Kappa" debates. Key constraints: events arrive out of order (flaky mobile networks; offline-cached events upload hours later), and on failure the platform must neither lose nor duplicate events, because billing depends on exact counts.
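The load figures in the scenario can be checked with a few lines of arithmetic. This is only a sanity check of the stated numbers; the peak-to-average ratio is derived here and is not given in the scenario.

```python
# Sanity-check the stated load figures (values taken from the scenario).
EVENTS_PER_DAY = 2_000_000_000_000   # 2 trillion events/day
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400
PEAK_EPS = 50_000_000                # stated prime-time peak

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_EPS / avg_eps

print(f"average: {avg_eps / 1e6:.1f}M eps")    # average: 23.1M eps
print(f"peak/average: {peak_to_avg:.2f}x")     # peak/average: 2.16x
```

A peak roughly twice the average suggests sizing the stream layer for the peak and letting the log absorb short bursts.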

High-Level Architecture (log-centric)

graph LR
    SRC["Client events<br/>50M eps"]
    LOG["Kafka<br/>replayable log · 7-day retention"]
    subgraph Stream layer
        SP["Flink jobs<br/>event-time windows"]
    end
    subgraph Replay/batch
        BP["Replay job<br/>same code"]
    end
    RT[("Real-time store<br/>Druid/Redis")]
    LAKE[("Data lake<br/>S3 + Iceberg")]
    SERVE["Serving<br/>recs/alerts/BI"]
    SRC --> LOG
    LOG -->|live consume| SP --> RT --> SERVE
    LOG -.replay from 0.-> BP --> LAKE --> SERVE
    classDef src fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef log fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef proc fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class SRC src
    class LOG log
    class SP,BP proc
    class RT,LAKE,SERVE store

Kafka is the source of truth: the live path consumes online; to correct, replay the same data from offset 0 — no separate batch pipeline needed

Three roles: a replayable log (Kafka/Pulsar) buffers and decouples producers from consumers and enables replay; the stream layer (Flink) does stateful windowed aggregation, emitting real-time metrics; the storage layer splits two ways — real-time metrics into Druid/Redis for low-latency queries, raw detail into a data lake (S3+Iceberg) for offline exact accounting and replay.

Key Technical Points

1. Batch vs Stream: bounded data is a special case of unbounded

Principle: the naive view is "batch = collect a chunk, run once; stream = compute per record." The deeper insight: batch is just a special case of streaming — bounded data = unbounded data where you know the end. Spark's micro-batch chops the stream into tiny batches (e.g. 200ms each); Flink is truly per-record streaming and treats batch as "a stream with a head and a tail." Their latency floors differ accordingly.

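|               | Spark (micro-batch)       | Flink (true streaming) | Pure batch (MapReduce/Spark batch) |
|---------------|---------------------------|------------------------|------------------------------------|
| Latency       | 100s of ms ~ sec          | milliseconds           | minutes ~ hours                    |
| Throughput    | very high                 | high                   | very high                          |
| State/windows | supported but weaker      | first-class            | none (full recompute)              |
| Best for      | ETL + near-real-time mix  | low-latency stateful   | large backfills/training sets      |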
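The "bounded is a special case of unbounded" idea can be sketched in plain Python, with no stream engine involved: one per-record aggregation function consumes either a finite list or an endless generator. The function and source names here are illustrative assumptions, not part of any engine's API.

```python
from collections import Counter
from itertools import count, islice
from typing import Iterable, Iterator, Tuple


def running_play_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Per-record aggregation: emit the updated count after every event."""
    counts: Counter = Counter()
    for video_id in events:
        counts[video_id] += 1
        yield video_id, counts[video_id]


def endless_events() -> Iterator[str]:
    """Hypothetical unbounded source: cycles through three video ids forever."""
    for i in count():
        yield f"v{i % 3}"


# Bounded input ("batch"): the stream ends, so the last update per key is final.
batch_input = ["v0", "v1", "v0", "v2", "v0"]
batch_result = dict(running_play_counts(batch_input))

# Unbounded input ("stream"): same function, but only a prefix is ever observable.
stream_prefix = list(islice(running_play_counts(endless_events()), 5))

print(batch_result)    # {'v0': 3, 'v1': 1, 'v2': 1}
print(stream_prefix)   # [('v0', 1), ('v1', 1), ('v2', 1), ('v0', 2), ('v1', 2)]
```

The only difference between the two runs is whether the input ever ends, which is why a streaming engine can also run batch workloads while the reverse needs micro-batching.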
Trade-off:
Real-world:

2. Lambda vs Kappa: do you maintain two codebases?

Principle: the Lambda architecture (Nathan Marz) runs two paths — a batch layer computes exact results (high latency) and a speed layer computes real-time approximations (low latency); queries merge the two. The problem: the same business logic is written twice (once in Spark, once in Flink), they drift apart, and maintenance doubles. The Kappa architecture (Jay Kreps) insight: since stream engines have matured, drop the batch layer, keep one streaming path, and to correct anything, replay the log from the beginning.

Trade-off:
# Kappa "correction" is not a patch — it's a replay
# 1. Find a bug in the aggregation logic; fix the Flink job code
# 2. Start a new job instance, re-consume from offset 0 (or a checkpoint)
# 3. Write into a new result table metrics_v2
# 4. After validation, atomically switch downstream to v2, drop v1
# Prerequisites: long-enough Kafka retention + result tables can coexist
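The replay steps above can be sketched as a small in-memory simulation. Everything here is a stand-in chosen for illustration: `LOG` plays the role of the retained Kafka topic, `tables` the result store, and `serving_alias` the pointer that downstream readers follow.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]

# Hypothetical retained log; in Kappa this would be a topic with long retention.
LOG: List[Event] = [
    {"video": "a", "type": "play"},
    {"video": "a", "type": "pause"},
    {"video": "b", "type": "play"},
    {"video": "a", "type": "play"},
]


def job_v1(event: Event) -> bool:
    """Buggy logic: counts every event as a play."""
    return True


def job_v2(event: Event) -> bool:
    """Fixed logic: counts only play events."""
    return event["type"] == "play"


def run_from_offset(log: List[Event], start: int,
                    counts_as_play: Callable[[Event], bool]) -> Dict[str, int]:
    """Consume the log from the given offset and build a fresh result table."""
    table: Dict[str, int] = {}
    for event in log[start:]:
        if counts_as_play(event):
            table[event["video"]] = table.get(event["video"], 0) + 1
    return table


tables = {"metrics_v1": run_from_offset(LOG, 0, job_v1)}   # what production served
tables["metrics_v2"] = run_from_offset(LOG, 0, job_v2)     # steps 2-3: replay into a new table
serving_alias = "metrics_v1"

# Step 4: validate, then switch readers with one pointer update and drop v1.
expected_plays = sum(e["type"] == "play" for e in LOG)
assert sum(tables["metrics_v2"].values()) == expected_plays
serving_alias = "metrics_v2"
del tables["metrics_v1"]

print(tables[serving_alias])   # {'a': 2, 'b': 1}
```

Both tables coexist until the switch, which is the "result tables can coexist" prerequisite; the replay itself only works because the log still holds offset 0.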
Real-world:

3. Event-time and Watermark: making peace with out-of-order and late data

Principle: distinguish two clocks — event-time (when the event actually happened, embedded in the data) and processing-time (when the system handles it). Network jitter and mobile offline caching create skew between them, and events arrive out of order. You must window by event-time to correctly count "plays of this video in 8:00–8:05," but the system can never be sure "have all the 8:05 events arrived yet?" A watermark is the engine's heuristic assertion that "event-time has advanced to T; data earlier than T has mostly arrived" — once it passes a window's right edge, the window fires and closes.

graph LR
    subgraph arrival["events arriving by processing-time (out of order)"]
      direction LR
      A["e@8:01"] --> B["e@8:04"] --> C["e@8:02"] --> WM["💧Watermark=8:05"] --> D["e@8:03<br/>late!"]
    end
    WM -->|fire| W["close 8:00-8:05 window<br/>emit aggregate"]
    D -.allowed lateness.-> W2["update emitted result<br/>or side output"]
    classDef ev fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef wm fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef win fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class A,B,C,D ev
    class WM wm
    class W,W2 win
Trade-off (completeness vs latency):
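// Flink event-time tumbling window + late-data handling (DataStream API, Java)
// Assumptions: Event has fields videoId and eventTimeMs; PlayCountAgg outputs Long.
final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};  // tag for too-late events

SingleOutputStreamOperator<Long> counts = stream
  .assignTimestampsAndWatermarks(            // declare event-time and watermark strategy
     WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
       .withTimestampAssigner((e, ts) -> e.eventTimeMs))  // read event-time from the record
  .keyBy(e -> e.videoId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .allowedLateness(Time.minutes(1))          // tolerate 1 more min of lateness after close
  .sideOutputLateData(lateTag)               // even-later events bypass to offline fix
  .aggregate(new PlayCountAgg());

DataStream<Event> lateEvents = counts.getSideOutput(lateTag);  // input to the offline correction path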
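To see how the three knobs in the Flink snippet interact, here is a plain-Python simulation of a single key. It is a simplified model, not Flink's implementation: the watermark is taken as the maximum seen event-time minus the out-of-orderness bound, and a window fires once the watermark reaches its end (Flink's exact boundaries differ by one millisecond). Times are seconds after 8:00.

```python
from typing import Dict, List, Tuple

WINDOW = 300            # 5-minute tumbling windows, in seconds
OUT_OF_ORDERNESS = 10   # watermark trails the max seen event-time by 10 s
ALLOWED_LATENESS = 60   # fired windows stay updatable for 1 more minute


def run(event_times: List[int]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return (emitted (window_start, count) results, side-output events)."""
    counts: Dict[int, int] = {}     # window_start -> count (live window state)
    fired = set()                   # windows that have already emitted once
    emitted: List[Tuple[int, int]] = []
    side_output: List[int] = []
    watermark = float("-inf")

    for t in event_times:
        start = t - t % WINDOW
        if watermark >= start + WINDOW + ALLOWED_LATENESS:
            side_output.append(t)                        # state already dropped: too late
        else:
            counts[start] = counts.get(start, 0) + 1
            if start in fired:
                emitted.append((start, counts[start]))   # late update of a fired window

        watermark = max(watermark, t - OUT_OF_ORDERNESS)
        for w in sorted(counts):
            end = w + WINDOW
            if w not in fired and watermark >= end:
                fired.add(w)
                emitted.append((w, counts[w]))           # first firing
            if watermark >= end + ALLOWED_LATENESS:
                del counts[w]                            # lateness expired: free the state
    return emitted, side_output


# 8:01, 8:04, 8:02, then 8:05:15 pushes the watermark past 8:05;
# 8:03 arrives late but within allowed lateness; 8:03:20 arrives after it expired.
emitted, side_output = run([60, 240, 120, 315, 180, 400, 200])
print(emitted)       # [(0, 3), (0, 4)]
print(side_output)   # [200]
```

The first result for the 8:00 window is emitted with 3 plays, then corrected to 4; downstream consumers therefore need to treat a window's result as an upsert rather than an append.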
Real-world:

4. Exactly-once: processed once, or "effect" once?

Principle: in distributed systems messages get redelivered and nodes crash. At-most-once (may drop) and at-least-once (may duplicate) are easy; exactly-once is the hard one. Key clarification: the network can never truly "deliver once" — engineering aims for effectively-once: state and output are equivalent to processing once. Flink's approach is checkpointing: using a variant of Chandy-Lamport called Asynchronous Barrier Snapshotting (ABS), it injects barriers into the stream; an operator snapshots its state on receiving a barrier; after a crash it restores from the last consistent snapshot and rolls back source offsets with it.

Trade-off:
# End-to-end exactly-once = state snapshot + transactional output (2PC)
# For each checkpoint:
#   pre-commit:  when the barrier reaches the sink, flush the output written since the last checkpoint into an open transaction (Kafka txn / DB staging)
#   notify:      all operators snapshotted -> coordinator marks the checkpoint complete and notifies the sinks
#   commit:      transaction commits, visible downstream; on a crash before the checkpoint completes -> abort, roll back to last checkpoint
# Recovery: source offset and operator state both return to the last consistent point, so replayed input does not re-emit committed output
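The protocol above can be illustrated with a toy single-process model. This is a sketch under strong simplifications: one operator, one sink, and pre-commit and commit collapsed into a single step at each checkpoint. `LOG`, `CHECKPOINT_EVERY`, and `crash_at` are illustrative names, not Flink settings.

```python
from typing import List, Optional, Tuple

LOG = [3, 1, 4, 1, 5, 9, 2, 6]      # hypothetical replayable source
CHECKPOINT_EVERY = 3                # records between barriers


def run(crash_at: Optional[int] = None) -> List[int]:
    """Emit running sums to a transactional sink; optionally crash once at an offset."""
    committed: List[int] = []                 # what downstream can see
    checkpoint: Tuple[int, int] = (0, 0)      # (source offset, operator state)
    crashed = False

    while True:
        offset, total = checkpoint            # (re)start from the last consistent snapshot
        pending: List[int] = []               # open transaction, invisible downstream
        try:
            while offset < len(LOG):
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("simulated worker failure")
                total += LOG[offset]
                offset += 1
                pending.append(total)             # write into the open transaction
                if offset % CHECKPOINT_EVERY == 0:
                    checkpoint = (offset, total)  # snapshot offset and state together
                    committed.extend(pending)     # checkpoint complete: commit the transaction
                    pending = []
            committed.extend(pending)         # final commit at the end of the bounded input
            return committed
        except RuntimeError:
            continue                          # abort: pending output is discarded, then restore


print(run())             # [3, 4, 8, 9, 14, 23, 25, 31]
print(run(crash_at=5))   # [3, 4, 8, 9, 14, 23, 25, 31]
```

In the crashing run the values 9 and 14 are produced twice internally, but the first copies sit in an aborted transaction, so downstream sees each exactly once. With a non-transactional sink those two values would have been visible twice.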
Real-world:

Scaling & Optimization

Pitfalls + Interview Questions

1. Windowing "per-minute metrics" by processing-time. The moment data is out of order or replayed, results are wrong — on replay all historical data floods in "now," and processing-time windows cram them into the same minute. Windowed metrics must use event-time.
2. Assuming exactly-once solves everything. It only guarantees state and output to the sink are exact; side effects (sending email, calling an external API) aren't in the transaction and can still duplicate. External side effects need their own idempotency.
3. Going Kappa without planning Kafka retention. Wanting to replay 30 days but storing only 3 — the data is gone when you need to correct. Kappa's feasibility is directly tied to log retention and recompute cost.
4. Setting the watermark too loose or too tight. Too tight → lots of late data dropped; too loose → windows won't close, latency and state both high. Tune to the real skew distribution (look at p99 lateness) and use allowed lateness as a backstop.
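Pitfall 1 is easy to reproduce in a few lines. The sketch below counts the same four events per minute, once live and once during a replay in which every record is processed within the same second. All timestamps are made-up illustrative values.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (event_time_s, processing_time_s) pairs.
live = [(0, 1), (30, 31), (70, 72), (130, 131)]
# Replaying the same events later: all records are processed at t=10000.
replay = [(event_t, 10_000) for event_t, _ in live]


def per_minute(records: Iterable[Tuple[int, int]], use_event_time: bool) -> Dict[int, int]:
    """Count records per 60 s tumbling window, keyed by window start."""
    idx = 0 if use_event_time else 1
    return dict(Counter(r[idx] - r[idx] % 60 for r in records))


print(per_minute(live, use_event_time=True))      # {0: 2, 60: 1, 120: 1}
print(per_minute(replay, use_event_time=True))    # {0: 2, 60: 1, 120: 1}
print(per_minute(live, use_event_time=False))     # {0: 2, 60: 1, 120: 1}
print(per_minute(replay, use_event_time=False))   # {9960: 4}
```

Processing-time windows happen to look right on the live path when delay is small, which is why the bug often surfaces only on the first replay.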

Common follow-ups: ① Why is "batch a special case of stream"? Where does the latency floor differ between micro-batch and true streaming? ② How exactly do Lambda's two codebases drift, how does Kappa fix it, and what new constraint does it introduce? ③ A slow-advancing watermark keeps windows open and state ballooning — how do you diagnose and mitigate? ④ What must source, operators, and sink each satisfy for end-to-end exactly-once? ⑤ At 50M eps and a 5s SLO, how do you estimate parallelism and state size?

Deep-Dive Resources

Going Deeper

1. Why is "batch just a special case of stream"? What concrete engineering wins does this unified view bring?

Bounded data (a file, a day of logs) = unbounded data where you know when it ends. Once you see batch as "a stream with a head and a tail," a single engine with one set of window/state/time semantics can express both: on a live stream the watermark advances with real time; on a historical batch the watermark can "fast-forward," scanning all the data instantly.

Wins: ① business logic written once serves both real-time and backfill, eliminating Lambda's duplicate code and definition drift; ② you can validate a stream job's correctness by replaying historical data; ③ ops maintains one stack. Flink/Beam are designed around this — `Bounded` and `Unbounded` sources flow through the same operators. The cost is that this engine is far more complex than a pure batch one (it must handle event-time, state, fault tolerance).

2. An event-time window won't close because one partition's watermark is stuck. How do you find the root cause? Give at least 3 directions.

Key insight: an operator's watermark is the minimum of the watermarks of all its upstream input partitions. If one partition doesn't advance, the whole operator is held back, like the shortest stave that sets the water level in a barrel.

  • Idle partition: a Kafka partition gets no new data, so its watermark stays at an old value and drags the global min. Fix: enable `withIdleness` so idle sources don't participate in the min.
  • Skewed/lagging partition: one partition produces far behind (e.g. a region reports late), so its event-time genuinely lags. Fix: investigate upstream delay, or relax the watermark strategy.
  • Clock/timestamp errors: a few events carry future or ancient timestamps, polluting watermark generation. Fix: bound-filter timestamps, route anomalies to a side output.
  • Backpressure: a slow downstream stalls barriers/data, which indirectly looks like a stuck watermark. Distinguish via backpressure metrics.

Diagnose with Flink Web UI per-operator watermark metrics, drilling down to which subtask's watermark is low.
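The min rule and the effect of excluding idle partitions can be modeled in a few lines. This is a conceptual sketch only: `operator_watermark`, `last_seen`, and the 30-second timeout are hypothetical names and values, and it does not reproduce how Flink tracks idleness internally.

```python
from typing import Dict, Optional

IDLE_TIMEOUT = 30   # seconds without data before a partition counts as idle (assumed value)


def operator_watermark(
    partition_wm: Dict[int, int],
    last_seen: Dict[int, int],
    now: int,
    idle_timeout: Optional[int] = None,
) -> Optional[int]:
    """Combine per-partition watermarks by taking the minimum.

    With idle_timeout set, partitions silent for longer than the timeout
    are left out of the minimum.
    """
    active = [
        wm for p, wm in partition_wm.items()
        if idle_timeout is None or now - last_seen[p] <= idle_timeout
    ]
    return min(active) if active else None


partition_wm = {0: 1000, 1: 1005, 2: 400}   # partition 2 stopped receiving data
last_seen = {0: 1010, 1: 1011, 2: 410}      # processing-time of each partition's last record

stuck = operator_watermark(partition_wm, last_seen, now=1012)
unblocked = operator_watermark(partition_wm, last_seen, now=1012, idle_timeout=IDLE_TIMEOUT)
culprit = min(partition_wm, key=partition_wm.get)

print(stuck)       # 400
print(unblocked)   # 1000
print(culprit)     # 2
```

The last line mirrors the diagnosis step: find the input whose watermark is the minimum, then decide whether it is idle, genuinely lagging, or carrying bad timestamps.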

3. What must source, operators, and sink each satisfy for end-to-end exactly-once? Why is checkpointing alone not enough?

  • Source must be replayable and seekable: able to rewind and re-read by offset (Kafka qualifies; a plain socket does not). On recovery, offsets roll back together with the state to the checkpointed position.
  • Operator state must snapshot consistently: via ABS barrier alignment, guaranteeing the snapshot reflects "exactly the records up to the barrier", that is, a consistent cut.
  • Sink must support transactions or idempotency: the most commonly missed piece. With checkpointing alone and a plain-append sink, output written after the last checkpoint is written again on replay, duplicating data downstream.

So exactly-once = consistent snapshot (internal) + atomic commit (boundary). Flink ties sink visibility to checkpoints via two-phase commit: commit the transaction only when the checkpoint succeeds, otherwise abort. The cost is commit latency: downstream sees output only after the checkpoint that covers it commits, so results lag by up to one checkpoint interval. That's why monitoring jobs, which care more about freshness, prefer at-least-once + idempotent upsert.
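The difference between a plain-append sink and an idempotent upsert under at-least-once delivery can be shown with two in-memory sinks. The records and the `(window_start, video_id)` key are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (window_start, video_id, count) results.
first_attempt = [(0, "a", 3), (0, "b", 1), (300, "a", 2)]
after_restore = [(0, "b", 1), (300, "a", 2)]    # re-emitted after restoring the last checkpoint
deliveries = first_attempt + after_restore       # at-least-once delivery to the sink

append_sink: List[Tuple[int, str, int]] = []
upsert_sink: Dict[Tuple[int, str], int] = {}

for window_start, video_id, count in deliveries:
    append_sink.append((window_start, video_id, count))   # plain append: duplicates accumulate
    upsert_sink[(window_start, video_id)] = count         # keyed upsert: redelivery overwrites

print(sum(c for _, _, c in append_sink))   # 9
print(sum(upsert_sink.values()))           # 6
```

The upsert total matches a single delivery, but only because the replay recomputes the same value for the same key; that is the determinism requirement discussed under question 5.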

4. At 50M eps and end-to-end p99 < 5s, roughly estimate the job's parallelism and state size. Where's the likely bottleneck?

Order-of-magnitude estimate (what interviews want):

  • Parallelism: a single Flink slot optimistically handles ~100K–500K eps (depending on logic complexity). At 200K eps per slot, 50M / 200K = 250 slots is the minimum; with headroom, start around 400 slots. Kafka partitions must be ≥ source parallelism, or the extra source subtasks sit idle.
  • State size: 5-minute windows keyed by videoId, say 10M active videos at ~200B each → ~2GB active state; but with a high-cardinality key (user×video) or long windows, it easily hits TB → you must use the RocksDB backend + incremental checkpoints.
  • Latency budget: of the 5s, you must fit watermark skew (tolerating out-of-order, possibly 2–3s — the big chunk) + window firing + checkpoint interval + sink commit. Watermark tolerance, not computation, often dominates p99 latency.

Most likely bottleneck: not CPU, but ① state IO (RocksDB read/write amplification under high key cardinality); ② data skew (a hot video's key overwhelms one subtask — pre-aggregate or salt to spread); ③ checkpoint duration (large state takes long to snapshot, hurting recovery and exactly-once commit latency).
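The order-of-magnitude estimate above is simple enough to script, which makes it easy to vary the assumptions. The per-slot throughput and key sizes are the figures quoted above; the 60% headroom is an assumed value chosen to land near the ~400-slot starting point.

```python
import math

PEAK_EPS = 50_000_000
PER_SLOT_EPS = 200_000        # within the 100K-500K range quoted above
HEADROOM_PCT = 60             # assumed headroom, in percent

ACTIVE_KEYS = 10_000_000      # active videos in a 5-minute window
BYTES_PER_KEY = 200

min_parallelism = math.ceil(PEAK_EPS / PER_SLOT_EPS)
target_slots = math.ceil(min_parallelism * (100 + HEADROOM_PCT) / 100)
state_gb = ACTIVE_KEYS * BYTES_PER_KEY / 1e9
state_mb_per_slot = ACTIVE_KEYS * BYTES_PER_KEY / target_slots / 1e6

print(min_parallelism)     # 250
print(target_slots)        # 400
print(state_gb)            # 2.0
print(state_mb_per_slot)   # 5.0
```

About 5 MB of state per slot is trivial, which is why the answer points at key cardinality and skew rather than the average: a user×video key or one hot video changes these numbers by orders of magnitude.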

5. Connecting Days 5/8: how does Kappa's "replay correction" relate to replication lag and delivery semantics?

All three share one root — they rest on the premise that "the log is truth and is replayable."

  • vs Day 8 (message queues): Kappa works only because of Kafka's durable, replayable log and at-least-once delivery. Replay is just resetting the consumer offset to the past. But at-least-once means replay redelivers duplicates, so downstream processing must be made exactly-once via checkpoint+transactions — otherwise the recomputed result doubles. This is exactly where Point 4 interlocks with Kappa.
  • vs Day 5 (replication): a stream processor's state backend + checkpoints are themselves a form of "state machine replication" — deterministically replay the input log and any replica rebuilds identical state. This is the same idea as a database leader shipping its WAL to followers to replay: log + deterministic replay = recoverable, replicable state.
  • Reverse constraint: precisely because replay depends on determinism, a stream job must not contain non-deterministic logic (`now()`, random numbers, dependence on external mutable state), or the replayed result won't match the first run, breaking Kappa's "recompute equals original compute" premise. This is an interviewer's favorite hidden trap.
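The determinism trap can be demonstrated by comparing a transformation that depends on the wall clock with one that depends only on the record. In this sketch `now_s` stands in for a call to the system clock, passed in explicitly so the two runs can be compared; the field names are illustrative.

```python
from typing import Dict, List

events: List[Dict] = [
    {"user": "u1", "event_time": 100},
    {"user": "u2", "event_time": 160},
]


def enrich_with_wall_clock(event: Dict, now_s: int) -> Dict:
    """Non-deterministic: the output depends on when the job happens to run."""
    return {**event, "bucket": now_s // 60}


def enrich_from_event(event: Dict) -> Dict:
    """Deterministic: the output depends only on the record itself."""
    return {**event, "bucket": event["event_time"] // 60}


# First run at wall-clock t=1000, replay much later at t=90000.
clock_first = [enrich_with_wall_clock(e, now_s=1_000) for e in events]
clock_replay = [enrich_with_wall_clock(e, now_s=90_000) for e in events]

event_first = [enrich_from_event(e) for e in events]
event_replay = [enrich_from_event(e) for e in events]

print(clock_first == clock_replay)   # False
print(event_first == event_replay)   # True
print([e["bucket"] for e in event_first])   # [1, 2]
```

Only the second variant can be safely replayed into an upsert sink or compared against the original run for validation.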