Design the telemetry backend for 10 million connected devices (an EV fleet / industrial sensors / smart meters): each reports 20 metrics — battery, temperature, GPS, current — every 10 seconds. This is a different species from a request-response web backend: it is write-dominated (99% writes), connections are long-lived not short requests, the network is unreliable (a car enters a tunnel, a sensor loses power), and data is append-only — written once, almost never updated.
graph LR
D["Devices / Sensors
tens of millions"]
EG["Edge Gateway
filter · aggregate · local alert"]
BR["MQTT Broker Cluster
EMQX · 1M conns/node"]
K["Kafka Buffer
peak shaving · decouple"]
SP["Stream Processing
Flink · dedup/downsample"]
TSDB[("Time-Series DB
hot · compressed")]
OBJ[("Object Storage
cold · Parquet")]
AL["Alerting / Dashboards"]
D -->|MQTT| EG -->|MQTT/QoS| BR
BR --> K --> SP
SP --> TSDB
SP --> OBJ
TSDB --> AL
classDef dev fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef pipe fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class D dev
class EG,BR edge
class K,SP pipe
class TSDB,OBJ,AL store
Four stages, distinct jobs: edge reduces volume, broker absorbs connections, Kafka shaves peaks & decouples, TSDB compresses & stores. Any stage writing directly to the next gets flooded.
The core idea is reduce and decouple at every layer. The edge gateway aggregates raw samples into summaries locally, cutting ~90% of backhaul bandwidth; the MQTT broker is dedicated to maintaining massive long-lived connections and QoS delivery; Kafka is the critical peak-shaving reservoir — broker output is bursty (3× at rush hour) while TSDB write throughput is constant, and Kafka absorbs the difference; stream processing dedups, reorders out-of-order events, and downsamples before persisting.
In one line: MQTT's long connection + pub/sub trades connection reuse for a stateful server — the default for fleets of low-power devices.
Principle: HTTP is request-response with a fresh TCP/TLS handshake each time. For a device sending a 2-byte temperature every 10s, the handshake overhead (several KB) is a thousand times the payload, and the server cannot push (to close a valve remotely, the device must poll). MQTT is an OASIS-standard publish-subscribe protocol: a device holds one long connection to a broker and publishes by topic (e.g. fleet/car42/battery); the header is as little as 2 bytes; it supports server push, Last Will (auto-notify on disconnect), and retained messages. The cost: the broker must statefully maintain millions of connections and per-device sessions.
# MQTT QoS: three delivery guarantees, pick by data importance
# QoS 0 (at-most-once): fire and forget, no retransmit -> high-freq temperature
# QoS 1 (at-least-once): success on PUBACK, may duplicate -> ordinary events
# QoS 2 (exactly-once): 4-way PUBREC/PUBREL/PUBCOMP -> billing/alerts
client.publish("fleet/car42/temp", payload, qos=0) # losing one is fine
client.publish("fleet/car42/alert", payload, qos=1) # at-least-once + downstream idempotent dedup
# WARNING: QoS 2 is the most expensive (double round-trip + broker state); never use it for everything
In one line: the edge trades local compute for backhaul bandwidth + response latency, but introduces ops and consistency pain.
Principle: not all data is worth shipping. A car produces MB/s of sensor streams, but the cloud really needs summaries (per-minute averages) and anomalies (temperature over threshold). The edge gateway (a vehicle compute unit / factory gateway) does three things locally: ① filter & downsample — 20 raw samples into 1 summary; ② local decisions — cut power immediately on overheating, without waiting 200ms for a cloud round-trip; ③ offline buffering — store to disk when the car is in a tunnel, backfill on reconnect. This turns "centralized cloud processing" into "edge-cloud collaboration".
# Edge gateway: local aggregation + anomaly fast-path + offline buffer (pseudo-code)
buf = RingBuffer(maxlen=10000) # local buffer, no loss while offline
def on_sample(s):
if s.value > THRESHOLD: # anomaly: act locally + report at high QoS
actuator.shutdown() # e-stop in <100ms, no cloud round-trip
publish(s, qos=1)
window.add(s)
if window.full(): # normal: one aggregate per minute, -90% bandwidth
agg = window.summary() # min/max/avg/p95
if online: publish(agg, qos=0)
else: buf.append(agg) # offline: store now, backfill on reconnect
In one line: there must be a Kafka buffer between broker and TSDB, or write spikes will backpressure and break the whole chain.
Principle: the core tension in ingestion is bursty input (a fleet booting at rush hour, 3× instantaneous) vs constant downstream throughput (TSDB write capacity is fixed). If the broker writes the TSDB directly, a peak fills the TSDB → backpressure to the broker → connection pile-up → device reconnect storm → avalanche. Kafka as a reservoir: the broker writes Kafka at full speed (sequential write, very fast), and consumers pull at the rate the TSDB can sustain. Kafka also decouples — the same data is consumed independently by alerting, TSDB persistence, and ML training consumer groups. Partition by device_id to keep per-device data ordered.
In one line: the specifics of time-series (write-heavy, time-monotonic, neighboring values similar) let a purpose-built store save ~10× over a general DB.
Principle: telemetry has strong regularity — timestamps are near-arithmetic, adjacent values change little (temperature won't jump from 25 to 5000). Facebook's Gorilla (VLDB 2015) exploits both: timestamps use delta-of-delta (the second difference of evenly-spaced samples is almost always 0), values use XOR compression (XOR of adjacent floats leaves high bits zero, store only significant bits), shrinking each point from 16 bytes to an average of 1.37 bytes (~12×) — enabling all-in-memory storage and queries dozens of times faster. TimescaleDB takes another path: built on Postgres, it uses hypertables that auto-partition into time chunks, converts old chunks to compressed columnar (~10×), and keeps the full SQL ecosystem.
| TimescaleDB | InfluxDB | Prometheus | |
|---|---|---|---|
| Model | Postgres extension (SQL) | Purpose-built TSDB (columnar/Parquet) | Pull model |
| Writes | High, via chunk partitioning | Very high throughput | Medium (monitoring scale) |
| Compression | Columnar ~10x | Columnar + encoding | delta+XOR (Gorilla-like) |
| Best for | Need SQL/JOIN/relational | Pure time-series, huge scale | Infra monitoring |
-- TimescaleDB: create hypertable + auto-compression policy
SELECT create_hypertable('telemetry', 'ts',
chunk_time_interval => INTERVAL '1 day');
-- chunks older than 7 days auto-compress to columnar (~10x)
ALTER TABLE telemetry SET (timescaledb.compress,
timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('telemetry', INTERVAL '7 days');
-- continuous aggregates: pre-downsample so dashboards query minute-level, not raw seconds
SELECT add_retention_policy('telemetry', INTERVAL '90 days'); -- archive cold data
device_id × metric × region) is a separate series, indexed in memory. 10M devices × 20 metrics = 200M series — blowing out the in-memory index. Fix: never put high-cardinality fields (e.g. trip_id, random UUIDs) in tags/labels — put them in fields; aggregate devices into buckets. Early InfluxDB was notorious for cardinality issues.
trip_id/timestamps/random IDs in tags explodes the series count and OOMs the TSDB. A must-ask in any TSDB interview.device_id + ts).The key is tiering + pre-aggregation, designed around access pattern not "keep everything": hot tier (7 days, SSD) keeps raw seconds and serves 95% of queries; warm tier (7-90 days, compressed chunks) keeps only minute-level downsamples, 60× smaller; cold tier (90+ days, S3 Parquet) archives raw seconds.
"Any device sub-second lookup" hinges on partition pruning: the cold tier hash-buckets by device_id and partitions by day, so a single-device single-day query scans one small Parquet file and returns in seconds — no full scan. The cost is slow cross-device aggregation (many files), but that's an analytical query and minute-level latency is acceptable. The essence: archive raw cold + keep aggregates hot, decoupling cost from latency.
A double hit of reconnect storm + backfill flood:
Defenses: ① client random backoff — reconnect delay = base × 2^n + random(0, jitter), spreading reconnects over minutes (must be in firmware); ② broker admission limiting — a token bucket at the connection layer rejects over-rate handshakes to protect established connections; ③ backfill rate-limit + stagger, relying on Kafka to absorb the flood; ④ graceful degradation — shed QoS 0 historical samples during the flood, keep only QoS 1 critical events.
This is Day 23's "retries must carry jitter" amplified for IoT — the device count turns any synchronized behavior into a DDoS.
The root cause is that a broker is stateful: it maintains a TCP long connection, session state, subscriptions, and unacknowledged QoS 1/2 messages per connection. HTTP is stateless — any request routes to any node identically, so adding machines scales. Adding a broker node must solve:
This is exactly why EMQX chose Erlang/OTP — its lightweight processes (one per connection, a few KB each) + built-in distribution (transparent inter-node messaging) are built for "massive stateful long connections". In a thread-model language, the context switching for a million connections would crush you first.
Because it's a staged config rollout across tens of thousands of edge nodes, not a one-line code edit:
The essence: the edge swaps "deploy once" for "deploy N times", degrading ops from centralized easy-updates into distributed staging/consistency/rollback — the biggest hidden cost of the edge.
What writing by arrival order (processing-time) breaks:
The fix (see Day 20 stream processing): ① use event-time (the device's timestamp) not arrival time for windowing; ② watermarks tolerate out-of-order — define "allow N minutes late", close a window only after the watermark passes, and route over-late data to a side output for separate correction; ③ periodic NTP sync to correct drift; ④ idempotent dedup with device_id + event_ts as the key, so backfilled duplicates overwrite rather than append.
Core insight: IoT "time" is two things — when it happened and when it arrived, possibly hours apart. Aggregation/alerting/billing must all go by event time.