Day 35 Hard IoT MQTT Edge Time-Series DB

IoT & Edge — Throughput Engineering for 10M Devices Pinging Every 10sIngestion at Scale, MQTT, Edge Compute, Time-Series Storage

Scenario + Constraints

Design the telemetry backend for 10 million connected devices (an EV fleet / industrial sensors / smart meters): each reports 20 metrics — battery, temperature, GPS, current — every 10 seconds. This is a different species from a request-response web backend: it is write-dominated (99% writes), connections are long-lived not short requests, the network is unreliable (a car enters a tunnel, a sensor loses power), and data is append-only — written once, almost never updated.

Write QPS: 10M ÷ 10s × 20 metrics = 20M points/s. Peak (a fleet booting up at rush hour) is another 2-3×.
Connection scale: tens of millions of concurrent long-lived connections; ~1M per broker node, so a cluster of a dozen-plus brokers.
Storage: ~2 bytes/point compressed (16 bytes raw) → 20M/s × 2B ≈ 3.5 TB/day compressed; hot queries in ms, cold data archived to object storage.
Latency SLO: ingest p99 < 1s (alerting path), dashboard aggregation p95 < 500ms — but device-side alert decisions must be < 100ms, which forces edge compute.
Reliability: dropping a temperature sample is fine (next frame overwrites it); dropping a "battery overheating" alert is an incident — QoS must be tiered.

High-Level Architecture

graph LR
    D["Devices / Sensors
tens of millions"]
    EG["Edge Gateway
filter · aggregate · local alert"]
    BR["MQTT Broker Cluster
EMQX · 1M conns/node"]
    K["Kafka Buffer
peak shaving · decouple"]
    SP["Stream Processing
Flink · dedup/downsample"]
    TSDB[("Time-Series DB
hot · compressed")]
    OBJ[("Object Storage
cold · Parquet")]
    AL["Alerting / Dashboards"]

    D -->|MQTT| EG -->|MQTT/QoS| BR
    BR --> K --> SP
    SP --> TSDB
    SP --> OBJ
    TSDB --> AL

    classDef dev fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef pipe fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class D dev
    class EG,BR edge
    class K,SP pipe
    class TSDB,OBJ,AL store

Four stages, distinct jobs: edge reduces volume, broker absorbs connections, Kafka shaves peaks & decouples, TSDB compresses & stores. Any stage writing directly to the next gets flooded.

The core idea is reduce and decouple at every layer. The edge gateway aggregates raw samples into summaries locally, cutting ~90% of backhaul bandwidth; the MQTT broker is dedicated to maintaining massive long-lived connections and QoS delivery; Kafka is the critical peak-shaving reservoir — broker output is bursty (3× at rush hour) while TSDB write throughput is constant, and Kafka absorbs the difference; stream processing dedups, reorders out-of-order events, and downsamples before persisting.

Key Technical Points

1. Device Protocol: MQTT vs HTTP vs CoAP

In one line: MQTT's long connection + pub/sub trades connection reuse for a stateful server — the default for fleets of low-power devices.

Principle: HTTP is request-response with a fresh TCP/TLS handshake each time. For a device sending a 2-byte temperature every 10s, the handshake overhead (several KB) is a thousand times the payload, and the server cannot push (to close a valve remotely, the device must poll). MQTT is an OASIS-standard publish-subscribe protocol: a device holds one long connection to a broker and publishes by topic (e.g. fleet/car42/battery); the header is as little as 2 bytes; it supports server push, Last Will (auto-notify on disconnect), and retained messages. The cost: the broker must statefully maintain millions of connections and per-device sessions.

Trade-offs:

MQTT: ✅ long connection saves handshakes, bidirectional push, tiered QoS, low power; ❌ stateful broker is hard to scale, connection migration is complex, not for large files (OTA firmware uses HTTP).
HTTP/REST: ✅ stateless and easy to scale, mature ecosystem, firewall-friendly; ❌ massively wasteful for frequent small packets, no server push, battery killer. Good for low-frequency batch reporting (hourly).
CoAP: ✅ UDP-based, lighter than MQTT, fits extremely constrained devices (a few KB RAM); ❌ UDP unreliability must be handled yourself, small ecosystem, hard NAT traversal.

# MQTT QoS: three delivery guarantees, pick by data importance
# QoS 0 (at-most-once): fire and forget, no retransmit  -> high-freq temperature
# QoS 1 (at-least-once): success on PUBACK, may duplicate -> ordinary events
# QoS 2 (exactly-once): 4-way PUBREC/PUBREL/PUBCOMP        -> billing/alerts
client.publish("fleet/car42/temp",  payload, qos=0)   # losing one is fine
client.publish("fleet/car42/alert", payload, qos=1)   # at-least-once + downstream idempotent dedup
# WARNING: QoS 2 is the most expensive (double round-trip + broker state); never use it for everything

Real-world:

EMQX: a distributed MQTT broker written in Erlang/OTP, reaching tens of millions of connections per cluster and ~1M per node — Erlang's lightweight-process model is a natural fit for massive long connections (see EMQ's MQTT vs HTTP comparison).
HiveMQ / AWS IoT Core: both center on MQTT; HiveMQ's comparison shows MQTT throughput far exceeding HTTP in multi-message scenarios.
Facebook Messenger: early mobile push used MQTT, precisely for the power savings and low-latency delivery of a long connection.

2. Edge Compute: Process Where Data Is Born

In one line: the edge trades local compute for backhaul bandwidth + response latency, but introduces ops and consistency pain.

Principle: not all data is worth shipping. A car produces MB/s of sensor streams, but the cloud really needs summaries (per-minute averages) and anomalies (temperature over threshold). The edge gateway (a vehicle compute unit / factory gateway) does three things locally: ① filter & downsample — 20 raw samples into 1 summary; ② local decisions — cut power immediately on overheating, without waiting 200ms for a cloud round-trip; ③ offline buffering — store to disk when the car is in a tunnel, backfill on reconnect. This turns "centralized cloud processing" into "edge-cloud collaboration".

Trade-offs:

All-cloud: ✅ centralized logic, easy to update, simple devices; ❌ bandwidth cost explodes, paralyzed when offline, control-loop latency too high (no safety e-stop).
All-edge: ✅ low latency, offline-resilient, bandwidth-thrifty; ❌ deploying/upgrading thousands of edge nodes is a nightmare, model updates lag, limited edge compute.
Edge-cloud collaboration (mainstream): edge handles real-time reduction + safety decisions, cloud handles training + global optimization + long-term storage. ✅ best of both; ❌ you must manage "which logic runs where", and cross-layer debugging is hard.

# Edge gateway: local aggregation + anomaly fast-path + offline buffer (pseudo-code)
buf = RingBuffer(maxlen=10000)          # local buffer, no loss while offline
def on_sample(s):
    if s.value > THRESHOLD:             # anomaly: act locally + report at high QoS
        actuator.shutdown()             # e-stop in <100ms, no cloud round-trip
        publish(s, qos=1)
    window.add(s)
    if window.full():                   # normal: one aggregate per minute, -90% bandwidth
        agg = window.summary()          # min/max/avg/p95
        if online: publish(agg, qos=0)
        else:      buf.append(agg)       # offline: store now, backfill on reconnect

Real-world:

Tesla Fleet Telemetry: the open-source teslamotors/fleet-telemetry uses vehicle-side config to control which fields and frequency are reported (reduction at the edge); the cloud ingests via a WebSocket/MQTT frontend into Kafka.
Industrial IoT (PLC + gateway): factory gateways run a rules engine locally for safety interlocks with millisecond shutdowns; the cloud only receives aggregated KPIs.
CDN edge (same idea): Cloudflare Workers / Lambda@Edge push compute closest to the user — edge compute is the same philosophy in Web and IoT.

3. Ingestion Pipeline: Peak Shaving, Decoupling, Backpressure

In one line: there must be a Kafka buffer between broker and TSDB, or write spikes will backpressure and break the whole chain.

Principle: the core tension in ingestion is bursty input (a fleet booting at rush hour, 3× instantaneous) vs constant downstream throughput (TSDB write capacity is fixed). If the broker writes the TSDB directly, a peak fills the TSDB → backpressure to the broker → connection pile-up → device reconnect storm → avalanche. Kafka as a reservoir: the broker writes Kafka at full speed (sequential write, very fast), and consumers pull at the rate the TSDB can sustain. Kafka also decouples — the same data is consumed independently by alerting, TSDB persistence, and ML training consumer groups. Partition by device_id to keep per-device data ordered.

Trade-offs:

Broker writes TSDB directly: ✅ simple, low latency; ❌ no peak shaving, no decoupling — a peak backpressures into an avalanche. Only for small scale.
Kafka buffer (mainstream): ✅ peak shaving, decoupling, replayable (replay to fix a buggy consumer); ❌ one extra hop (tens of ms), Kafka itself needs ops and scaling.
Backpressure vs shedding: when downstream can't keep up, either backpressure (slow the upstream — but IoT devices can't be slowed, they just drop connections) or actively shed low-priority data (QoS 0 temperature samples droppable, QoS 1 alerts must stay).

Real-world:

Tesla: fleet telemetry flows through a WebSocket frontend → Kafka cluster for peak shaving → downstream Cassandra/analytics pipelines; public talks cite trillions of IoT messages per day through Kafka.
Uber: driver/rider app location reports likewise go through a Kafka buffer before fanning out to dispatch, ETA, and risk consumer groups.
LinkedIn: Kafka's birthplace — built precisely to solve "how to decouple high-throughput multi-source data and distribute it to many consumers".

4. Time-Series DB: Compression, Downsampling, Cardinality

In one line: the specifics of time-series (write-heavy, time-monotonic, neighboring values similar) let a purpose-built store save ~10× over a general DB.

Principle: telemetry has strong regularity — timestamps are near-arithmetic, adjacent values change little (temperature won't jump from 25 to 5000). Facebook's Gorilla (VLDB 2015) exploits both: timestamps use delta-of-delta (the second difference of evenly-spaced samples is almost always 0), values use XOR compression (XOR of adjacent floats leaves high bits zero, store only significant bits), shrinking each point from 16 bytes to an average of 1.37 bytes (~12×) — enabling all-in-memory storage and queries dozens of times faster. TimescaleDB takes another path: built on Postgres, it uses hypertables that auto-partition into time chunks, converts old chunks to compressed columnar (~10×), and keeps the full SQL ecosystem.

	TimescaleDB	InfluxDB	Prometheus
Model	Postgres extension (SQL)	Purpose-built TSDB (columnar/Parquet)	Pull model
Writes	High, via chunk partitioning	Very high throughput	Medium (monitoring scale)
Compression	Columnar ~10x	Columnar + encoding	delta+XOR (Gorilla-like)
Best for	Need SQL/JOIN/relational	Pure time-series, huge scale	Infra monitoring

-- TimescaleDB: create hypertable + auto-compression policy
SELECT create_hypertable('telemetry', 'ts',
       chunk_time_interval => INTERVAL '1 day');
-- chunks older than 7 days auto-compress to columnar (~10x)
ALTER TABLE telemetry SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('telemetry', INTERVAL '7 days');
-- continuous aggregates: pre-downsample so dashboards query minute-level, not raw seconds
SELECT add_retention_policy('telemetry', INTERVAL '90 days'); -- archive cold data

Cardinality Explosion: the number-one killer of time-series DBs. Each unique tag combination (device_id × metric × region) is a separate series, indexed in memory. 10M devices × 20 metrics = 200M series — blowing out the in-memory index. Fix: never put high-cardinality fields (e.g. trip_id, random UUIDs) in tags/labels — put them in fields; aggregate devices into buckets. Early InfluxDB was notorious for cardinality issues.

Real-world:

Facebook Gorilla: the VLDB 2015 paper, delta-of-delta + XOR shrinking 16B to 1.37B, open-sourced as Beringei; its compression is borrowed by nearly every TSDB since.
TimescaleDB: Timescale's official comparison details the hypertable/chunk/columnar-compression design, pitching "time-series + full SQL".
Prometheus: the de facto monitoring standard; its local TSDB uses Gorilla-like delta+XOR encoding with 2-hour blocks.

Scaling & Optimization

Tiered storage (hot/warm/cold): last 7 days hot in the TSDB (SSD/memory), 7-90 days downsampled and warm, 90+ days to Parquet in object storage (S3) queried via Trino/Athena. An order-of-magnitude cost drop.
Downsampling pyramid: raw seconds → minutes → hours via continuous aggregates; dashboards auto-pick granularity by time window, never scanning huge raw sets.
Horizontal connection scaling: cluster the MQTT brokers, sharding devices across nodes with consistent hashing; an L4 load balancer + connection migration so devices reconnect to a healthy node on failure.
OTA & Device Shadow: devices are often offline, so the cloud keeps a "shadow" (desired vs reported) that the device syncs on reconnect — AWS IoT Device Shadow is exactly this pattern.
Multi-region: devices attach to the nearest regional broker; telemetry lands locally and asynchronously aggregates to a global analytics layer, cutting cross-ocean latency.

Common Pitfalls + Interview Follow-ups

1. QoS 2 for everything: slapping exactly-once on every message forces the broker to keep four-way-handshake state per message — millions of devices crush it. Tiering by data value is table stakes.

2. High-cardinality tags: putting trip_id/timestamps/random IDs in tags explodes the series count and OOMs the TSDB. A must-ask in any TSDB interview.

3. Broker writes the DB with no buffer: forgetting Kafka peak shaving leads to a rush-hour backpressure avalanche. Interviewers love "what if traffic triples?"

4. Ignoring clocks and out-of-order data: bad device clocks and offline backfill cause out-of-order arrival. Use event-time + watermarks (see Day 20), never write by arrival order.

5. Reconnect storm (thundering herd): a broker restart or network blip makes millions of devices reconnect at once and kill the cluster. Reconnects must use random backoff + jitter.

6. Not separating "droppable" from "must-keep": temperature samples are droppable, alerts/billing are not. Downstream consumers must be idempotent (dedup by device_id + ts).

Deeper Resources

"Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015): the foundational work on time-series compression — delta-of-delta + XOR is a must-read.
HiveMQ MQTT Essentials / MQTT vs HTTP: the most systematic tutorial on the MQTT protocol, QoS, and session mechanics.
teslamotors/fleet-telemetry: a production-grade open-source fleet telemetry ingestion layer — see real edge config + Kafka landing.
TimescaleDB vs InfluxDB: hypertable/chunk/compression and cardinality trade-offs.
"Designing Data-Intensive Applications" (Kleppmann, stream-processing & storage chapters): the underlying principles of ingestion pipelines and backpressure.

Food for Thought (click to expand)

1. 20M points/s, 3.5TB/day compressed, but the client wants "any device at any moment, sub-second lookup". All-hot won't fit, all-archived is too slow. How?

The key is tiering + pre-aggregation, designed around access pattern not "keep everything": hot tier (7 days, SSD) keeps raw seconds and serves 95% of queries; warm tier (7-90 days, compressed chunks) keeps only minute-level downsamples, 60× smaller; cold tier (90+ days, S3 Parquet) archives raw seconds.

"Any device sub-second lookup" hinges on partition pruning: the cold tier hash-buckets by device_id and partitions by day, so a single-device single-day query scans one small Parquet file and returns in seconds — no full scan. The cost is slow cross-device aggregation (many files), but that's an analytical query and minute-level latency is acceptable. The essence: archive raw cold + keep aggregates hot, decoupling cost from latency.

2. A city blackout: hundreds of thousands of smart meters disconnect at once, then reconnect together when power returns. What happens? How do you prevent it?

A double hit of reconnect storm + backfill flood:

Reconnect storm: hundreds of thousands of devices start TCP+TLS handshakes in the same second; broker CPU (TLS handshakes are CPU-heavy) maxes out instantly, handshakes queue and time out → devices retry → worse.
Backfill flood: each device buffered N records while offline and dumps them all on reconnect; instantaneous writes are dozens of times normal, breaking the downstream of Kafka.

Defenses: ① client random backoff — reconnect delay = base × 2^n + random(0, jitter), spreading reconnects over minutes (must be in firmware); ② broker admission limiting — a token bucket at the connection layer rejects over-rate handshakes to protect established connections; ③ backfill rate-limit + stagger, relying on Kafka to absorb the flood; ④ graceful degradation — shed QoS 0 historical samples during the flood, keep only QoS 1 critical events.

This is Day 23's "retries must carry jitter" amplified for IoT — the device count turns any synchronized behavior into a DDoS.

3. Why is an MQTT broker hard to scale horizontally while a stateless HTTP service just adds machines? What must adding a broker node solve?

The root cause is that a broker is stateful: it maintains a TCP long connection, session state, subscriptions, and unacknowledged QoS 1/2 messages per connection. HTTP is stateless — any request routes to any node identically, so adding machines scales. Adding a broker node must solve:

Connection ownership: device A connects to node 1 and wants to message device B (subscribed to the same topic, connected to node 3); nodes must route/forward between each other — requiring a shared cluster-wide subscription table (EMQX uses a distributed routing table).
Session migration: node 1 dies, A reconnects to node 2, and its persistent session (offline messages, subscriptions) must recover — requiring externalized or cluster-replicated session state.
Sticky load balancing: not simple round-robin; the same device should land on the same node stably (connections are long-lived), commonly via consistent hashing.

This is exactly why EMQX chose Erlang/OTP — its lightweight processes (one per connection, a few KB each) + built-in distribution (transparent inter-node messaging) are built for "massive stateful long connections". In a thread-model language, the context switching for a million connections would crush you first.

4. The edge gateway runs a rule "temp > 80 → cut power". Now change the threshold to 75. Why is this seemingly one-line change a hard problem at 10M-device scale?

Because it's a staged config rollout across tens of thousands of edge nodes, not a one-line code edit:

Devices are often offline: you can't assume all are online; use a device shadow — store "desired threshold = 75" in the cloud shadow, and devices pull and reconcile on connect.
No all-at-once push: millions pulling new config simultaneously self-inflicts a request storm; you must stage (1% → observe → ramp) and be able to roll back.
Safety logic must fail-safe: if a push fails or config is corrupt, the device falls back to a safe default (rather over-stop than under-stop) — it must not stop protecting just because it can't fetch config.

The essence: the edge swaps "deploy once" for "deploy N times", degrading ops from centralized easy-updates into distributed staging/consistency/rollback — the biggest hidden cost of the edge.

5. Device clocks are routinely wrong (cheap RTC drift, offline backfill), so "event time" arrives out of order. What breaks if you write the TSDB by arrival order? What to do?

What writing by arrival order (processing-time) breaks:

Wrong aggregation windows: a car backfills data from "an hour ago" after reconnecting; counting it into the current minute window pollutes the average and creates fake spikes in trend charts.
Misaligned downsampling: continuous aggregates bucket by arrival time, scattering data from the same real moment across buckets, scrambling the timeline on lookup.
False alerts: a late old "overheating" event is treated as real-time, triggering an already-meaningless action.

The fix (see Day 20 stream processing): ① use event-time (the device's timestamp) not arrival time for windowing; ② watermarks tolerate out-of-order — define "allow N minutes late", close a window only after the watermark passes, and route over-late data to a side output for separate correction; ③ periodic NTP sync to correct drift; ④ idempotent dedup with device_id + event_ts as the key, so backfilled duplicates overwrite rather than append.

Core insight: IoT "time" is two things — when it happened and when it arrived, possibly hours apart. Aggregation/alerting/billing must all go by event time.

← Back to index