Day 35 Hard IoT MQTT Edge Time-Series DB

IoT & Edge — Throughput Engineering for 10M Devices Pinging Every 10sIngestion at Scale, MQTT, Edge Compute, Time-Series Storage

Scenario + Constraints

Design the telemetry backend for 10 million connected devices (an EV fleet / industrial sensors / smart meters): each reports 20 metrics — battery, temperature, GPS, current — every 10 seconds. This is a different species from a request-response web backend: it is write-dominated (99% writes), connections are long-lived not short requests, the network is unreliable (a car enters a tunnel, a sensor loses power), and data is append-only — written once, almost never updated.

High-Level Architecture

graph LR
    D["Devices / Sensors
tens of millions"] EG["Edge Gateway
filter · aggregate · local alert"] BR["MQTT Broker Cluster
EMQX · 1M conns/node"] K["Kafka Buffer
peak shaving · decouple"] SP["Stream Processing
Flink · dedup/downsample"] TSDB[("Time-Series DB
hot · compressed")] OBJ[("Object Storage
cold · Parquet")] AL["Alerting / Dashboards"] D -->|MQTT| EG -->|MQTT/QoS| BR BR --> K --> SP SP --> TSDB SP --> OBJ TSDB --> AL classDef dev fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef pipe fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class D dev class EG,BR edge class K,SP pipe class TSDB,OBJ,AL store

Four stages, distinct jobs: edge reduces volume, broker absorbs connections, Kafka shaves peaks & decouples, TSDB compresses & stores. Any stage writing directly to the next gets flooded.

The core idea is reduce and decouple at every layer. The edge gateway aggregates raw samples into summaries locally, cutting ~90% of backhaul bandwidth; the MQTT broker is dedicated to maintaining massive long-lived connections and QoS delivery; Kafka is the critical peak-shaving reservoir — broker output is bursty (3× at rush hour) while TSDB write throughput is constant, and Kafka absorbs the difference; stream processing dedups, reorders out-of-order events, and downsamples before persisting.

Key Technical Points

1. Device Protocol: MQTT vs HTTP vs CoAP

In one line: MQTT's long connection + pub/sub trades connection reuse for a stateful server — the default for fleets of low-power devices.

Principle: HTTP is request-response with a fresh TCP/TLS handshake each time. For a device sending a 2-byte temperature every 10s, the handshake overhead (several KB) is a thousand times the payload, and the server cannot push (to close a valve remotely, the device must poll). MQTT is an OASIS-standard publish-subscribe protocol: a device holds one long connection to a broker and publishes by topic (e.g. fleet/car42/battery); the header is as little as 2 bytes; it supports server push, Last Will (auto-notify on disconnect), and retained messages. The cost: the broker must statefully maintain millions of connections and per-device sessions.

Trade-offs:
# MQTT QoS: three delivery guarantees, pick by data importance
# QoS 0 (at-most-once): fire and forget, no retransmit  -> high-freq temperature
# QoS 1 (at-least-once): success on PUBACK, may duplicate -> ordinary events
# QoS 2 (exactly-once): 4-way PUBREC/PUBREL/PUBCOMP        -> billing/alerts
client.publish("fleet/car42/temp",  payload, qos=0)   # losing one is fine
client.publish("fleet/car42/alert", payload, qos=1)   # at-least-once + downstream idempotent dedup
# WARNING: QoS 2 is the most expensive (double round-trip + broker state); never use it for everything
Real-world:

2. Edge Compute: Process Where Data Is Born

In one line: the edge trades local compute for backhaul bandwidth + response latency, but introduces ops and consistency pain.

Principle: not all data is worth shipping. A car produces MB/s of sensor streams, but the cloud really needs summaries (per-minute averages) and anomalies (temperature over threshold). The edge gateway (a vehicle compute unit / factory gateway) does three things locally: ① filter & downsample — 20 raw samples into 1 summary; ② local decisions — cut power immediately on overheating, without waiting 200ms for a cloud round-trip; ③ offline buffering — store to disk when the car is in a tunnel, backfill on reconnect. This turns "centralized cloud processing" into "edge-cloud collaboration".

Trade-offs:
# Edge gateway: local aggregation + anomaly fast-path + offline buffer (pseudo-code)
buf = RingBuffer(maxlen=10000)          # local buffer, no loss while offline
def on_sample(s):
    if s.value > THRESHOLD:             # anomaly: act locally + report at high QoS
        actuator.shutdown()             # e-stop in <100ms, no cloud round-trip
        publish(s, qos=1)
    window.add(s)
    if window.full():                   # normal: one aggregate per minute, -90% bandwidth
        agg = window.summary()          # min/max/avg/p95
        if online: publish(agg, qos=0)
        else:      buf.append(agg)       # offline: store now, backfill on reconnect
Real-world:

3. Ingestion Pipeline: Peak Shaving, Decoupling, Backpressure

In one line: there must be a Kafka buffer between broker and TSDB, or write spikes will backpressure and break the whole chain.

Principle: the core tension in ingestion is bursty input (a fleet booting at rush hour, 3× instantaneous) vs constant downstream throughput (TSDB write capacity is fixed). If the broker writes the TSDB directly, a peak fills the TSDB → backpressure to the broker → connection pile-up → device reconnect storm → avalanche. Kafka as a reservoir: the broker writes Kafka at full speed (sequential write, very fast), and consumers pull at the rate the TSDB can sustain. Kafka also decouples — the same data is consumed independently by alerting, TSDB persistence, and ML training consumer groups. Partition by device_id to keep per-device data ordered.

Trade-offs:
Real-world:

4. Time-Series DB: Compression, Downsampling, Cardinality

In one line: the specifics of time-series (write-heavy, time-monotonic, neighboring values similar) let a purpose-built store save ~10× over a general DB.

Principle: telemetry has strong regularity — timestamps are near-arithmetic, adjacent values change little (temperature won't jump from 25 to 5000). Facebook's Gorilla (VLDB 2015) exploits both: timestamps use delta-of-delta (the second difference of evenly-spaced samples is almost always 0), values use XOR compression (XOR of adjacent floats leaves high bits zero, store only significant bits), shrinking each point from 16 bytes to an average of 1.37 bytes (~12×) — enabling all-in-memory storage and queries dozens of times faster. TimescaleDB takes another path: built on Postgres, it uses hypertables that auto-partition into time chunks, converts old chunks to compressed columnar (~10×), and keeps the full SQL ecosystem.

TimescaleDBInfluxDBPrometheus
ModelPostgres extension (SQL)Purpose-built TSDB (columnar/Parquet)Pull model
WritesHigh, via chunk partitioningVery high throughputMedium (monitoring scale)
CompressionColumnar ~10xColumnar + encodingdelta+XOR (Gorilla-like)
Best forNeed SQL/JOIN/relationalPure time-series, huge scaleInfra monitoring
-- TimescaleDB: create hypertable + auto-compression policy
SELECT create_hypertable('telemetry', 'ts',
       chunk_time_interval => INTERVAL '1 day');
-- chunks older than 7 days auto-compress to columnar (~10x)
ALTER TABLE telemetry SET (timescaledb.compress,
       timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('telemetry', INTERVAL '7 days');
-- continuous aggregates: pre-downsample so dashboards query minute-level, not raw seconds
SELECT add_retention_policy('telemetry', INTERVAL '90 days'); -- archive cold data
Cardinality Explosion: the number-one killer of time-series DBs. Each unique tag combination (device_id × metric × region) is a separate series, indexed in memory. 10M devices × 20 metrics = 200M series — blowing out the in-memory index. Fix: never put high-cardinality fields (e.g. trip_id, random UUIDs) in tags/labels — put them in fields; aggregate devices into buckets. Early InfluxDB was notorious for cardinality issues.
Real-world:

Scaling & Optimization

Common Pitfalls + Interview Follow-ups

1. QoS 2 for everything: slapping exactly-once on every message forces the broker to keep four-way-handshake state per message — millions of devices crush it. Tiering by data value is table stakes.
2. High-cardinality tags: putting trip_id/timestamps/random IDs in tags explodes the series count and OOMs the TSDB. A must-ask in any TSDB interview.
3. Broker writes the DB with no buffer: forgetting Kafka peak shaving leads to a rush-hour backpressure avalanche. Interviewers love "what if traffic triples?"
4. Ignoring clocks and out-of-order data: bad device clocks and offline backfill cause out-of-order arrival. Use event-time + watermarks (see Day 20), never write by arrival order.
5. Reconnect storm (thundering herd): a broker restart or network blip makes millions of devices reconnect at once and kill the cluster. Reconnects must use random backoff + jitter.
6. Not separating "droppable" from "must-keep": temperature samples are droppable, alerts/billing are not. Downstream consumers must be idempotent (dedup by device_id + ts).

Deeper Resources

Food for Thought (click to expand)

1. 20M points/s, 3.5TB/day compressed, but the client wants "any device at any moment, sub-second lookup". All-hot won't fit, all-archived is too slow. How?

The key is tiering + pre-aggregation, designed around access pattern not "keep everything": hot tier (7 days, SSD) keeps raw seconds and serves 95% of queries; warm tier (7-90 days, compressed chunks) keeps only minute-level downsamples, 60× smaller; cold tier (90+ days, S3 Parquet) archives raw seconds.

"Any device sub-second lookup" hinges on partition pruning: the cold tier hash-buckets by device_id and partitions by day, so a single-device single-day query scans one small Parquet file and returns in seconds — no full scan. The cost is slow cross-device aggregation (many files), but that's an analytical query and minute-level latency is acceptable. The essence: archive raw cold + keep aggregates hot, decoupling cost from latency.

2. A city blackout: hundreds of thousands of smart meters disconnect at once, then reconnect together when power returns. What happens? How do you prevent it?

A double hit of reconnect storm + backfill flood:

  • Reconnect storm: hundreds of thousands of devices start TCP+TLS handshakes in the same second; broker CPU (TLS handshakes are CPU-heavy) maxes out instantly, handshakes queue and time out → devices retry → worse.
  • Backfill flood: each device buffered N records while offline and dumps them all on reconnect; instantaneous writes are dozens of times normal, breaking the downstream of Kafka.

Defenses:client random backoff — reconnect delay = base × 2^n + random(0, jitter), spreading reconnects over minutes (must be in firmware); ② broker admission limiting — a token bucket at the connection layer rejects over-rate handshakes to protect established connections; ③ backfill rate-limit + stagger, relying on Kafka to absorb the flood; ④ graceful degradation — shed QoS 0 historical samples during the flood, keep only QoS 1 critical events.

This is Day 23's "retries must carry jitter" amplified for IoT — the device count turns any synchronized behavior into a DDoS.

3. Why is an MQTT broker hard to scale horizontally while a stateless HTTP service just adds machines? What must adding a broker node solve?

The root cause is that a broker is stateful: it maintains a TCP long connection, session state, subscriptions, and unacknowledged QoS 1/2 messages per connection. HTTP is stateless — any request routes to any node identically, so adding machines scales. Adding a broker node must solve:

  • Connection ownership: device A connects to node 1 and wants to message device B (subscribed to the same topic, connected to node 3); nodes must route/forward between each other — requiring a shared cluster-wide subscription table (EMQX uses a distributed routing table).
  • Session migration: node 1 dies, A reconnects to node 2, and its persistent session (offline messages, subscriptions) must recover — requiring externalized or cluster-replicated session state.
  • Sticky load balancing: not simple round-robin; the same device should land on the same node stably (connections are long-lived), commonly via consistent hashing.

This is exactly why EMQX chose Erlang/OTP — its lightweight processes (one per connection, a few KB each) + built-in distribution (transparent inter-node messaging) are built for "massive stateful long connections". In a thread-model language, the context switching for a million connections would crush you first.

4. The edge gateway runs a rule "temp > 80 → cut power". Now change the threshold to 75. Why is this seemingly one-line change a hard problem at 10M-device scale?

Because it's a staged config rollout across tens of thousands of edge nodes, not a one-line code edit:

  • Devices are often offline: you can't assume all are online; use a device shadow — store "desired threshold = 75" in the cloud shadow, and devices pull and reconcile on connect.
  • No all-at-once push: millions pulling new config simultaneously self-inflicts a request storm; you must stage (1% → observe → ramp) and be able to roll back.
  • Safety logic must fail-safe: if a push fails or config is corrupt, the device falls back to a safe default (rather over-stop than under-stop) — it must not stop protecting just because it can't fetch config.

The essence: the edge swaps "deploy once" for "deploy N times", degrading ops from centralized easy-updates into distributed staging/consistency/rollback — the biggest hidden cost of the edge.

5. Device clocks are routinely wrong (cheap RTC drift, offline backfill), so "event time" arrives out of order. What breaks if you write the TSDB by arrival order? What to do?

What writing by arrival order (processing-time) breaks:

  • Wrong aggregation windows: a car backfills data from "an hour ago" after reconnecting; counting it into the current minute window pollutes the average and creates fake spikes in trend charts.
  • Misaligned downsampling: continuous aggregates bucket by arrival time, scattering data from the same real moment across buckets, scrambling the timeline on lookup.
  • False alerts: a late old "overheating" event is treated as real-time, triggering an already-meaningless action.

The fix (see Day 20 stream processing): ① use event-time (the device's timestamp) not arrival time for windowing; ② watermarks tolerate out-of-order — define "allow N minutes late", close a window only after the watermark passes, and route over-late data to a side output for separate correction; ③ periodic NTP sync to correct drift; ④ idempotent dedup with device_id + event_ts as the key, so backfilled duplicates overwrite rather than append.

Core insight: IoT "time" is two things — when it happened and when it arrived, possibly hours apart. Aggregation/alerting/billing must all go by event time.