Day 18 Hard Billing Subscriptions Proration Metering

Subscription & Billing: Turning "Charge by Time and Usage" into an Auditable Engine

Lifecycle State Machine, Proration, Usage Metering, Multi-currency & Tax

Problem & Constraints

Design a SaaS billing platform (think Stripe Billing / Chargebee): 100K business subscriptions, hybrid pricing = monthly/annual fixed fee (seat-based) + usage-based charges (API calls, storage GB). Customers upgrade/downgrade and add/remove seats at any time, across 30+ countries, in multiple currencies, with automatic tax. This is the upstream of Day 17 payments: payments solve "how to safely collect a charge," billing solves "how much to collect this month, from whom, and why."

Key constraints to clarify in an interview:

High-level Architecture

graph TD
    CAT["Pricing Catalog<br/>products/prices/plans"] --> SUB["Subscription Service<br/>lifecycle state machine"]
    EV["usage events ~1B/day"] --> ING["Metering Ingest<br/>dedup + idempotent"]
    ING --> AGG["Aggregation Rollup<br/>Redis→Postgres"]
    SUB --> BILL["Billing Engine<br/>cycle trigger / proration"]
    AGG --> BILL
    TAX["Tax Engine<br/>rates + rules"] --> BILL
    BILL --> INV[("Invoice Store<br/>immutable invoices")]
    INV --> PAY["Payment / collection<br/>Day 17 idempotent charge"]
    PAY -->|failure| DUN["Dunning<br/>retry + recovery"]
    INV --> LED[("Ledger<br/>double-entry / rev rec")]

Component roles: ① Pricing Catalog defines products, prices, plans (versioned, so price changes don't touch existing subscriptions); ② Subscription Service holds each subscription's lifecycle state machine; ③ Metering dedupes and aggregates massive usage events into billable quantities; ④ Billing Engine is the core — at cycle end it rolls "fixed fee + usage + proration adjustments + tax" into one invoice; ⑤ the invoice goes to Day 17's idempotent payment for collection, failures enter Dunning; ⑥ all amounts land in the Ledger for revenue recognition and reconciliation.

Key Technical Points

① Subscription Lifecycle State Machine + Cycle Trigger — bill on time, never miss, never double

Core trade-off: an explicit persisted state machine replaces if-else scattered everywhere and the risk of double-billing.

[Principle] A subscription isn't a boolean "active/inactive" but a state machine: trialing → active → past_due → canceled / unpaid. Transitions are driven by two event types: time (cycle end = time to invoice) and actions (upgrade, cancel, payment success/failure). On-time billing has two trigger styles: a scan cron (every minute query next_billing_at <= now) or a scheduled job registered per subscription. The key is idempotency: use (subscription_id, period_start) as a unique key for "generate invoice for period P," so a re-run cron only hits the existing invoice and never double-bills.
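
The state machine described above can be sketched as a lookup table from (state, event) pairs to next states. This is only an illustration: the state names come from the text, while the event names (`trial_ended`, `retries_exhausted`, and so on) and the exact set of allowed transitions are assumptions, not taken from any particular billing product.

```python
# Hypothetical transition table: (current_state, event) -> next_state.
# States come from the text; event names are assumptions for illustration.
TRANSITIONS = {
    ("trialing", "trial_ended"): "active",
    ("trialing", "cancel"): "canceled",
    ("active", "payment_failed"): "past_due",
    ("active", "cancel"): "canceled",
    ("past_due", "payment_succeeded"): "active",
    ("past_due", "retries_exhausted"): "unpaid",
    ("past_due", "cancel"): "canceled",
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, or raise."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} not allowed in state {state!r}") from None
```

Keeping every legal transition in one table means an unexpected event (say, a payment success arriving for a canceled subscription) fails loudly instead of being absorbed by a stray if-else branch.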

Options compared:
# pseudo-code: idempotent cycle billing (cron every minute)
def bill_due_subscriptions(now):
    due = db.query(
        "SELECT * FROM subscriptions "
        "WHERE status IN ('active','past_due') AND next_billing_at <= %s", now)
    for sub in due:
        period = (sub.current_period_start, sub.current_period_end)
        # unique key (subscription_id, period_start): one invoice per period
        inv = db.get_or_create_invoice(sub.id, period[0])
        if inv.status == 'draft':               # not yet issued -> build and issue
            add_recurring_lines(inv, sub)       # fixed fee / seats
            add_metered_lines(inv, sub, period) # usage (see point ③)
            finalize_and_charge(inv)            # finalize -> hand to payment
        # already issued (e.g. crash before the period advanced): skip billing,
        # but still advance, or the subscription stays "due" forever
        sub.advance_period()                    # advance to next cycle
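
The `get_or_create_invoice` step in the pseudo-code above can be made concrete with a database unique constraint. The sketch below uses SQLite from the standard library as a stand-in for the production database; the table layout, column names, and the `sub_1` identifier are hypothetical.

```python
import sqlite3


def get_or_create_invoice(conn, subscription_id, period_start):
    """Return (invoice_id, created) for one subscription period.

    The UNIQUE constraint turns the insert into a no-op when an invoice
    for (subscription_id, period_start) already exists.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO invoices (subscription_id, period_start, status) "
        "VALUES (?, ?, 'draft')",
        (subscription_id, period_start),
    )
    created = cur.rowcount == 1
    row = conn.execute(
        "SELECT id FROM invoices WHERE subscription_id = ? AND period_start = ?",
        (subscription_id, period_start),
    ).fetchone()
    return row[0], created


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " subscription_id TEXT NOT NULL,"
    " period_start TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " UNIQUE (subscription_id, period_start))"
)

first = get_or_create_invoice(conn, "sub_1", "2024-01-01")
second = get_or_create_invoice(conn, "sub_1", "2024-01-01")  # simulated cron re-run
```

The second call finds the existing row rather than creating a new one, which is what lets a re-run cron job be harmless.
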
Real-world cases:

② Proration — how to "settle the difference" on mid-cycle changes

Core trade-off: proration makes pricing fair and immediate, but every change injects a positive and negative adjustment line, raising complexity and explainability cost.

[Principle] A customer upgrading mid-cycle from $10/mo to $20/mo can't be charged a full month at the new price. Proration splits at the change point: refund the unused portion of the old plan's remaining time (negative credit), charge the new plan's remaining time (positive). Upgrade halfway through → -$5 (old price unused) + $10 (new price remaining) = net $5. These accumulate as invoice line items, settled next cycle or immediately. The proration factor = remaining seconds / total period seconds — computed by the second, not the day, or boundaries get unfair.

Options compared:
# pseudo-code: upgrade proration (by-second factor)
def prorate_change(sub, new_price, now):
    total = sub.period_end - sub.period_start      # total period seconds
    remain = sub.period_end - now                  # remaining seconds
    assert 0 <= remain <= total                    # change must fall inside the current period
    factor = remain / total
    credit = -round(sub.current_price * factor)    # refund old-price unused (minor units)
    charge =  round(new_price        * factor)     # charge new-price remaining
    add_line(sub.invoice, "unused time credit", credit)    # applied to credit balance, no cash refund
    add_line(sub.invoice, "remaining time @ new", charge)
    sub.current_price = new_price                  # new price applies from now on
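
A runnable variant of the pseudo-code above, reproducing the $10 to $20 mid-cycle upgrade from the text. It is a sketch under two assumptions that the source does not specify: prices are integer minor units (cents), and rounding is half-up done in integer arithmetic. The function name `prorate` and the dates are illustrative.

```python
from datetime import datetime, timezone


def prorate(old_price: int, new_price: int,
            period_start: datetime, period_end: datetime, now: datetime):
    """Return (credit, charge) in minor units for a mid-period price change."""
    total = int((period_end - period_start).total_seconds())
    remain = int((period_end - now).total_seconds())
    if not 0 <= remain <= total:
        raise ValueError("change must fall inside the current period")
    # floor((2a + b) / 2b) == round-half-up of a / b, with no float involved.
    credit = -((old_price * remain * 2 + total) // (2 * total))
    charge = (new_price * remain * 2 + total) // (2 * total)
    return credit, charge


start = datetime(2024, 6, 1, tzinfo=timezone.utc)
end = datetime(2024, 7, 1, tzinfo=timezone.utc)
halfway = datetime(2024, 6, 16, tzinfo=timezone.utc)   # 15 of 30 days elapsed
credit, charge = prorate(1000, 2000, start, end, halfway)
net = credit + charge   # -500 + 1000 = 500 cents, the net $5 from the text
```
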
Real-world cases:

③ Usage Metering — aggregating 1B events/day into a few numbers

Core trade-off: balance accuracy (count every event) against throughput/cost (aggregate then record), plus the risk of late events corrupting a finalized invoice.

[Principle] The metering pipeline has four stages: Emit (app sends usage events) → Ingest (validate + idempotent dedup) → Meter (aggregate into billable quantity) → Invoice (roll into the bill). Writing 1B/day one row at a time blows up the DB; the industrial approach is two-level aggregation: first accumulate in app memory/Redis keyed by (customer, meter, hour), flush to a Postgres rollup table every minute, and at billing time just SUM a few thousand hourly buckets. Aggregation functions are limited: typically only sum / count / last — for "peak concurrency" or "average daily storage" you compute it yourself and report the final number as a last value. Events must carry an idempotency key (event_id) for dedup, or client retries = over-billing.

Options compared:
# pseudo-code: two-level aggregation + idempotent dedup
def ingest(event):
    if not redis.set(f"seen:{event.id}", 1, nx=True, ex=DAY):  # idempotent (dedup window = 1 day)
        return                                  # drop duplicate event
    # Redis hash fields must be strings, so flatten the (customer, meter, hour) key
    field = f"{event.customer}:{event.meter}:{hour_bucket(event.ts)}"
    redis.hincrby("usage", field, event.qty)    # level 1: in-memory accumulate

def flush_every_minute():                        # level 2: to Postgres
    for field, qty in redis.hgetall("usage").items():
        db.upsert_rollup(field, int(qty))        # overwrite the hourly bucket with the running total
# billing: SELECT SUM(qty) ... WHERE customer=? AND meter=? AND ts >= period_start AND ts < period_end
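
To make the dedup and hourly-bucket steps above checkable without Redis or Postgres, here is an in-memory sketch. The `seen_ids` set and `hourly` dict are stand-ins for the Redis keys and the rollup table, and `billable_quantity` is a hypothetical helper playing the role of the billing-time SUM query.

```python
from collections import defaultdict
from datetime import datetime, timezone

seen_ids = set()             # stand-in for the Redis "seen:{id}" keys
hourly = defaultdict(int)    # stand-in for the Redis hash / Postgres rollup


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def ingest(event_id, customer, meter, ts, qty) -> bool:
    """Accumulate one event; return False if it was a duplicate."""
    if event_id in seen_ids:
        return False
    seen_ids.add(event_id)
    hourly[(customer, meter, hour_bucket(ts))] += qty
    return True


def billable_quantity(customer, meter, period_start, period_end) -> int:
    """Sum the hourly buckets that fall in [period_start, period_end)."""
    return sum(
        qty for (c, m, hour), qty in hourly.items()
        if c == customer and m == meter and period_start <= hour < period_end
    )


t = datetime(2024, 6, 1, 10, 15, tzinfo=timezone.utc)
ingest("e1", "cust_a", "api_calls", t, 3)
retry_accepted = ingest("e1", "cust_a", "api_calls", t, 3)   # client retry, same event_id
ingest("e2", "cust_a", "api_calls", t.replace(minute=50), 2)
```

The retried event is rejected, and both accepted events land in the same 10:00 bucket, so billing reads one row instead of two.
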
Real-world cases:

④ Multi-currency & Tax — one engine serving the globe

Core trade-off: local-currency pricing gives the best experience, but introduces FX risk, tax-rule complexity, and a reconciliation-dimension explosion.

[Principle] Two things must be right. Currency: store each price in minor units plus an ISO 4217 currency code (JPY has no decimals and the Kuwaiti dinar has three, so don't assume two); for cross-currency pricing, do not convert at billing time but preset a price per currency, so FX swings don't make the invoice non-reproducible. Tax: whether tax is added on top (exclusive, common in the US) or included in the price (inclusive, as with EU VAT) is determined by region/currency; products carry a tax code that decides rate and exemption. Tax is frozen into the invoice at the rate in effect at billing time; later rate changes don't rewrite historical invoices.
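
A small sketch of both points: converting a display amount to minor units per currency, and splitting an amount into net, tax, and gross under exclusive versus inclusive tax. The exponent table is a four-currency subset for illustration, the tax rates in the usage lines are made-up round numbers, and half-up rounding is an assumption (real rounding rules vary by jurisdiction).

```python
from decimal import Decimal, ROUND_HALF_UP

# Minor-unit exponents for a few currencies (illustrative subset of ISO 4217).
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as '10.00' into integer minor units."""
    scaled = Decimal(amount).scaleb(MINOR_UNIT_EXPONENT[currency])
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimals for {currency}")
    return int(scaled)


def tax_breakdown(amount_minor: int, rate: Decimal, inclusive: bool):
    """Return (net, tax, gross) in minor units; rounding is half-up (assumed)."""
    one = Decimal("1")
    if inclusive:
        gross = amount_minor
        net = int((Decimal(gross) / (1 + rate)).quantize(one, rounding=ROUND_HALF_UP))
        tax = gross - net
    else:
        net = amount_minor
        tax = int((Decimal(net) * rate).quantize(one, rounding=ROUND_HALF_UP))
        gross = net + tax
    return net, tax, gross


usd = to_minor_units("10.00", "USD")    # 1000
jpy = to_minor_units("500", "JPY")      # 500
kwd = to_minor_units("1.250", "KWD")    # 1250
exclusive = tax_breakdown(1000, Decimal("0.08"), inclusive=False)   # tax added on top
inclusive = tax_breakdown(1200, Decimal("0.20"), inclusive=True)    # tax already in price
```

Storing the computed tax amount on the invoice line, rather than recomputing it from the rate later, is what keeps historical invoices stable when rates change.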

Options compared:
Real-world cases:

Scaling & Optimization

Common Pitfalls + Interview Follow-ups

1. Storing money as float. 0.1 + 0.2 ≠ 0.3; cumulative rounding error eventually surfaces as a cents-level break in reconciliation. Always use integer minor units, and remember not all currencies have two decimals.
2. Non-idempotent billing. Re-run cron, redelivered messages, a frantically clicked upgrade — without a (sub_id, period) unique key you double-bill and double-charge. Idempotency is the floor for billing.
3. Proration by day, not by second. Boundary-day attribution makes customers over/underpay, breeding disputes; and you must state the policy: charge immediately on upgrade, credit on downgrade.
4. Mutating existing subscriptions on price change. Prices must be versioned, with existing subscriptions referencing the old version; otherwise one price change rewrites every legacy customer's bill, and you make the news.
5. Stuffing late usage events into a finalized invoice. A finalized invoice should be immutable; late events go to the next period or an explicit credit note, not a silent rewrite of history.
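
Pitfall 4 is the one most easily shown in code. The sketch below is a hypothetical append-only catalog: publishing a new price for a plan creates a new immutable version, and a subscription that pinned the old `price_id` keeps resolving to the old amount. The class and field names are illustrative, not from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    price_id: str
    plan: str
    unit_amount: int   # minor units
    currency: str


class Catalog:
    """Append-only price catalog: published prices are never edited."""

    def __init__(self):
        self._prices = {}    # price_id -> Price
        self._current = {}   # plan -> price_id offered to new subscriptions

    def publish(self, plan: str, unit_amount: int, currency: str) -> Price:
        version = sum(1 for p in self._prices.values() if p.plan == plan) + 1
        price = Price(f"{plan}_v{version}", plan, unit_amount, currency)
        self._prices[price.price_id] = price
        self._current[plan] = price.price_id
        return price

    def get(self, price_id: str) -> Price:
        return self._prices[price_id]

    def current(self, plan: str) -> Price:
        return self._prices[self._current[plan]]


catalog = Catalog()
pinned_price_id = catalog.publish("pro", 1000, "USD").price_id   # existing subscription pins v1
catalog.publish("pro", 1200, "USD")                               # later price increase
```
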

Frequent follow-ups: ① Cron vs per-subscription timer for cycle billing — how do you guarantee no miss, no double? ② Walk me through mid-cycle upgrade proration step by step (by second). ③ How do you design metering for 1B usage events/day, and handle late events? ④ How does annual prepay do revenue recognition, and why separate cash from revenue? ⑤ Inclusive vs exclusive tax, and why not real-time FX for multi-currency?

Further Resources

Going Deeper

1. 100K subscriptions all bill on the 1st and the billing run can't cope. Beyond "add machines," give two structural fixes and their costs.
  • Spread anchor dates: distribute the billing date evenly across days 1–28 by creation date or hash, flattening the spike to roughly 1/28 of its original size. Cost: customers don't share a billing date; enterprise customers doing financial reconciliation may want a unified period, so support a "custom alignment date" exception.
  • Shard parallel + incremental pre-aggregation: shard the billing run by customer_id % N across workers in parallel, and roll up usage incrementally within the cycle so day 1 only does "finishing + finalize" rather than scanning 1B events from scratch. Cost: incremental state must be persisted and consistent, complicating crash recovery.
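
Both fixes above reduce to a stable hash of an identifier. The sketch below assumes string IDs (so `customer_id % N` becomes a hash modulo N) and uses SHA-256 because Python's built-in `hash()` for strings varies between processes; the function names are hypothetical.

```python
import hashlib


def _stable_hash(key: str) -> int:
    """Hash that is identical across processes and restarts."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def billing_anchor_day(subscription_id: str) -> int:
    """Map a subscription to a billing day of month in 1..28."""
    return _stable_hash(subscription_id) % 28 + 1


def shard_for(customer_id: str, num_shards: int) -> int:
    """Pick the billing-run worker shard for a customer."""
    return _stable_hash(customer_id) % num_shards


days = [billing_anchor_day(f"sub_{i}") for i in range(10_000)]
```

A "custom alignment date" exception would simply override the value returned by `billing_anchor_day` for the customers that request it.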

One level deeper: the billing spike propagates to the payment side — 100K invoices charging at once blows the PSP quota and hits issuer rate limits, so spreading billing dates also smooths the downstream charge load, a cross-topic second-order benefit.

2. A customer upgrades→downgrades→upgrades three times within one cycle. What does the invoice look like, and how do you keep it explainable?

Each change generates a pair of ± proration line items computed by the second, so three changes produce 6 adjustment lines (three unused-time credits and three remaining-time charges), and the invoice reads like a churn ledger. Explainability rests on a few things:

  • Each line item carries period_start/end + the associated price_id + the proration factor, so you can reconstruct "why this amount."
  • Negative proration goes to the account credit balance rather than an immediate refund, so the final bill = net amount, avoiding repeated refund fees and cash-out fraud.
  • Offer a preview so the customer sees the delta on each change on the spot (matching the <300ms sync-preview need), reducing after-the-fact complaints.

Pitfall: frequent changes can enable "plan-trial arbitrage" — some systems cap multiple downgrades in one cycle or only settle at cycle end, trading cleanliness for fairness.

3. A usage event arrives 2 hours after the cycle ends (client buffering/network delay). Stuff it into the closed invoice, count it next period, or something else? Where does each break?
  • Reopen the closed invoice: correct attribution, but modifying a finalized/collected invoice violates "invoice immutability," and if already reconciled it creates a break, which regulators dislike.
  • Count it next period: simple to implement, invoice stays immutable; but revenue is mis-attributed (this month's usage billed next month), so RevRec and customer reconciliation won't line up.
  • Grace window + credit note: set a billing grace window (e.g. finalize N hours after period end) so late events within it still count this period; events outside it use an explicit credit/debit note as an auditable adjustment to the historical invoice, rather than secretly editing the original.

This is fundamentally the watermark / late-data problem from Day 20 stream processing projected onto the billing domain: you set a watermark between "how long to wait before finalizing" (completeness) and "how fast to bill" (timeliness), and prepare an auditable compensation channel for out-of-window data.
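
The three options can be combined into one routing rule, sketched below. The 6-hour grace window is an arbitrary assumption (the text leaves N open), and the function and return labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=6)   # assumed grace window; the text leaves N open


def route_usage_event(event_ts, received_at, period_end, grace=GRACE) -> str:
    """Decide where a usage event goes relative to the period ending at period_end."""
    if event_ts >= period_end:
        return "next_period"        # the usage itself happened in the next period
    if received_at < period_end + grace:
        return "current_invoice"    # invoice not finalized yet, still counts
    return "adjustment_note"        # invoice finalized: auditable credit/debit note


period_end = datetime(2024, 7, 1, tzinfo=timezone.utc)
event_ts = period_end - timedelta(minutes=10)
within_grace = route_usage_event(event_ts, period_end + timedelta(hours=2), period_end)
after_grace = route_usage_event(event_ts, period_end + timedelta(hours=8), period_end)
```

The event from the question, arriving 2 hours after the cycle ends, lands on the current invoice under this assumed window; the same event arriving 8 hours late becomes an adjustment note.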

4. You collected $1200 on an annual prepay — why can't you recognize $1200 of revenue this month? What does this require of the data model?

Accounting standards (ASC 606 / IFRS 15) require revenue recognized by service-delivery progress: of the cash collected on annual prepay, the undelivered portion is deferred revenue (a liability), converted to recognized revenue monthly ($100/mo) as service is delivered. The reason: collecting cash isn't earning it — if the customer cancels in month two, over-recognized revenue must be reversed.

What this requires of the system: cash flow and revenue flow must be recorded separately. The Ledger uses double-entry: on collection, debit cash / credit deferred revenue; each month, debit deferred revenue / credit recognized revenue. This is exactly where TigerBeetle-style debit/credit invariants apply — at any moment "cash collected = recognized + to-be-recognized" must hold, with single-sided writes becoming illegal states rather than hidden bugs. Refund/mid-term cancel reverses the remaining deferred revenue + refunds cash, moving both legs.
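
A toy double-entry ledger illustrating the two postings above for the $1200 annual prepay. Account names are hypothetical, amounts are in cents, and balances use a debit-positive sign convention (an assumption of this sketch), so credit-normal accounts such as deferred revenue are read through `credit_balance`.

```python
from collections import defaultdict


class Ledger:
    """Minimal double-entry ledger; every posting moves two accounts at once."""

    def __init__(self):
        self.balances = defaultdict(int)   # debit-positive, minor units

    def post(self, debit: str, credit: str, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balances[debit] += amount
        self.balances[credit] -= amount

    def credit_balance(self, account: str) -> int:
        """Balance of a credit-normal account (liability or revenue)."""
        return -self.balances[account]


ledger = Ledger()
ledger.post(debit="cash", credit="deferred_revenue", amount=120_000)   # collect $1200
for _ in range(3):                                                      # three months delivered
    ledger.post(debit="deferred_revenue", credit="revenue", amount=10_000)

cash = ledger.balances["cash"]
recognized = ledger.credit_balance("revenue")
deferred = ledger.credit_balance("deferred_revenue")
```

Because `post` is the only way to change a balance and always writes both legs, a single-sided write cannot be expressed at all, which is the property the paragraph above asks of the ledger.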

5. Can the billing system and Day 17's payment system share one idempotency key? Where's the boundary?

Not simply — they're idempotent over different layers of operation:

  • Billing idempotency key = (subscription_id, period_start), guaranteeing "this cycle generates only one invoice." It prevents double-billing.
  • Payment idempotency key = each charge's idempotency key (e.g. invoice_id or a one-off UUID), guaranteeing "this invoice is charged once." It prevents double-charging.

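The boundary: one invoice may be charged across multiple attempts (dunning: retry on days 1/3/5). Transport-level retries of a single attempt (timeouts, redelivered messages) should share that attempt's payment idempotency key so the PSP recognizes them as one charge. Separate dunning attempts are a different matter: a PSP that replays the stored response for a reused key (Stripe does this, including for failed requests) would just return the earlier decline, so each attempt needs its own key, for example (invoice_id, attempt_no), with the invoice's paid status guarding against a second successful charge. Each retry is also a legitimate transition of the billing state machine and must not be dropped by billing-layer idempotency as a "duplicate." So the two layers decouple: billing guarantees "no double invoice," payment guarantees "no double charge," and dunning orchestrates retries between them. This is the classic layered-idempotency design.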