Design a SaaS billing platform (think Stripe Billing / Chargebee): 100K business subscriptions, hybrid pricing = monthly/annual fixed fee (seat-based) + usage-based charges (API calls, storage GB). Customers upgrade/downgrade and add/remove seats at any time, across 30+ countries, in multiple currencies, with automatic tax. This is the upstream of Day 17 payments: payments solve "how to safely collect a charge," billing solves "how much to collect this month, from whom, and why."
Key constraints to clarify in an interview:
Component roles: ① Pricing Catalog defines products, prices, plans (versioned, so price changes don't touch existing subscriptions); ② Subscription Service holds each subscription's lifecycle state machine; ③ Metering dedupes and aggregates massive usage events into billable quantities; ④ Billing Engine is the core — at cycle end it rolls "fixed fee + usage + proration adjustments + tax" into one invoice; ⑤ the invoice goes to Day 17's idempotent payment for collection, failures enter Dunning; ⑥ all amounts land in the Ledger for revenue recognition and reconciliation.
Core trade-off: an explicit persisted state machine replaces if-else scattered everywhere and the risk of double-billing.
[Principle] A subscription isn't a boolean "active/inactive" but a state machine: trialing → active → past_due → canceled / unpaid. Transitions are driven by two event types: time (cycle end = time to invoice) and actions (upgrade, cancel, payment success/failure). On-time billing has two trigger styles: a scan cron (every minute query next_billing_at <= now) or a scheduled job registered per subscription. The key is idempotency: use (subscription_id, period_start) as a unique key for "generate invoice for period P," so a re-run cron only hits the existing invoice and never double-bills.
A status column + transition log is auditable and replayable; cobbling state together from multiple booleans produces illegal combinations (both trialing and past_due) and eventual bugs.
# pseudo-code: idempotent cycle billing (cron every minute)
def bill_due_subscriptions(now):
    for sub in db.query("status IN ('active','past_due') AND next_billing_at <= %s", now):
        period = (sub.current_period_start, sub.current_period_end)
        # unique-key dedup on (subscription_id, period_start): one invoice per period
        inv = db.get_or_create_invoice(sub.id, period[0])
        if inv.status == 'draft':                # not issued yet -> build and issue
            add_recurring_lines(inv, sub)        # fixed fee / seats
            add_metered_lines(inv, sub, period)  # usage (see point ③)
            finalize_and_charge(inv)             # finalize -> hand to payment
        # already issued (e.g. crash before advancing) -> skip the rebuild, still advance
        sub.advance_period()                     # advance to next cycle
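The explicit state machine described above can be sketched as a transition table plus an append-only transition log. This is a minimal in-memory sketch under stated assumptions, not a definitive implementation: the event names (`trial_ended`, `payment_failed`, `retries_exhausted`, and so on), the `ALLOWED` table, and the `Subscription` class are hypothetical; the notes themselves only name the five states.

```python
from dataclasses import dataclass, field

# Hypothetical transition table: (current_state, event) -> next_state.
# Anything not listed here is an illegal transition.
ALLOWED = {
    ("trialing", "trial_ended"): "active",
    ("trialing", "canceled"): "canceled",
    ("active", "payment_failed"): "past_due",
    ("active", "canceled"): "canceled",
    ("past_due", "payment_succeeded"): "active",
    ("past_due", "retries_exhausted"): "unpaid",
    ("past_due", "canceled"): "canceled",
}


class IllegalTransition(Exception):
    pass


@dataclass
class Subscription:
    id: str
    status: str = "trialing"
    log: list = field(default_factory=list)  # append-only audit trail

    def apply(self, event: str) -> str:
        key = (self.status, event)
        if key not in ALLOWED:
            raise IllegalTransition(f"{self.status} cannot handle {event}")
        new_status = ALLOWED[key]
        # Record (from, event, to) so the history can be audited and replayed.
        self.log.append((self.status, event, new_status))
        self.status = new_status
        return new_status


sub = Subscription("sub_1")
sub.apply("trial_ended")
sub.apply("payment_failed")
print(sub.status)  # past_due
```

Because there is a single `status` field, a combination such as "trialing and past_due" cannot be represented at all, and an unexpected event fails loudly instead of silently corrupting state.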
trialing / active / past_due / canceled / unpaid), one invoice per cycle, billing_cycle_anchor controls the billing date, deliberately staggering customers onto different days.
Core trade-off: proration makes pricing fair and immediate, but every change injects a positive and negative adjustment line, raising complexity and explainability cost.
[Principle] A customer upgrading mid-cycle from $10/mo to $20/mo can't be charged a full month at the new price. Proration splits at the change point: refund the unused portion of the old plan's remaining time (negative credit), charge the new plan's remaining time (positive). Upgrade halfway through → -$5 (old price unused) + $10 (new price remaining) = net $5. These accumulate as invoice line items, settled next cycle or immediately. The proration factor = remaining seconds / total period seconds — computed by the second, not the day, or boundaries get unfair.
# pseudo-code: upgrade proration (by-second factor)
def prorate_change(sub, new_price, now):
    total = sub.period_end - sub.period_start          # total period seconds
    remain = sub.period_end - now                      # remaining seconds
    factor = remain / total
    credit = -round(sub.current_price * factor)        # refund old-price unused (minor units)
    charge = round(new_price * factor)                 # charge new-price remaining
    add_line(sub.invoice, "unused time credit", credit)
    add_line(sub.invoice, "remaining time @ new", charge)
    sub.current_price = new_price                      # balance credit applied, no cash refund
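The upgrade proration above can also be written as a self-contained function that stays in integer arithmetic. This is a sketch under assumptions: prices are integer minor units, timestamps are epoch seconds, and rounding is half-up (a substitution for the `round()` call in the pseudo-code, which in Python rounds half to even). The flat argument list is hypothetical and replaces the `sub` object.

```python
def prorate_change(old_price: int, new_price: int,
                   period_start: int, period_end: int, now: int) -> tuple[int, int]:
    """Return (credit, charge) in minor units for a mid-cycle price change.

    credit is negative (unused time on the old price), charge is positive
    (remaining time on the new price). Timestamps are epoch seconds.
    """
    if period_end <= period_start or not (period_start <= now <= period_end):
        raise ValueError("now must fall inside a non-empty period")
    total = period_end - period_start   # total period seconds
    remain = period_end - now           # remaining seconds

    def scaled(price: int) -> int:
        # price * remain / total, rounded half up, without touching floats
        return (price * remain * 2 + total) // (2 * total)

    return -scaled(old_price), scaled(new_price)


# $10/mo -> $20/mo exactly halfway through a 30-day period.
credit, charge = prorate_change(1000, 2000, 0, 30 * 86_400, 15 * 86_400)
print(credit, charge, credit + charge)  # -500 1000 500
```

The result matches the worked example in the text: a $5 credit, a $10 charge, net $5.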
preview the delta first (matching our <300ms sync-preview need).
Core trade-off: balance accuracy (count every event) against throughput/cost (aggregate then record), plus the risk of late events corrupting a finalized invoice.
[Principle] The metering pipeline has four stages: Emit (app sends usage events) → Ingest (validate + idempotent dedup) → Meter (aggregate into billable quantity) → Invoice (roll into the bill). Writing 1B events/day one row at a time overwhelms the DB; the industrial approach is two-level aggregation: first accumulate in app memory/Redis keyed by (customer, meter, hour), flush to a Postgres rollup table every minute, and at billing time just SUM a few thousand hourly buckets. Aggregation functions are limited, typically to sum / count / last; for "peak concurrency" or "average daily storage" you compute the value yourself and report the final number as a last value. Events must carry an idempotency key (event_id) for dedup, otherwise client retries turn into over-billing.
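# pseudo-code: two-level aggregation + idempotent dedup
DAY = 86_400  # seconds; dedup window for event ids
def ingest(event):
    # SET NX succeeds only for the first event carrying this id
    if not redis.set(f"seen:{event.id}", 1, nx=True, ex=DAY):  # idempotent
        return                                                 # drop duplicate event
    # hash fields must be strings, so serialize the (customer, meter, hour) key
    field = f"{event.customer}|{event.meter}|{hour_bucket(event.ts)}"
    redis.hincrby("usage", field, event.qty)                   # level 1: in-memory accumulate
def flush_every_minute():                                      # level 2: to Postgres
    # hgetall returns a dict; assumes a client configured to decode responses to str
    for field, qty in redis.hgetall("usage").items():
        customer, meter, hour = field.split("|")
        # qty is the bucket's running total, so the upsert must overwrite, not add
        db.upsert_rollup((customer, meter, hour), int(qty))    # hourly bucket
# billing: SELECT SUM(qty) ... WHERE customer=? AND ts >= period_start AND ts < period_end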
sum/count/last into the billing period; the docs recommend accumulating in the app and flushing one aggregated event periodically for high-frequency cases, to avoid hitting API limits and quotas.
Core trade-off: local-currency pricing gives the best experience, but introduces FX risk, tax-rule complexity, and a reconciliation-dimension explosion.
[Principle] Two things must be right. Currency: store each price as an integer in minor units plus an ISO 4217 currency code (JPY has no decimal places, the Kuwaiti dinar has three, so don't assume two); for multi-currency pricing, don't convert at billing time but preset a price per currency, so FX swings can't make an invoice non-reproducible. Tax: whether tax is added on top (exclusive, common in the US) or included in the displayed price (inclusive, EU VAT) is determined by region/currency; products carry a tax code that decides rate and exemption. Tax is frozen into the invoice at the rate in effect at billing time; later rate changes don't rewrite historical invoices.
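The currency and tax rules above can be sketched with integer arithmetic only. This is a minimal illustration under assumptions: the `MINOR_UNIT_DIGITS` table covers just four currencies, rates are expressed in basis points, rounding is half-up, and the helper names `format_amount` and `split_tax` are hypothetical. A real system would take rates and rounding rules from a tax engine.

```python
# Assumed subset of ISO 4217 minor-unit digits.
MINOR_UNIT_DIGITS = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def format_amount(minor: int, currency: str) -> str:
    """Render an integer minor-unit amount using the currency's own exponent."""
    digits = MINOR_UNIT_DIGITS[currency]
    if digits == 0:
        return f"{minor} {currency}"
    sign = "-" if minor < 0 else ""
    whole, frac = divmod(abs(minor), 10 ** digits)
    return f"{sign}{whole}.{frac:0{digits}d} {currency}"


def split_tax(amount_minor: int, rate_bps: int, inclusive: bool) -> tuple[int, int, int]:
    """Return (net, tax, gross) in minor units; rate_bps=2000 means 20%."""
    if inclusive:
        # The listed price already contains tax: tax = gross * r / (1 + r).
        gross = amount_minor
        den = 10_000 + rate_bps
        tax = (gross * rate_bps * 2 + den) // (2 * den)   # round half up
        net = gross - tax
    else:
        # Tax is added on top of the listed price: tax = net * r.
        net = amount_minor
        tax = (net * rate_bps * 2 + 10_000) // 20_000     # round half up
        gross = net + tax
    return net, tax, gross


print(split_tax(1000, 825, inclusive=False))  # (1000, 83, 1083)
print(split_tax(1200, 2000, inclusive=True))  # (1000, 200, 1200)
print(format_amount(1500, "KWD"))             # 1.500 KWD
```

Storing the `rate_bps` that was actually applied on the invoice line, rather than looking it up again later, is one way to keep the invoice frozen when rates change.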
unpaid. This is the key revenue-recovery lever for SaaS.
0.1 + 0.2 ≠ 0.3; cumulative rounding error eventually surfaces as a cents-level break in reconciliation. Always use integer minor units, and remember not all currencies have two decimals.
(sub_id, period) unique key you double-bill and double-charge. Idempotency is the floor for billing.
Frequent follow-ups: ① Cron vs per-subscription timer for cycle billing: how do you guarantee no miss, no double? ② Walk me through mid-cycle upgrade proration step by step (by second). ③ How do you design metering for 1B usage events/day, and handle late events? ④ How does annual prepay do revenue recognition, and why separate cash from revenue? ⑤ Inclusive vs exclusive tax, and why not real-time FX for multi-currency?
customer_id % N across workers in parallel, and roll up usage incrementally within the cycle so day 1 only does "finishing + finalize" rather than scanning 1B events from scratch. Cost: incremental state must be persisted and consistent, complicating crash recovery.
One level deeper: the billing spike propagates to the payment side. 100K invoices charging at once blows the PSP quota and hits issuer rate limits, so spreading billing dates also smooths the downstream charge load, a cross-topic second-order benefit.
Each change generates a pair of ± proration line items, computed by the second, so three changes produce 6 adjustment lines (three unused-time credits and three remaining-time charges), and the invoice reads like a churn ledger. Explainability rests on a few things:
preview so the customer sees the delta on each change on the spot (matching the <300ms sync-preview need), reducing after-the-fact complaints.
Pitfall: frequent changes can enable "plan-trial arbitrage"; some systems cap multiple downgrades in one cycle or only settle at cycle end, trading cleanliness for fairness.
This is fundamentally the watermark / late-data problem from Day 20 stream processing projected onto the billing domain: you set a watermark between "how long to wait before finalizing" (completeness) and "how fast to bill" (timeliness), and prepare an auditable compensation channel for out-of-window data.
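The watermark idea can be sketched as a small routing function. This is an illustration under assumptions, not a prescribed design: the grace window `grace_seconds`, the three route names, and the `UsageEvent` shape are hypothetical, and the choice to bill late events as an adjustment line on a later invoice is one possible compensation channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    event_id: str
    ts: int    # event time, epoch seconds
    qty: int


def route_event(event: UsageEvent, period_end: int,
                grace_seconds: int, arrived_at: int) -> str:
    """Decide where an event lands relative to the period ending at period_end.

    'next'       : event time belongs to the following period
    'current'    : arrived before the watermark, counts toward this invoice
    'adjustment' : arrived after the invoice was finalized, so it is recorded
                   as an auditable adjustment instead of rewriting the invoice
    """
    if event.ts >= period_end:
        return "next"
    watermark = period_end + grace_seconds  # the invoice finalizes at this time
    return "current" if arrived_at < watermark else "adjustment"


PERIOD_END, GRACE = 1_000, 100
on_time = UsageEvent("e1", ts=900, qty=5)
print(route_event(on_time, PERIOD_END, GRACE, arrived_at=1_050))  # current
print(route_event(on_time, PERIOD_END, GRACE, arrived_at=1_200))  # adjustment
```

A longer grace window improves completeness at the cost of delaying every invoice by the same amount, which is the trade-off the text describes.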
Accounting standards (ASC 606 / IFRS 15) require revenue recognized by service-delivery progress: of the cash collected on annual prepay, the undelivered portion is deferred revenue (a liability), converted to recognized revenue monthly ($100/mo) as service is delivered. The reason: collecting cash isn't earning it — if the customer cancels in month two, over-recognized revenue must be reversed.
What this requires of the system: cash flow and revenue flow must be recorded separately. The Ledger uses double-entry: on collection, debit cash / credit deferred revenue; each month, debit deferred revenue / credit recognized revenue. This is exactly where TigerBeetle-style debit/credit invariants apply — at any moment "cash collected = recognized + to-be-recognized" must hold, with single-sided writes becoming illegal states rather than hidden bugs. Refund/mid-term cancel reverses the remaining deferred revenue + refunds cash, moving both legs.
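The double-entry flow above can be sketched with a toy ledger. This is a minimal in-memory illustration under assumptions: the account names, the `Ledger` class, and the $1,200 annual prepay (implied by $100/mo) are hypothetical, and it does not reflect the API of any particular ledger product.

```python
from collections import defaultdict


class Ledger:
    """Toy double-entry ledger; amounts are integer minor units (cents)."""

    def __init__(self):
        self.balances = defaultdict(int)  # signed: debits add, credits subtract
        self.entries = []                 # append-only journal

    def post(self, debit: str, credit: str, amount: int, memo: str) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Both legs are written together, so a single-sided entry cannot exist.
        self.balances[debit] += amount
        self.balances[credit] -= amount
        self.entries.append((debit, credit, amount, memo))

    def credit_balance(self, account: str) -> int:
        return -self.balances[account]

    def check_invariant(self) -> None:
        # cash collected = recognized revenue + deferred (to-be-recognized) revenue
        cash = self.balances["cash"]
        recognized = self.credit_balance("recognized_revenue")
        deferred = self.credit_balance("deferred_revenue")
        assert cash == recognized + deferred


ledger = Ledger()
ledger.post("cash", "deferred_revenue", 120_000, "annual prepay collected")
for month in range(1, 4):
    ledger.post("deferred_revenue", "recognized_revenue", 10_000, f"recognize month {month}")
ledger.check_invariant()

# Cancel after month 3: reverse the remaining deferred revenue and refund the cash.
remaining = ledger.credit_balance("deferred_revenue")
ledger.post("deferred_revenue", "cash", remaining, "cancel: refund undelivered months")
ledger.check_invariant()
print(ledger.balances["cash"], ledger.credit_balance("recognized_revenue"))  # 30000 30000
```

After the cancellation, both cash retained and revenue recognized equal three months of service, and the deferred balance is zero.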
Not simply — they're idempotent over different layers of operation:
(subscription_id, period_start), guaranteeing "this cycle generates only one invoice." It prevents double-billing.
The boundary: one invoice may be charged across multiple attempts (dunning: retry on days 1/3/5). Every attempt must be tied to that invoice's payment idempotency key so the PSP can never collect the same invoice twice; where a PSP caches the first response per key (declines included), that usually means deriving a per-attempt key from the invoice, such as invoice_id plus attempt number, and checking the invoice's paid status before each attempt. Each retry is still a legitimate transition of the billing state machine and must not be dropped by billing-layer idempotency as a "duplicate." So the two layers decouple: billing guarantees "no double invoice," payment guarantees "no double charge," and dunning orchestrates retries between them. This is the classic layered-idempotency design.