Principle 01
Design Docs: Written for the You of Six Months From Now
Design Docs & RFCs — Writing as Thinking, Not Archiving
RFC · Design Doc
Principle + Master's Words
The first reader of a design doc isn't the reviewer — it's you in six months. The doc preserves your judgment at the time, the options you rejected, the "why not" — not a performance of how clever you were at submission.
"The single most important point of a design doc is to make the author reason about the design at the right level of abstraction."
— Malte Ubl《Design Docs at Google》(industrialempathy.com, 2020)
The value is not in the artifact; the value is in forcing the author to reason at the right altitude — not at "what code will we write," not at "what shall we name the company strategy," but at "what tradeoff are we making, and why."
Why it works
Engineers often think "doc = record of decisions already made." Wrong. The real value of a design doc is in the writing itself: vague intuitions become sentences, and half the flaws fall out on the way to the keyboard. This is the engineering version of "writing is thinking."
And the future reader (you, or a new hire) won't ask "what is this" — they'll ask "why didn't we pick B?" That's why Alternatives Considered is the most long-tail-valuable section of the whole doc: it preserves the rejected options and the reasoning, so the next engineer doesn't re-walk a pothole you already mapped.
Context & Goals
Where the problem is; what counts as winning
Non-goals
Explicit "not solving" — the shield against scope creep
Design
A paragraph or two + a diagram. Intent, not code
Alternatives Considered
≥ 3 options, each with "why not" — the section with the longest tail of value
Risks & Open Questions
Known risks + what's still unresolved (named honestly)
Rollout & Rollback
How to ship; how to retreat if it goes wrong
Six-section skeleton. Full ≠ good — but every missing section is where future "strange coincidences" originate.
Revision examples
We will use Kafka for the event pipeline because it is scalable and reliable. (Empty — "scalable / reliable" is true of every candidate, so it says nothing.)
Chose Kafka over Pulsar after benchmarking both at our 10K msg/s peak (3 days each). Pulsar's broker-bookie split adds two ops surfaces we don't have headcount for. Tradeoff accepted: weaker geo-replication — fine through Q3 because all consumers are us-east. Revisit when EU goes live.
This design adopts a microservice architecture to improve system scalability. (Same empty pattern — "improves scalability" tells the reader nothing about why this option, this tradeoff.)
Between "monolith + modular packages" and "three microservices," we chose the latter. The deciding factor: paid-tier and free-tier release cadence have diverged 3x over the last 8 weeks (data: 12 vs 4 deploys); shared-repo merge conflicts are the bottleneck. Rejected third option, "event-sourced rewrite": no one on the team has shipped event sourcing in two years; the cost of a new paradigm outweighs the benefit.
When to use · Common errors
- ✓ Any decision ≥ 2 weeks / 1 person-month · cross-team interface change · architectural baseline migration
- ✗ Skip for: single-file refactors · product micro-copy tweaks — a PR description suffices
- Error 1: Writing only after the decision is made — the doc should precede the call, not archive it
- Error 2: Listing alternatives without the "why not" — losing all long-tail value
- Error 3: Pseudocode or class diagrams in place of design — the reader cannot read intent or tradeoff
- Error 4: No Non-goals — reviewers keep asking "what about X" that was never in scope
Key references
Malte Ubl《Design Docs at Google》industrialempathy.com 2020 · Will Larson《An Elegant Puzzle》Ch. 4 — the leverage of decisions and memos in engineering orgs · Gergely Orosz"How Big Tech Runs Tech Projects" — comparative RFC / TDD practices
This Week's Exercise + Reflection
Exercise: Pick your next RFC. Cut "Implementation Details" by half; expand "Alternatives Considered" to twice its current size. Have a cross-team peer read it first — every "why not X" they ask is an X that belongs in Alternatives.
Reflect: As AI authorship of RFCs rises, who is doing the "reasoning about the design"? Does AI-drafted prose let the author skip the thinking step? What human–AI split preserves the "writing is thinking" return?
Principle 02
API Docs: Four Functions, Four Pages
API Docs & Diátaxis — Four Modes That Cannot Share a Page
API · Diátaxis
Principle + Master's Words
One page cannot simultaneously teach beginners, serve as reference, walk through tasks, and explain design intent. Separate the four — tutorials, how-to guides, reference, explanation — and let each reader find their lane.
"Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, technical reference and explanation. Each of them requires a distinct mode of writing."
— Daniele Procida《Diátaxis》(diataxis.fr · PyCon AU 2017)
From "What nobody tells you about documentation" — the talk that became the framework most modern doc sites (Django, NumPy, GitLab, Cloudflare) eventually adopted.
Why it works
What happens without the split? Beginners stare at the reference; experts skim the tutorial for the field they need; SREs hunting how-to-guides drown in design philosophy. Everyone is annoyed. Diátaxis maps docs across two axes — vertical "study vs work," horizontal "practical vs cognitive." Four quadrants, four jobs.
Practical / Action
Cognitive / Theory
Study
Tutorials
learning-oriented
"Follow me." One happy path, 10 minutes to a working Hello World. Don't explain why.
Explanation
understanding-oriented
"Why is it this way?" Background, tradeoffs, history. For the reader who's up and running and now curious.
Work
How-to guides
problem-oriented
"How do I handle webhook retries?" Recipes for known problems. Quick, specific.
Reference
information-oriented
Full API tables, fields, types, errors. The kind of doc a machine could generate. Cold, complete, storyless.
Four quadrants. Stripe's docs are gold standard because they enforce the split rigorously — and link generously across.
Revision examples
Single "Getting Started" page — 300 lines mixing install, quickstart, full API table, and design philosophy. New users stall on the API table; experts dig through stories to find a field.
Split into four: ① Tutorial ("10 minutes to your first charge," single happy path) · ② How-to ("Handle webhook retries," "Test refunds in sandbox") · ③ Reference (full API + fields, auto-generated) · ④ Explanation ("Why idempotency keys look like this"). The landing page is now a router, not a textbook.
A single markdown blending function signature, usage demo, caveats, and design philosophy. The reader cannot tell where one ends and the next begins.
Signature & demo go to Reference. Caveats go to How-to ("Handling rate-limits"). Design philosophy goes to Explanation. The Reference page links to both — scanners stay; "why" readers click through.
When to use · Common errors
- ✓ Public APIs · SDKs · internal developer platforms · CLI tools
- ✓ Doc-site refactors — Diátaxis is information architecture, not a styling decision
- ✗ Skip for: single-file libraries (one README is enough) · one-off scripts
- Error 1: Cramming all four into the README — the README is a foyer, not the full collection
- Error 2: Tutorials that read like reference — stacking API tables instead of walking the happy path
- Error 3: Reference that reads like a tutorial — story instead of field definitions
- Error 4: Putting Explanation on the front page — most arriving readers want how, not why; Explanation is for those who are up and running
Key references
Daniele Procida《Diátaxis: A systematic framework for technical documentation》diataxis.fr · Stripe API Docs docs.stripe.com — textbook Diátaxis in production · Google《Technical Writing Courses》developers.google.com/tech-writing — free two-stage course
This Week's Exercise + Reflection
Exercise: Pick a doc you've written and tag each section against the four Diátaxis modes. If any mode is at zero, ask: is it truly unnecessary, or did you skip it by default? Missing tutorials or explanation is often why a doc "looks complete but feels stuck" in use.
Reflect: When LLMs become the "intermediate reader" for most APIs (developers querying your API through ChatGPT), should docs be written for humans or for LLMs? Which of the four quadrants do LLMs handle worst — tutorials? explanation?
Principle 03
ADRs: Time-Stamp the Decision
Architecture Decision Records — Immutable Memos for the Future
ADR · Decision Record
Principle + Master's Words
A decision = context + choice + consequences. Drop any one and six months later it reads as a "strange coincidence." An ADR time-stamps the decision — once written, never edited; to change your mind, write a new ADR marked "Supersedes ADR-042."
"We will keep a collection of records for 'architecturally significant' decisions: those that affect the structure, non-functional characteristics, dependencies, interfaces, or construction techniques."
— Michael Nygard《Documenting Architecture Decisions》(Cognitect blog, 2011)
Three pages of blog that changed how a generation of engineering organizations track decisions. ThoughtWorks moved "Lightweight ADRs" to Adopt on its Tech Radar in 2018.
Why it works
Design docs are about design; ADRs are about decisions. They differ. A design doc may evaluate three options; an ADR is the gavel: "We picked A."
What matters in an ADR isn't the format (five fields, that's it) — it's the immutability stance: once written, never edited. This is git-commit thinking, not wiki thinking. If you change your mind, write a new ADR with "Supersedes ADR-042"; the original stays. History isn't rewritten, and learning compounds.
Title
ADR-042: Migrate to Postgres 15
Status
Accepted · 2026-05-10 · authors: bc, jh · superseded by ADR-067 (2027-03)
Context
PG12 reaches EOL 2024-11. Security patches end. PG16 was released 2023-09 (8 months old, low adoption). PG15 has been GA > 2 years and is mature. Our org is conservative on database majors.
Decision
Migrate production and replicas to PG15 (not 16). Two-week window over a Q3 weekend.
Consequences
+ JSONB performance gain · + EOL exit before EU regulatory review · − must rewrite 12 JSONpath queries to PG15-compatible form · − lose access to PG16's improved logical replication; revisit in 12 months.
Revision examples
PR description: "Switch to Postgres 15." (Eighteen months later a new engineer reads the repo: "Why PG15 and not PG16?" — no one can answer.)
ADR-042 with the five fields above. The doc lives at docs/adr/0042-postgres-15.md. Never edited. Eighteen months later the new engineer reads the ADR: decision and context obvious — "they weren't dumb; they were risk-averse."
Wiki page "Database choice" — edited 12 times by 5 people over six months. No one can tell why PG was chosen. Every new hire reopens the same argument.
Same decision migrated to ADR-042 plus immutable git history. The "why PG" question closes permanently — link to the ADR, 5 minutes to read. New hires who disagree write ADR-067 proposing an alternative with counter-evidence. Progress moves along the ADR chain; history isn't lost.
When to use · Common errors
- ✓ Framework · database · protocol · regional-infra choices · company-wide build vs buy · critical-dependency up/down-grades
- ✓ Any decision where the future will ask "why this?"
- ✗ Skip for: day-to-day implementation calls (third-party package, variable name) — too frequent, ADRs become noise
- Error 1: Treating ADRs as wiki — they get edited, no one remembers the original. ADRs are immutable.
- Error 2: Decision but no Context — the reader cannot judge "does this still hold?"
- Error 3: No Consequences — the next engineer doesn't see the tradeoff the author already understood, re-discovers it the hard way
- Error 4: Written too late — backfilling an ADR after execution is "writing the exam after taking it"; the reasoning trace is gone
Key references
Michael Nygard《Documenting Architecture Decisions》Cognitect blog 2011 — the five-field source · ThoughtWorks《Technology Radar》— "Lightweight Architecture Decision Records" (Adopt, 2018) · adr.github.io — templates and community resources
This Week's Exercise + Reflection
Exercise: Take the most important technical decision you've made in the last six months (with or without an ADR) and backfill the five fields. Spend the most time on Context — back at that moment, what didn't you know that you know now? Write that delta.
Reflect: ADR immutability seems to fight organizational learning — people get smarter, knowledge updates. How do you design the relationship between ADRs and their successors so history isn't rewritten but learning still propagates?
Principle 04
Code Comments: WHY in Comments, WHAT in Names
Code Comments — Why & Why-not, Not What
Comments · Philosophy
Principle + Master's Words
Comments explain WHY and WHY-NOT; names explain WHAT. Overlap is noise. When you find yourself adding a line to explain what the code does, first ask: can I rename a variable or extract a function instead?
"The most important reason to write comments is abstraction: comments can describe things that can't be inferred from the code. Without comments, you can't hide complexity."
— John Ousterhout《A Philosophy of Software Design》Ch. 12 (Stanford, 2018)
Pair with Robert C. Martin《Clean Code》Ch. 4: "Don't comment bad code — rewrite it." If you need a paragraph of prose to defend the function, the function is the problem.
Why it works
A three-layer division of labor:
| Dimension | Carrier | Example |
| WHAT it does | Names | retryBudget · archiveUsersBefore() |
| HOW it does it | Code itself | Function body |
| WHY this way | Comments | # 3 because upstream SLA P95 is 800ms |
| WHY NOT another way | Comments | # not async: workers share a db conn pool |
| Invariants | Comments | # list must stay sorted — bisect downstream |
| Contracts | Docstrings | "Returns timestamp in UTC ms; raises TimeoutError after 5s" |
One under-used role of comments: mark future invariants — e.g. # this list must stay sorted — binary search downstream. A documented constraint that warns the next engineer: "check before you change this."
Revision examples
Comment restates code — zero information:
# increment i
i += 1
Comment explains where "3" comes from:
# retry budget is 3 — beyond this upstream
# (payments-svc) gives up, so we should too
i += 1
Comment is a vague fragment; magic numbers hardcoded:
# fix for legacy users created before 2018
if user.created_at < datetime(2018, 1, 1):
user.role = "legacy"
Extract constants to carry the WHAT; the comment points to an ADR:
# Pre-cutoff users predate the role-schema migration
# in ADR-019; default to LEGACY so they don't get
# write-locked by new RBAC checks.
if user.created_at < LEGACY_USER_CUTOFF:
user.role = ROLE_LEGACY
Comment doing the work of a name:
// loop through users
for (const u of users) { ... }
Delete the comment — for (const u of users) is self-evident. A short form of Martin's rule: "Don't comment obvious code — delete the comment."
When to use · Common errors
- ✓ Any non-obvious "why this number" / "why not that approach" / "future readers must be careful here"
- ✓ Public APIs / SDKs / cross-team interfaces — contracts must live in docstrings (pre/post-conditions, exceptions, units)
- ✗ Skip for: restating code · decorative banners · version numbers (git already tracks) · authorship (lives in LICENSE)
- Error 1: Comments doing the WHAT job — rename or extract a function instead
- Error 2: Lying comments — code changed, comment didn't. A stale comment is worse than no comment because it actively misleads
- Error 3: JSDoc/docstrings as decoration — type-only, no semantics; the reader still doesn't know units or failure modes
- Error 4: A paragraph of prose excusing a tangled function — if the function needs prose, the function is what to fix
Key references
John Ousterhout《A Philosophy of Software Design》(2018, 2nd ed. 2021) Ch. 12–15 — the core text on comments and abstraction · Robert C. Martin《Clean Code》Ch. 4 — the 17 rules of comments, controversial but useful · Steve McConnell《Code Complete》Ch. 32 — self-documenting code
This Week's Exercise + Reflection
Exercise: In your most recent PR, apply a three-way test to every comment: (a) restates code → delete; (b) explains WHY → keep; (c) names could replace it → rename and delete. Bonus: find a comment from three months ago and check whether it still matches the code — update or delete. This is the most under-rated piece of comment hygiene.
Reflect: When AI can instantly generate "what this code does" comments, the marginal cost of WHAT-comments approaches zero. Is that a feature or a bug? Can AI generate WHY-comments — or is WHY forever the human author's territory?