DAY 36 / PHASE 4 · SKILLS GOVERNANCE

Governing an Org's Skills Library

Intake Contract · Dedup · Security Gate · Usage-driven Retirement

2026-06-15 · BigCat

A shared skill library's biggest cost isn't storage — it's that each overlapping skill nudges up the odds the agent picks the wrong one.

// WHY THIS MATTERS

When agents move from "one person experimenting" to "the whole org relies on them," every team starts pushing skills into a shared library (a skill is a reusable capability module: a packaged prompt + tools + flow). Six months later you hit a classic failure state: 300 skills, and nobody knows which ones work and which are zombies; three overlapping "send email" skills leave the agent unsure which to pick; some unreviewed skill quietly reads the production DB. This isn't a "too many docs" problem; it's a platform-governance problem stacked on an AI-specific one. Day 4 covered "too many tools → agent selection degrades"; an org-scale skill library amplifies that degradation into a systemic cost. A skill library is essentially an internal capability-package registry (like npm + a service catalog), but with two extra headaches compared with a code repo: its contents are executable and touch data (an attack surface), and its consumer is an agent, not a human (selection is driven by descriptions, not by a human reading docs). Today covers four pieces: intake contracts, dedup, quality & security gates, and usage-driven retirement. Core mantra: the north star isn't "how many skills are in the library," it's "how small and precise is the candidate set the agent faces on any given task."

// 01

Intake & Catalog: a skill is a product with a contract, not something anyone dumps in

Claim: with no intake gate, everyone dumps in, and six months on it's a pile of owner-less orphan skills. Each skill needs a contract: an agent-facing description, an owner, a schema, a security tier, a linked eval.

Background & principle

Treat a skill as an internal product with a contract, not a casually-shared script. Entry passes an intake review, centered on a standardized manifest: a clear description — note it's written for the agent to make a choice, not human docs (back to Day 4: description > name; description quality directly decides whether the agent picks it correctly); an explicit owner (owner-less = orphan); input / output schema; a security tier; and a linked eval. The whole library has a single source of truth (SOT) catalog: one record per skill with status (experimental / active / deprecated). This is Day 35's prompt registry and Day 33's provenance, lifted up to the "capability" layer.

In practice

# skill manifest: the intake contract (in git: reviewable / diffable / rollbackable)
id: send-invoice-email
description: "Email a customer with a PDF invoice attached. For AR collections."  # for agent selection
owner: finance-platform
inputs:  { to: email, invoice_id: str }
outputs: { message_id: str }
risk: high                     # external send + customer data
permissions: [email.send, invoice.read]   # least privilege, see Day 31
eval_set: invoice-email-golden-30
status: active                 # experimental / active / deprecated
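An intake review can check the contract mechanically before a human looks at it. The following is a minimal sketch, assuming the manifest has already been parsed into a Python dict; `validate_manifest`, the three risk tiers, and the 20-character description floor are illustrative assumptions, not part of any existing tool.

```python
REQUIRED_FIELDS = ("id", "description", "owner", "inputs", "outputs",
                   "risk", "permissions", "eval_set", "status")
ALLOWED_STATUS = {"experimental", "active", "deprecated"}  # from the manifest comment
ALLOWED_RISK = {"low", "medium", "high"}   # assumed tiers; the manifest only shows "high"
MIN_DESCRIPTION_CHARS = 20                 # assumed heuristic, tune for your library


def validate_manifest(manifest: dict) -> list[str]:
    """Return intake problems; an empty list means the contract is complete."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in manifest]
    if "owner" in manifest and not str(manifest["owner"]).strip():
        problems.append("owner is blank (orphan skill)")
    if "description" in manifest and len(manifest["description"]) < MIN_DESCRIPTION_CHARS:
        problems.append("description too short to guide agent selection")
    if "status" in manifest and manifest["status"] not in ALLOWED_STATUS:
        problems.append(f"unknown status: {manifest['status']}")
    if "risk" in manifest and manifest["risk"] not in ALLOWED_RISK:
        problems.append(f"unknown risk tier: {manifest['risk']}")
    return problems


# The manifest above, as a parsed dict.
invoice_skill = {
    "id": "send-invoice-email",
    "description": "Email a customer with a PDF invoice attached. For AR collections.",
    "owner": "finance-platform",
    "inputs": {"to": "email", "invoice_id": "str"},
    "outputs": {"message_id": "str"},
    "risk": "high",
    "permissions": ["email.send", "invoice.read"],
    "eval_set": "invoice-email-golden-30",
    "status": "active",
}
print(validate_manifest(invoice_skill))  # an empty list: nothing to fix
```

A length check cannot tell whether a description really helps the agent choose; that still needs the linked eval. The check only rejects manifests that are obviously incomplete.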
Failure modes: (1) No intake, everyone dumps into the shared library — six months on nobody dares touch or delete anything. (2) Description written for humans, not for agent selection — the root cause of the agent picking the wrong skill. (3) No owner — when it breaks / needs an upgrade, nobody's there, and it rots into a zombie.
Going deeper · Anthropic Agent Skills, anthropic.com/news/agent-skills · This site, Day 4 Tool Use (description > name) · Day 35 Prompt-as-Code (same registry root)
// 02

Dedup: skill sprawl poisons the agent's tool selection

Claim: bigger isn't better — it's worse. Each overlapping skill lengthens the candidate set the agent faces and raises the odds of a wrong pick. The north star is a small, precise candidate set, not total skill count.

Background & principle

This is a skill library's most counterintuitive and most AI-specific cost. Day 4 covered "too many tools → selection degrades": three skills all named "send email," and the agent doesn't know which, maybe picking the one without attachments and botching the task. So the core governance move isn't "encourage more contributions" — it's actively controlling overlap: when a new skill enters, compute its semantic similarity against existing ones via embeddings; over threshold, warn "likely duplicate — merge or state the difference"; periodically scan the whole library for capability overlap and merge redundancy. More importantly, don't expose the whole library to the agent — expose a small candidate set retrieved by the current task context (skill RAG / tiering / namespaces). The health metric is "average candidate-set size," not "skill count."

        ┌──────── full library: 300 skills ────────┐
task ─▶ │ semantic retrieval / namespace / filter  │ ─▶ candidate set: 5–8
        └──────────────────────────────────────────┘    (small & precise → right pick)

new skill intake ──▶ embedding similarity vs existing > 0.9 ? ──▶ warn "likely dup" → merge / state difference

health metric = avg candidate-set size ↓, not total count ↑
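The intake-time duplicate check can be sketched as below. To stay self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model (an assumption for illustration only), and `likely_duplicates` is a hypothetical helper name. The 0.9 threshold is the one shown in the diagram and would need tuning against a real model.

```python
import math
from collections import Counter

DUP_THRESHOLD = 0.9  # from the diagram; tune on your own library


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def likely_duplicates(new_description: str, catalog: dict[str, str],
                      threshold: float = DUP_THRESHOLD) -> list[tuple[str, float]]:
    """Return (skill_id, similarity) for existing skills above the threshold."""
    new_vec = embed(new_description)
    scored = ((sid, cosine(new_vec, embed(desc))) for sid, desc in catalog.items())
    hits = [(sid, round(sim, 3)) for sid, sim in scored if sim > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


catalog = {
    "send-email": "send an email to a customer",
    "resize-image": "resize an image to a given width",
}
for skill_id, similarity in likely_duplicates("send an email to a customer", catalog):
    print(f'likely dup of "{skill_id}" ({similarity}): merge or state the difference')
```

The check warns rather than blocks: a hit asks the contributor to merge or to state the difference in the description.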
Failure modes: (1) Taking pride in "many skills" — a KPI of contribution count directly breeds sprawl. (2) Dumping the whole library into the agent's context — the candidate set explodes and selection quality collapses (Day 4). (3) Never merging overlapping skills — three email skills coexist forever and the agent keeps gambling.
Going deeper · Anthropic Building Effective Agents (tool-set design), anthropic.com/engineering/building-effective-agents · This site, Day 4 Tool Use (too-many-tools degradation)
// 03

Quality & Security Gates: skills run code & read data — the library is an attack surface

Claim: a skill isn't a doc, it's an executable capability — it calls APIs, runs code, reads data. One bad skill = every agent in the org that uses it is hit. This is a supply-chain problem.

Background & principle

The risk of a shared skill library is badly underrated: it's a capability supply chain; one skill compromised / miswritten / over-privileged hits every agent that references it (echoing Day 24 injection, Day 31 lethal trifecta). Three gates. One, the quality gate: intake / upgrade must pass an eval (proving it actually does the job on golden tasks) + versioning (changes can't silently affect all callers, see Day 35). Two, the security gate: the manifest declares required permissions and enforces least privilege (Day 31); scan which sensitive data it touches, its egress surface, its injection surface; tag high-risk skills (can transfer money / delete data / send customer info) as risk: high. Three, permissions & visibility: high-risk skills aren't visible to everyone by default, and need extra approval before an agent can call them (into Day 34 HITL). Plus provenance: who published, which version — locate & recall on incident (Day 33).

In practice

def promotion_gate(skill):                 # experimental → active promotion gate
    assert run_eval(skill) >= skill.baseline      # quality: no regress on golden tasks
    assert declared(skill.permissions)            # security: permissions explicit
    assert minimal(skill.permissions)             # least privilege, reject "do-everything skill"
    if skill.risk == "high":
        skill.visibility = "restricted"          # high-risk: restricted + call needs approval
    record_provenance(skill.author, skill.version)   # recallable on incident
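One way to give `minimal(...)` a concrete meaning is to compare the permissions a skill declares with the permissions it actually exercised while running its eval set. This is a sketch under that assumption: it presumes the eval harness records which permissions were used, and both function names and signatures here are hypothetical.

```python
def check_least_privilege(declared: list[str], used_in_eval: set[str]) -> dict[str, list[str]]:
    """Compare declared permissions with those observed during eval runs."""
    declared_set = set(declared)
    return {
        "unused": sorted(declared_set - used_in_eval),      # over-privileged: declared, never needed
        "undeclared": sorted(used_in_eval - declared_set),  # used without being declared
    }


def minimal(declared: list[str], used_in_eval: set[str]) -> bool:
    """True only when the declaration matches observed use exactly."""
    report = check_least_privilege(declared, used_in_eval)
    return not report["unused"] and not report["undeclared"]


declared = ["email.send", "invoice.read", "db.write"]
observed = {"email.send", "invoice.read"}
print(check_least_privilege(declared, observed))
```

A permission that no eval case exercises is either unnecessary or untested, and both are worth surfacing before promotion.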
Failure modes: (1) Skills shipped with no security tier — one over-privileged skill becomes the org's backdoor. (2) No versioning, a change silently affects all callers — one "optimization" drifts the behavior of dozens of agents at once. (3) High-risk skills visible to everyone by default — any agent can trigger a transfer / DB-drop with one sentence.
Going deeper · Anthropic Model Context Protocol (the protocol layer for tools/capabilities), modelcontextprotocol.io · This site, Day 31 Personal AI Safety (least privilege / trifecta) · Day 24 Prompt Injection
// 04

Lifecycle & Discoverability: retire zombies by metrics, let the right skill be found

Claim: a library grows without bound — you need a retirement mechanism. Drive deprecation by usage metrics, by data not human memory — so the library can only get healthier (an extension of Day 33's ratchet).

Background & principle

Without retirement, a library only grows; zombies pile up and sprawl bites selection quality. Usage-driven retirement: each skill records call count, success rate, last-used time; long-term zero calls / persistently low success → auto-mark deprecated → archive. But deprecation must be graceful: a deprecation window + migration guidance, no hard delete (an agent may still use it; hard-delete = production incident). The other side is discoverability — if a good skill isn't found, teams reinvent the wheel, feeding §2's sprawl. So you need naming conventions, tags, semantic search, and "which skill for this task" recommendations. Finally, a health dashboard: active/zombie ratio, duplication rate, average candidate-set size, high-risk count — making library health visible at a glance and governable.

In practice

def lifecycle_sweep(skill):                # run periodically, usage-driven retirement
    if skill.calls_90d == 0 or skill.success_rate < 0.6:
        skill.status = "deprecated"           # mark, don't hard-delete
        notify(skill.owner, "migrate_within: 30d")   # give a migration window

# discoverability: don't let teams rebuild because they can't find it
suggest = search_skills(task_ctx)            # semantic search + "which to use" recommendation
dashboard = { "active/zombie": "180/120", "dup_rate": 0.18, "avg_candidates": 6 }
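The dashboard numbers can be derived from per-skill usage records instead of being typed in by hand. The sketch below assumes definitions the text does not fix: a zombie is a skill with zero calls in 90 days, the duplication rate is the share of skills flagged as a duplicate of another, and candidate-set sizes are logged per task. `SkillStats` and `health_dashboard` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class SkillStats:
    id: str
    calls_90d: int
    risk: str = "low"
    duplicate_of: Optional[str] = None  # set by the dedup scan, if any


def health_dashboard(skills: list[SkillStats], candidate_set_sizes: list[int]) -> dict:
    """Summarize library health from usage records and logged candidate-set sizes."""
    total = len(skills)
    zombies = sum(1 for s in skills if s.calls_90d == 0)
    duplicates = sum(1 for s in skills if s.duplicate_of is not None)
    return {
        "active/zombie": f"{total - zombies}/{zombies}",
        "dup_rate": round(duplicates / total, 2) if total else 0.0,
        "avg_candidates": round(mean(candidate_set_sizes), 1) if candidate_set_sizes else 0,
        "high_risk": sum(1 for s in skills if s.risk == "high"),
    }


skills = [
    SkillStats("send-invoice-email", calls_90d=420, risk="high"),
    SkillStats("send-email-v2", calls_90d=35, duplicate_of="send-invoice-email"),
    SkillStats("resize-image", calls_90d=12),
    SkillStats("legacy-report", calls_90d=0),
]
print(health_dashboard(skills, candidate_set_sizes=[5, 7, 6]))
```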
Failure modes: (1) Only grows, never shrinks — zombies pile up, sprawl bites back, returning to §2's selection degradation. (2) Hard-deleting a skill still in use — breaks live agents, causes incidents. (3) No discovery mechanism — teams that can't find an existing skill build another, a root cause of sprawl.
Going deeper · Spotify Backstage (the software-catalog / ownership-governance paradigm), backstage.io · This site, Day 33 Legacy Governance (ratchet / metric-driven)

// Hands-on · install a governance chassis for your org's skill library

Chain the four into a checklist: even if you're just a small team with a few agents, this prevents "300 skills nobody dares touch" six months from now.

  1. Manifest contract: standardize each skill's manifest — agent-facing description / owner / schema / permissions / risk / eval.
  2. Intake gate: new skills pass intake — complete contract + security scan + dedup check; miss one and it's out.
  3. Dedup sentinel: embeddings scan semantic overlap, warn-and-merge over threshold; watch "average candidate-set size," not total count.
  4. Quality / security gates: promotion passes eval + enforced least privilege + versioning + provenance.
  5. Visibility tiers: high-risk skills restricted, with HITL approval before invocation (Day 34).
  6. Usage retirement: zero calls / low success → deprecated → archive, with a migration window, never hard-delete.
  7. Health dashboard: active/zombie ratio, duplication rate, average candidate set, high-risk count — make health visible.

Once you've built this, evaluating any "we built an agent skill platform" makes you instinctively ask: how do new skills get in, how is overlap prevented, who can call high-risk skills, how are zombies retired — instead of being dazzled by a vanity metric like "we have 300 skills." A skill library's value isn't in being big — it's that the agent can, every time, pick the right one from a small, precise candidate set.

// ENGLISH GLOSSARY

Skill Library / Capability Catalog
An org-shared library of reusable agent capabilities — essentially an internal capability-package registry.
Intake Review
The admission review for a skill, checking contract completeness, security, and non-duplication.
Skill Manifest
A skill's contract: description (agent-facing), owner, schema, permissions, risk, eval.
Skill Sprawl
Unmanaged skill growth — overlap and zombies pile up, poisoning the agent's tool selection.
Semantic Dedup
Detecting capability overlap via embedding similarity, prompting a merge.
Candidate Set
The subset of skills exposed to the agent for a task — smaller & more precise = higher selection quality.
Promotion Gate
The experimental → active gate, with eval + permission + security checks.
Least-privilege Manifest
A skill explicitly declares and is enforced to the minimum permissions.
Usage-driven Deprecation
Auto-marking and archiving zombie skills by call count / success rate.
Discoverability
Letting the right skill be found (naming / tags / semantic search / recommendation), preventing reinvention.

// Deeper Thinking

Dedup merges overlapping skills, but "three email skills" may have subtle differences (one with attachments, one plain-text, one HTML). Blind merging loses capability. How do you tell "redundant" from "legitimate variant"?
The test isn't "do they look alike," it's "can the agent reliably tell from the description which to use." If the three differences can be stated clearly in the description and map to genuinely different task intents (attachments vs plain-text is a real scenario difference) — that's a legitimate variant, but it should merge into one parameterized skill (send_email(format=...)), not three parallel entries, so the agent picks the single "send email" then fills params, and the candidate set doesn't bloat. If the difference is mere implementation detail the agent can't distinguish (two "send email" with near-identical descriptions) — that's redundancy, keep one. Principle: user-perceivable capability difference → a parameter; implementation difference → merge. Never make the agent gamble among "semantically overlapping parallel entries."
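As an illustration of "capability difference → a parameter," the three email variants could collapse into one entry like the sketch below. It uses the standard library's `email.message.EmailMessage` and only builds the message; actual delivery is out of scope. `build_email` and `body_format` are illustrative names standing in for the `send_email(format=...)` idea above.

```python
from email.message import EmailMessage
from typing import Iterable


def build_email(to: str, subject: str, body: str,
                body_format: str = "plain",
                attachments: Iterable[tuple[str, bytes]] = ()) -> EmailMessage:
    """One skill entry covering plain-text, HTML, and with-attachment variants."""
    if body_format not in ("plain", "html"):
        raise ValueError(f"unsupported body_format: {body_format}")
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body, subtype=body_format)   # text/plain or text/html
    for filename, data in attachments:           # attaching makes the message multipart
        msg.add_attachment(data, maintype="application",
                           subtype="octet-stream", filename=filename)
    return msg


invoice_mail = build_email(
    "customer@example.com", "Your invoice", "<p>Invoice attached.</p>",
    body_format="html", attachments=[("invoice.pdf", b"%PDF-1.4")],
)
print(invoice_mail.get_content_type())
```

The agent now picks a single "email" skill and fills in parameters, so the candidate set holds one entry instead of three.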
Won't an intake gate stifle bottom-up innovation? The classic platform tension: too strict and nobody contributes, too loose and you get sprawl. Where's the sweet spot?
Resolve it with tiered intake, not a blanket strict-or-loose rule. Have an experimental zone: an extremely low contribution bar (a manifest is enough to enter), but not in the agent's default candidate set, not visible to everyone, and with no maintenance guarantee, so innovation can happen freely. Only when an experimental skill accrues enough usage and passes the quality + security gates does it get promoted to active and into the main candidate set. Grassroots innovation isn't blocked, and "being callable by the agent by default" is a privilege that's earned. The sweet spot isn't a strictness number; it's decoupling "contributing" from "being promoted to a default capability": lenient on the former, strict on the latter. Existing registries use similar mechanisms: Backstage catalog entities carry a lifecycle field (experimental / production / deprecated), and npm dist-tags let a package publish pre-release versions without making them the default install.
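The decoupling of "contributing" from "being a default capability" can be sketched as two small functions. The usage threshold of 50 calls, the boolean gate flags, and the names `eligible_for_promotion` and `default_candidate_pool` are assumptions for illustration; the sketch requires both usage and passing gates before promotion.

```python
from dataclasses import dataclass

MIN_CALLS_FOR_PROMOTION = 50  # assumed threshold; pick one from real traffic


@dataclass
class Skill:
    id: str
    status: str = "experimental"   # anyone with a manifest can enter here
    calls_90d: int = 0
    eval_passed: bool = False
    security_passed: bool = False


def eligible_for_promotion(skill: Skill) -> bool:
    """Strict on promotion: enough real usage plus both gates."""
    return (skill.status == "experimental"
            and skill.calls_90d >= MIN_CALLS_FOR_PROMOTION
            and skill.eval_passed
            and skill.security_passed)


def default_candidate_pool(skills: list[Skill]) -> list[str]:
    """Only active skills are exposed to agents by default."""
    return [s.id for s in skills if s.status == "active"]


library = [
    Skill("send-invoice-email", status="active", calls_90d=420,
          eval_passed=True, security_passed=True),
    Skill("summarize-contract", calls_90d=80, eval_passed=True, security_passed=True),
    Skill("weekend-hack", calls_90d=3),
]
print(default_candidate_pool(library))
print([s.id for s in library if eligible_for_promotion(s)])
```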
Usage metrics retire zombies — but low-frequency ≠ no-value (a once-a-year but critical compliance skill). Pure frequency-based retirement causes false kills. What then?
Pure frequency is a bad metric: it would cut the "low-frequency, high-value" tail. The fix adds a value dimension on top of frequency: (1) owners can tag a skill critical: true to exempt it from auto-retirement (an annual compliance skill is kept explicitly); (2) retirement is a proposal, not auto-execution: lifecycle_sweep only marks the skill and notifies the owner, and a human confirms (same as Day 34: let regression propose, humans decide); (3) distinguish "zero calls" from "low calls": zero calls across the whole lookback window is likely a zombie, while a once-a-year skill still has call records as long as the window is longer than its cadence (a 90-day window alone would flag it, which is why the critical tag and the human confirmation matter). The core: metrics drive "draw attention," not "auto-delete." Deletion is an irreversible action, so it always keeps a human in the loop (Day 31 reversibility).
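A sweep that proposes instead of executing might look like the sketch below. The 0.6 success floor and the 90-day window come from the earlier `lifecycle_sweep` snippet; the `critical` flag mirrors the `critical: true` tag, and `SkillUsage` and `propose_retirements` are hypothetical names. Nothing here changes a skill's status; the output is a list for owners to confirm.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SUCCESS_RATE = 0.6  # same floor as the lifecycle_sweep snippet


@dataclass
class SkillUsage:
    id: str
    owner: str
    calls_90d: int
    success_rate: Optional[float]  # None when there were no calls to measure
    critical: bool = False         # owner-set exemption from retirement proposals


def propose_retirements(skills: list[SkillUsage]) -> list[dict]:
    """Build retirement proposals for owners; never mutates or deletes a skill."""
    proposals = []
    for s in skills:
        if s.critical:
            continue  # low frequency is not the same as low value
        if s.calls_90d == 0:
            reason = "zero calls in 90 days"
        elif s.success_rate is not None and s.success_rate < MIN_SUCCESS_RATE:
            reason = f"success rate {s.success_rate:.2f} below {MIN_SUCCESS_RATE}"
        else:
            continue
        proposals.append({"skill": s.id, "notify": s.owner, "reason": reason})
    return proposals


usage = [
    SkillUsage("annual-compliance-export", "legal-ops", 0, None, critical=True),
    SkillUsage("legacy-report", "data-platform", 0, None),
    SkillUsage("flaky-scraper", "growth", 40, 0.45),
    SkillUsage("send-invoice-email", "finance-platform", 420, 0.97),
]
for proposal in propose_retirements(usage):
    print(proposal)
```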
The candidate set should be small & precise for the agent, but you don't know in advance which skill the next task needs. How do you pick the subset dynamically without dropping a critical one?
This is fundamentally a retrieval problem, isomorphic to RAG (Day 10): semantically retrieve the library against the task context and take top-k into the candidate set. Suppress missed-recall risk with a few moves: (1) tiering — coarse-filter by namespace / category first (a finance task needn't see image skills), then semantic-retrieve within the domain, both shrinking the set and cutting misses. (2) weight / pin high-value skills — core general skills don't gamble on retrieval, they're fixed in the candidate set. (3) keep an escalation path — if the agent can't find a fit in the small set, it can trigger a "broaden retrieval" instead of failing outright. As with RAG, this is a recall vs precision trade-off (Day 44 retrieval quality): too large a candidate set hurts selection, too small hurts coverage — tune k and tier granularity with real task traffic.
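The three moves (tiering, pinning, escalation) fit in a few lines. In this sketch, `overlap_score` is a toy word-overlap scorer standing in for semantic retrieval, and `candidate_set`, the catalog shape, and the `broaden` flag are all illustrative assumptions.

```python
def overlap_score(task: str, description: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(task.lower().split()) & set(description.lower().split()))


def candidate_set(task: str, namespace: str, catalog: list[dict],
                  pinned: list[str], k: int = 5, broaden: bool = False) -> list[str]:
    """Pinned skills first, then top-k retrieved within the namespace (or all, if broadened)."""
    pool = [s for s in catalog if broaden or s["namespace"] == namespace]
    ranked = sorted(pool, key=lambda s: overlap_score(task, s["description"]), reverse=True)
    retrieved = [s["id"] for s in ranked[:k]]
    return list(dict.fromkeys([*pinned, *retrieved]))  # de-duplicate, keep order


catalog = [
    {"id": "send-invoice-email", "namespace": "finance",
     "description": "email a customer with a pdf invoice attached"},
    {"id": "refund-payment", "namespace": "finance",
     "description": "refund a customer payment"},
    {"id": "resize-image", "namespace": "media",
     "description": "resize an image"},
]
task = "email the invoice to the customer"
print(candidate_set(task, "finance", catalog, pinned=["ask-human"], k=1))
# Escalation path: if nothing in the small set fits, retry with broaden=True.
print(candidate_set(task, "media", catalog, pinned=["ask-human"], k=1, broaden=True))
```

Both `k` and the namespace granularity are the tuning knobs the text mentions, to be adjusted against real task traffic.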
How does this skill governance relate to Day 4 tool use, Day 33/35 governance, Day 18 MCP? Isn't it just an "internal package registry"?
It is indeed the "package registry / service catalog" paradigm (npm + Backstage) migrated onto agent capabilities, but with two AI-specific things that can't be copy-pasted. Re Day 4: Day 4 teaches designing a single tool (description, schema); this issue governs the scale problem of a whole library — however good a single tool, pile up 300 and selection degrades. Re Day 18 MCP: MCP is the protocol & runtime carrier for skills / tools; this issue is the governance layer above it (who gets in, how to select, when to retire). Re Day 33/35: all are artifact governance — ratchet / provenance / risk-tiering / eval gate are shared; the only difference is the object (code / prompt / capability). Two things you can't copy from a classic registry: the consumer is an agent (selection driven by description + candidate-set size, a failure mode npm doesn't have), and entries are executable and touch data (a supply-chain attack surface, riskier than ordinary dependencies). Hold these two and you won't manage a skill library like a plain code repo.

// Further Reading