A shared skill library's biggest cost isn't storage — it's that each overlapping skill nudges up the odds the agent picks the wrong one.
When agents move from one person's experiment to something the whole org uses, every team starts pushing skills into a shared library (a skill is a reusable capability module: a packaged prompt, tools, and flow). Six months later you hit a classic failure state: 300 skills, and nobody knows which ones work and which are zombies; three overlapping "send email" skills leave the agent unsure which to pick; some unreviewed skill quietly reads the production DB. This isn't a "too many docs" problem. It's a platform-governance problem stacked on an AI-specific one: Day 4 covered "too many tools → agent selection degrades," and an org-scale skill library amplifies that degradation into a systemic cost.
A skill library is essentially an internal registry of capability packages (like npm plus a service catalog), but with two extra headaches compared with a code repo: its contents are executable and touch data, so it is an attack surface; and its consumer is an agent, not a human, so selection is driven by descriptions rather than by someone reading the docs. Today covers four pieces: intake contracts, dedup, quality and security gates, and usage-driven retirement. The core mantra: the north star isn't "how many skills are in the library" but "how small and precise is the candidate set the agent faces on any given task."
Treat a skill as an internal product with a contract, not a casually shared script. Entry goes through an intake review centered on a standardized manifest: a clear description, written for the agent to choose by rather than as human documentation (back to Day 4: description > name, and description quality directly decides whether the agent picks the skill correctly); an explicit owner (a skill without an owner is an orphan); an input / output schema; a security tier; and a linked eval. The whole library has a single source-of-truth (SOT) catalog: one record per skill, with a status (experimental / active / deprecated). This is Day 35's prompt registry and Day 33's provenance, lifted up to the "capability" layer.
# skill manifest: the intake contract (in git: reviewable / diffable / rollbackable)
id: send-invoice-email
description: "Email a customer with a PDF invoice attached. For AR collections." # for agent selection
owner: finance-platform
inputs: { to: email, invoice_id: str }
outputs: { message_id: str }
risk: high # external send + customer data
permissions: [email.send, invoice.read] # least privilege, see Day 31
eval_set: invoice-email-golden-30
status: active # experimental / active / deprecated
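The intake review above can be partly automated. The following is a minimal sketch, assuming the manifest has already been parsed into a plain dict; `validate_manifest`, `REQUIRED_FIELDS`, and the five-word description heuristic are hypothetical names and thresholds introduced for illustration, not part of any existing tool.

```python
# Field names mirror the example manifest; everything else here is an assumption.
REQUIRED_FIELDS = ("id", "description", "owner", "inputs", "outputs",
                   "risk", "permissions", "eval_set", "status")
ALLOWED_STATUS = {"experimental", "active", "deprecated"}


def validate_manifest(manifest: dict) -> list:
    """Return a list of intake problems; an empty list means the manifest passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if manifest.get(name) in (None, "")]
    status = manifest.get("status")
    if status and status not in ALLOWED_STATUS:
        problems.append(f"unknown status: {status}")
    # Hypothetical heuristic: a very short description gives the agent little to choose on.
    description = str(manifest.get("description") or "")
    if description and len(description.split()) < 5:
        problems.append("description too short for agent selection")
    return problems


manifest = {
    "id": "send-invoice-email",
    "description": "Email a customer with a PDF invoice attached. For AR collections.",
    "owner": "finance-platform",
    "inputs": {"to": "email", "invoice_id": "str"},
    "outputs": {"message_id": "str"},
    "risk": "high",
    "permissions": ["email.send", "invoice.read"],
    "eval_set": "invoice-email-golden-30",
    "status": "active",
}
print(validate_manifest(manifest))
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in a single pass, which suits a manifest that lives in git and is fixed through review.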
Overlap is a skill library's most counterintuitive and most AI-specific cost. Day 4 covered "too many tools → selection degrades": with three skills all named "send email," the agent can't tell which to use, and may pick the one without attachment support and botch the task. So the core governance move isn't "encourage more contributions"; it's actively controlling overlap. When a new skill enters, compute its semantic similarity against existing skills via embeddings; over a threshold, warn "likely duplicate: merge or state the difference." Periodically scan the whole library for capability overlap and merge redundant skills. More importantly, don't expose the whole library to the agent; expose a small candidate set retrieved from the current task context (skill RAG / tiering / namespaces). The health metric is "average candidate-set size," not "skill count."
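Both moves, the duplicate warning at intake and the small per-task candidate set, come down to ranking descriptions by similarity. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `0.7` threshold, the function names, and the sample library are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def likely_duplicates(new_desc: str, library: dict, threshold: float = 0.7) -> list:
    """Ids of existing skills whose description is at or over the threshold."""
    new_vec = embed(new_desc)
    return [sid for sid, desc in library.items()
            if cosine(new_vec, embed(desc)) >= threshold]


def candidate_set(task: str, library: dict, k: int = 3) -> list:
    """Top-k skills by similarity to the task; only these are shown to the agent."""
    task_vec = embed(task)
    ranked = sorted(library, key=lambda sid: cosine(task_vec, embed(library[sid])),
                    reverse=True)
    return ranked[:k]


library = {
    "send-invoice-email": "Email a customer with a PDF invoice attached",
    "create-jira-ticket": "Open a Jira ticket for an engineering bug",
    "query-sales-report": "Run a read-only query against the sales report warehouse",
}
dupes = likely_duplicates("Email a customer a PDF invoice as an attachment", library)
shortlist = candidate_set("email the overdue invoice to the customer", library, k=1)
print(dupes, shortlist)
```

With a real embedding model only `embed` would change. The threshold would need tuning against skill pairs the team has already judged to be duplicates or distinct.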
The risk of a shared skill library is badly underrated. It's a capability supply chain: one skill that is compromised, miswritten, or over-privileged hits every agent that references it (echoing Day 24's injection and Day 31's lethal trifecta). There are three gates. One, the quality gate: intake and upgrades must pass an eval (proving the skill actually does the job on golden tasks) and be versioned (so changes can't silently affect all callers; see Day 35). Two, the security gate: the manifest declares required permissions and least privilege is enforced (Day 31); scan which sensitive data the skill touches, its egress surface, and its injection surface; tag high-risk skills (those that can transfer money, delete data, or send customer info) as risk: high. Three, permissions and visibility: high-risk skills aren't visible to everyone by default, and an agent needs extra approval before calling them (this feeds into Day 34's HITL). Add provenance on top: who published which version, so you can locate and recall a skill after an incident (Day 33).
def promotion_gate(skill):  # experimental → active promotion gate
    assert run_eval(skill) >= skill.baseline  # quality: no regress on golden tasks
    assert declared(skill.permissions)  # security: permissions explicit
    assert minimal(skill.permissions)  # least privilege, reject "do-everything skill"
    if skill.risk == "high":
        skill.visibility = "restricted"  # high-risk: restricted + call needs approval
    record_provenance(skill.author, skill.version)  # recallable on incident
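The third gate, restricted visibility plus approval at call time, applies after promotion rather than during it. A minimal sketch of how it might look, assuming a per-agent set of granted skill ids and a named human approver; `Skill`, `visible_skills`, `authorize_call`, and `ApprovalRequired` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Skill:
    id: str
    risk: str                    # "low" or "high", as declared in the manifest
    visibility: str = "public"   # "public" or "restricted"


class ApprovalRequired(Exception):
    """Raised when a high-risk skill is called without a recorded approver."""


def visible_skills(skills: list, agent_grants: set) -> list:
    """Restricted skills only appear for agents that were explicitly granted them."""
    return [s for s in skills
            if s.visibility != "restricted" or s.id in agent_grants]


def authorize_call(skill: Skill, approved_by: Optional[str] = None) -> str:
    """Let low-risk calls through; high-risk calls need a named human approver."""
    if skill.risk == "high" and not approved_by:
        raise ApprovalRequired(f"{skill.id} needs human approval before it runs")
    return f"{skill.id}: allowed"


lookup = Skill("lookup-order", risk="low")
refund = Skill("issue-refund", risk="high", visibility="restricted")
print([s.id for s in visible_skills([lookup, refund], agent_grants=set())])
print(authorize_call(refund, approved_by="finance-oncall"))
```

Keeping visibility and call approval as two separate checks means an agent that was granted a restricted skill still cannot run it without a human in the loop.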
Without retirement, a library only grows; zombies pile up and sprawl eats into selection quality. Usage-driven retirement: each skill records call count, success rate, and last-used time; long-term zero calls or a persistently low success rate → auto-mark deprecated → archive. But deprecation must be graceful: a deprecation window plus migration guidance, and no hard delete (an agent may still depend on the skill, so a hard delete becomes a production incident). The other side is discoverability: if a good skill can't be found, teams reinvent the wheel, feeding §2's sprawl. So you need naming conventions, tags, semantic search, and "which skill for this task" recommendations. Finally, a health dashboard: active/zombie ratio, duplication rate, average candidate-set size, and high-risk count, so library health is visible at a glance and governable.
def lifecycle_sweep(skill):  # run periodically, usage-driven retirement
    if skill.calls_90d == 0 or skill.success_rate < 0.6:
        skill.status = "deprecated"  # mark, don't hard-delete
        notify(skill.owner, "migrate_within: 30d")  # give a migration window

# discoverability: don't let teams rebuild because they can't find it
suggest = search_skills(task_ctx)  # semantic search + "which to use" recommendation
dashboard = { "active/zombie": "180/120", "dup_rate": 0.18, "avg_candidates": 6 }
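The dashboard numbers can be derived from usage records instead of being typed in by hand. A rough sketch, assuming each skill has a 90-day call count and an optional `duplicate_of` marker set by the overlap scan; `SkillStats`, `health_dashboard`, and the sample data are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillStats:
    id: str
    risk: str                           # "low" or "high"
    calls_90d: int
    duplicate_of: Optional[str] = None  # set by the overlap scan, if any


def health_dashboard(stats: list, candidate_set_sizes: list) -> dict:
    """Summarize library health; candidate_set_sizes holds one size per observed task."""
    total = len(stats)
    zombies = sum(1 for s in stats if s.calls_90d == 0)
    dupes = sum(1 for s in stats if s.duplicate_of is not None)
    avg = sum(candidate_set_sizes) / len(candidate_set_sizes) if candidate_set_sizes else 0.0
    return {
        "active/zombie": f"{total - zombies}/{zombies}",
        "dup_rate": round(dupes / total, 2) if total else 0.0,
        "avg_candidates": round(avg, 1),
        "high_risk": sum(1 for s in stats if s.risk == "high"),
    }


stats = [
    SkillStats("send-invoice-email", "high", 12),
    SkillStats("issue-refund", "high", 0),
    SkillStats("send-email-v2", "low", 3, duplicate_of="send-invoice-email"),
    SkillStats("lookup-order", "low", 7),
]
board = health_dashboard(stats, candidate_set_sizes=[4, 6, 8])
print(board)
```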
Chain the four into a checklist: even if you're just a small team with a few agents, this prevents "300 skills nobody dares touch" six months from now.
Once you've built this, any "we built an agent skill platform" pitch makes you instinctively ask: how do new skills get in, how is overlap prevented, who can call high-risk skills, and how are zombies retired? You stop being dazzled by a vanity metric like "we have 300 skills." A skill library's value isn't in being big; it's that the agent can, every time, pick the right one from a small, precise candidate set.
If the difference is something the user can perceive, model it as a parameter on one skill (for example send_email(format=...)), not three parallel entries, so the agent picks the single "send email" and then fills in the parameters, and the candidate set doesn't bloat. If the difference is mere implementation detail the agent can't distinguish (two "send email" skills with near-identical descriptions), that's redundancy: keep one. Principle: a user-perceivable capability difference becomes a parameter; an implementation difference gets merged. Never make the agent gamble among semantically overlapping parallel entries.
Set up an experimental zone: an extremely low contribution bar (a manifest is enough to enter), but not in the agent's default candidate set, not visible to everyone, and with no maintenance guarantee, so innovation can happen freely. Only when an experimental skill accrues enough usage and passes the quality and security gates does it get promoted to active and into the main candidate set. Grassroots innovation isn't blocked, and "being callable by the agent by default" is a privilege that's earned. The sweet spot isn't a strictness number; it's decoupling "contributing" from "being promoted to a default capability": lenient on the former, strict on the latter. Backstage's catalog lifecycle field (experimental / production / deprecated) and npm dist-tags follow a similar pattern.
(1) Let the owner mark a skill critical: true to exempt it from auto-retirement (an annual compliance skill kept explicitly); (2) retirement is a proposal, not auto-execution: lifecycle_sweep only marks and notifies the owner, with a human confirming (same as Day 34: let regression propose, humans decide); (3) distinguish "zero calls" from "low calls": truly zero in 90 days is likely a zombie, while a once-a-year skill at least has call records and won't trip. The core: metrics drive "draw attention," not "auto-delete." A deletion, being an irreversible action, always keeps a human in the loop (Day 31 reversibility).
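The three safeguards for low-frequency skills can be folded into the sweep so that it only ever produces proposals. A minimal sketch, assuming a `critical` flag on the skill record and reusing the 0.6 success threshold from the earlier sweep; `SkillUsage` and `retirement_proposal` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillUsage:
    id: str
    owner: str
    calls_90d: int
    success_rate: Optional[float]   # None when there were no calls to measure
    critical: bool = False          # owner-set exemption from retirement


def retirement_proposal(skill: SkillUsage, min_success: float = 0.6) -> Optional[dict]:
    """Return a proposal for the owner to confirm, or None; never changes status itself."""
    if skill.critical:
        return None   # explicitly kept, e.g. an annual compliance skill
    if skill.calls_90d == 0:
        reason = "zero calls in 90 days"
    elif skill.success_rate is not None and skill.success_rate < min_success:
        reason = f"success rate {skill.success_rate:.0%} below {min_success:.0%}"
    else:
        return None
    return {"skill": skill.id, "notify": skill.owner,
            "reason": reason, "needs_human_confirm": True}


audit = SkillUsage("annual-audit-export", "compliance", 0, None, critical=True)
zombie = SkillUsage("legacy-fax-sender", "ops", 0, None)
flaky = SkillUsage("scrape-vendor-portal", "procurement", 40, 0.4)
print(retirement_proposal(audit), retirement_proposal(zombie), retirement_proposal(flaky))
```

Because the function returns data instead of mutating `status`, the irreversible step stays with whoever confirms the proposal.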