All notes

Agent Skills are the cheapest ALM your team will ever buy

Agent Skills are the cheapest ALM your team will ever buy

An "agent skill" is a markdown file that tells an AI coding assistant how your team likes to work: when to write a spec, when to break it into tasks, when to run tests, when to stop and ask. The file lives next to your code, the assistant reads it on demand, and the rules survive every model swap. My claim: that folder of markdown is now the cheapest application lifecycle management your team will ever buy, and for a 5-to-50-person engineering org it closes most of the discipline gap that used to require Jira, Confluence, a release manager, and a process consultant.

Key takeaways

  • A 20-person team running seven slash commands ships closer to a 200-engineer org than to its size peer.
  • Skills are the discipline layer above your tools, not a replacement for tool-use or MCP.
  • Anthropic shipped Agent Skills as an open spec; addyosmani/agent-skills is the MIT reference; both are free.
  • Per-seat ALM cost goes from 30 to 80 USD per developer per month to zero, with no governance gap.
  • The tradeoff is supply-chain trust: skills run with model-level privilege, so the PR gate moves to .claude/skills/.

In this article

The discipline gap small teams cannot afford to staff

The 2025 Stack Overflow Developer Survey shows AI tool adoption running far ahead of AI tool trust, and the gap is widening year over year. That is the discipline gap. The model writes the code; nobody has time to write the procedure around it. At Google or Microsoft the procedure lives in a Staff Engineer's head, a release checklist, a TDD culture deck, an SRE runbook, a security review board. At a 35-person company the procedure lives nowhere. The senior engineer becomes the bottleneck, or the AI ships whatever feels right on a Tuesday afternoon.

For most of my Mittelstand clients the answer used to be: buy an ALM platform and hope the workflow rubs off. That bargain never paid out. The platform got configured by the same person who already did not have time for discipline, our teams logged tickets to make the dashboard happy, and the AI tools kept generating code in a parallel universe nobody had ever taught them how the team works.

What changed in the last six months is that the discipline itself became portable. Anthropic published Agent Skills as an open standard in December 2025, with a strict frontmatter contract and progressive disclosure: the agent sees only the name and description until a task matches, then loads the body, then loads bundled files only if it runs them (Equipping agents for the real world with Agent Skills). Addy Osmani's walkthrough on the Microsoft Developer channel at Microsoft Build 2026 (LIVE150) is the cleanest public proof point I have seen: a 48.6k-star MIT repository encoding the entire define-plan-build-verify-review-ship loop as a set of markdown files plus seven slash commands (addyosmani/agent-skills). Both pieces are free.

Monday: clone the repo into a throwaway branch of your main project, point Copilot or Claude Code at it, and ask a junior engineer to ship a single feature using /spec, /plan, /build, /verify, /review in order. Watch where the model pushes back. That pushback is the discipline you have been paying a platform to enforce.

What an agent skill actually is

A skill is a SKILL.md file with two mandatory fields in YAML frontmatter, name (max 64 characters, kebab-case) and description (max 1024 characters), and then a body of instructions. The body can describe a workflow, encode a decision tree, define what "done" looks like, list red flags, or point at runnable scripts. Skills live in .claude/skills/ per project or ~/.claude/skills/ globally and hot-reload mid-session.

In Addy's repo the skills line up with the software development lifecycle. The /spec command triggers a skill that interviews the developer with clarifying questions before writing a line: who is the user, what storage does the browser get, what success looks like for the MVP. The /plan command takes the spec and decomposes it into phases with acceptance criteria and a targeted file list, so the agent knows which files it is allowed to touch. The /build command writes a failing test first, then the minimum implementation, then refactors. The /verify command spins up the dev server and drives a headless browser through the actual UI. The /review skill runs a correctness, security, and simplification pass that catches XSS holes and gratuitous abstraction. The /ship skill handles git hygiene and release notes.

A minimal skill file looks like this:

---
name: test-driven-build
description: Use when implementing a feature in an existing codebase with tests present. Writes a failing test first, then the minimum code to pass, then refactors. Stops to ask if no test framework is detected.
---

# Test-driven build

**When to use**
- The repo has a test runner already configured.
- The change adds a function, endpoint, or component with observable behavior.

**When not to use**
- Throwaway prototypes with a one-hour lifespan.
- Pure config or documentation changes.

**Red flags**
- The agent is about to delete an existing test "to make the build green".
- The agent has not run the test runner at least once before claiming done.

**Verification**
- New test exists and was observed failing before implementation.
- Test runner exits 0 after the change.
- No other tests regressed.

That file is the entire artifact. It is dense, opinionated, and reusable across every project that imports it. The Monday action: pick the one workflow on your team that the senior engineer keeps having to redo by hand (release notes, PR reviews, the way you write tests, the way you scope a ticket) and write that as a single SKILL.md before the end of the day. Commit it to .claude/skills/ in one repo. Use it for a week. Then promote.

Why this is cheaper than an ALM platform

Run the numbers. Azure DevOps Basic is 6 USD per user per month, and the Test Plans add-on, which is the part that buys you the structured test discipline, is 52 USD per user per month (Azure DevOps Services pricing). Atlassian's published Jira and Confluence per-seat tiers stack on top (Standard, Premium, Enterprise rungs broken out in Resources). A 35-person team running Jira Premium plus Confluence plus GitHub Team plus a single Test Plans seat is north of 1,000 USD a month before any consultant touches it. The consultant is the second-order cost, and the configuration drift over the next two years is the third.

addyosmani/agent-skills is MIT-licensed, free, and runs inside the coding assistant the team already pays for. The Anthropic spec is free. The token cost of running a skill is a fraction of the tool-definition tax I will come back to in the next section. The replacement is not perfect: skills do not give you a ticket database, a sprint board, or a compliance report. But they do encode the part of the platform investment that almost always failed to land anyway, which is the discipline that produces the data those dashboards display.

The honest framing is this. An ALM platform is a workflow engine with a database stapled to the side. For a Mittelstand team, the database part is overkill (your GitHub issues are fine, your messaging app already routes work) and the workflow engine never got the cultural buy-in it needed. Skills invert the bargain: free workflow engine that lives where the work happens, no database, no dashboard, no buy-in tax. We keep the lightweight ticketing we already have. Monday: cancel the Test Plans seats and the Confluence renewal that is on the next quarter's renewal cycle, redirect the budget into one engineer-week to write the first dozen skills.

Skills vs MCP: the Ronacher caveat

Now the tradeoff that decides whether this argument survives contact with a real engineering team. Armin Ronacher, Sentry CTO and one of the most credible voices in the Python ecosystem, wrote the line that anchors this whole post in his analysis of skills versus MCP loadouts:

💡 "Skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has." (Skills vs Dynamic MCP Loadouts, Armin Ronacher)

Read that line carefully, because it is the part most launch posts skip. A skill does not give the agent new capability. The agent still acts through bash, file edits, and whatever MCP servers you have wired up. The skill is the layer that decides which of those existing tools to call, in what order, with what gates. Skills are the procedure; MCP is the action surface. The two are complementary, not substitutes.

This matters for the cost story. The often-quoted token comparison cuts in skills' favor only where the team was paying the MCP tool-definition tax in the first place. Anthropic's own analysis observed tool definitions consuming 134K tokens before optimization in some configurations, and Anthropic shipped advanced tool use with deferred loading that reports "an 85% reduction in token usage while maintaining access to full tool libraries, with internal testing showing accuracy improvements from 49% to 74% on Claude Opus 4, and 79.5% to 88.1% on Opus 4.5" (Introducing advanced tool use on the Claude Developer Platform). Those numbers are real, but they describe the gap between a tool-heavy MCP setup and progressive loading. If your team has never wired up a 90-tool MCP server, switching to skills will not cut your token bill, because there was no bill to cut.

What skills DO cut is the discipline gap. The token math is a side-benefit; the procedural encoding is the point. Monday: if you are already running MCP (work IQ, context7, GitHub MCP), keep it. Add skills as the layer above. If you are NOT running MCP yet, do not add it just to put skills on top of it: start with skills directly on the file-edit and bash primitives the assistant already has.

A KMU example: Monday morning at a 35-person Hamburg fintech

Here is the scenario I have actually pitched in the last quarter. The engineering lead at a 35-person Hamburg fintech runs a team of eight engineers shipping a B2B payments product. The stack is GitHub plus GitHub Copilot Business plus Claude Code on the senior engineers' laptops. They have no Jira, no Confluence, no Test Plans. They have a Slack channel, a Notion page for the on-call rotation, and a release-day spreadsheet that the lead maintains by hand. BaFin-regulated, so the audit trail matters.

The lead's Monday morning decision: stop trying to retrofit Jira, write twelve skills instead. Day one: /spec (the clarifying-question skill for new features), /plan (decomposition into reviewable PRs), /audit-log-write (the BaFin-specific skill that flags every change to a money-moving code path and enforces a structured commit-message format the lead can pull from git log at audit time), and /release-notes (auto-drafted from merged PRs with a manual-approval gate). Day two through five: the rest of the SDLC plus a /security-review skill that flags any new dependency, any new endpoint, and any change to the customer-data schema.

Measurable target after one quarter: PR cycle time from 3.2 days to 1.6 days (the lead's existing metric, drawn from GitHub Insights). AI-spend stable, because the Copilot seats were already paid for. Audit-prep time before the next BaFin spot-check: previously two engineer-weeks of trawling commit history, target one engineer-day of running the /audit-log-write skill in reverse against the quarter's commits and exporting the structured log. Avoided cost: roughly 22,000 EUR over twelve months in tooling spend the lead had already committed to in her FY26 budget line, covering Test Plans seats for the eight engineers, Confluence Premium across the company, and the Jira Premium uplift she had been waiting to fund. None of those numbers depend on a model upgrade. They depend on the procedure being encoded as text in the repo, where every engineer and every assistant can see it and follow it. None of our Mittelstand clients will buy a 200-seat ALM platform; all of them can afford a one-week skill-writing sprint led by their existing senior engineer.

Where the thesis loses

I owe the reader the failure modes. The thesis loses in three places.

First, security and supply chain. A skill file is executable instructions running at model-level trust. Anyone who can write to .claude/skills/ can silently change what every assistant in every session does next. For the BaFin-regulated fintech, that is not a footnote. Monday action: treat .claude/skills/ as production code. PR review, two-person sign-off on changes, pin the imported skill repos at a reviewed commit SHA in CI, never git pull skill packs at session start.

Second, governance reporting. Skills give you discipline; they do not give you the dashboard a CTO uses to brief the board. If the business requires "show me velocity, sprint burndown, and cycle time for the quarterly review," a folder of markdown files will not produce that artifact on its own. The honest answer is to keep a thin reporting layer (GitHub Insights, a simple SQL query against the issue tracker, or a single PostHog dashboard) and accept that you are paying for the report, not the workflow. The workflow is the skills; the report is the residual ALM spend.

Third, multi-agent and human heterogeneity. Skills assume the agent reading them is competent enough to follow nuanced markdown. Frontier models are; cheaper models often skip the red-flag section under load. There is also a documented cognitive-debt effect: developers who lean on AI assistance score measurably worse on follow-up comprehension than peers who did the work unaided. The lesson generalizes. Encoding discipline as text does not prevent humans from offloading the decision to the model and skipping the procedure entirely. The skill writes the rule; the team has to read it.

I still think the tradeoff is worth taking for almost every team between 5 and 50 engineers. You give up the dashboard, you accept the supply-chain discipline, you keep the procedural encoding. For most Mittelstand clients I have advised this quarter, that is a strictly cheaper, strictly faster, and strictly more durable answer than the ALM platform they were about to buy.

If you are running this experiment on your own team, or if you tried it and it broke in a way I have not described above, get in touch. I want to compare notes on which skills actually survived contact with a real codebase, and which ones the team quietly stopped using after two weeks.

Resources