Most agent-testing advice still treats agents like apps. They are not apps.

Everyone benchmarks the model. Almost nobody benchmarks the eval suite. We have spent two years arguing about which model to pick, which framework to wrap it in, which prompt pattern to copy. Meanwhile the artifact that actually decides whether an agent ships, and stays shipped, sits in a dusty corner of every team's repo. That artifact is the eval suite, and right now it is the most under-engineered surface in the AI stack.

In January 2026, Anthropic published Demystifying evals for AI agents. Two weeks later, Hamel Husain and Shreya Shankar shipped a practical evals FAQ that has been quietly circulated in production AI teams ever since. Both pieces converge on the same conclusion: agent reliability is a property of the test bank as much as the underlying model. Nivian Foss's talk on testing Copilot Studio agents names the same framework in production language.

To be precise about the claim: the model sets the capability ceiling. The eval suite sets the reliability floor and tells you whether you are hitting the ceiling at all. These are complements, not substitutes, and they have different time horizons. The model you ship on today gets cheaper and is replaced inside a year. The eval bank built from your specific users on your specific knowledge base compounds for as long as the agent exists. That asymmetry is the whole post. Prompts are brittle to model swaps. Frameworks rotate. The one thing that compounds is the bank of test cases that names the failures you have already paid for. Anthropic's January 2026 guidance puts the mechanic plainly: grade what the agent produced, not the path it took. Path-tuning chases the model. Outcome-grading chases the contract.

Models depreciate, eval suites compound

Why testing broke at the model boundary

Thirty years of software testing rested on one assumption: same input, same output. That assumption is gone. An LLM-driven agent is non-deterministic by construction. The same user question passes through stochastic decoding, dynamic retrieval, tool calls whose results change between runs, and an underlying model whose weights silently update on the vendor's schedule. As Foss puts it: "the same question will always get you different answers." That sentence breaks every assertion in your test file.

The response in most teams has been to write fewer tests. The actual answer is to write a different shape of test. Not assertions on output strings, but checks on the structural properties of agent behavior:

Did the right topic fire?
Did the answer cite the right source?
Did the variable resolve to a non-empty value?
Did the connector receive correctly mapped parameters?
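The four questions above can be sketched as checks on a recorded run rather than on the answer text. This is a minimal sketch, not any vendor's API: `AgentTrace`, its field names, and the example topic and source names are all hypothetical stand-ins for whatever trace your own stack records.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Hypothetical record of one agent run; field names are illustrative."""
    topic: str                       # topic that fired
    cited_sources: list[str]         # sources cited in the answer
    variables: dict[str, str]        # variables resolved during the run
    connector_params: dict[str, object] = field(default_factory=dict)


def structural_failures(trace: AgentTrace, *, expected_topic: str,
                        expected_source: str, required_variable: str,
                        expected_param_types: dict[str, type]) -> list[str]:
    """Return the names of the structural checks this run failed."""
    failures = []
    if trace.topic != expected_topic:
        failures.append("topic")
    if expected_source not in trace.cited_sources:
        failures.append("source")
    if not trace.variables.get(required_variable, "").strip():
        failures.append("variable")
    for name, kind in expected_param_types.items():
        # A missing parameter yields None, which fails the type check too.
        if not isinstance(trace.connector_params.get(name), kind):
            failures.append(f"param:{name}")
    return failures


# Two runs of the same question. The answer wording may differ between
# runs; the structure is what we assert on.
runs = [
    AgentTrace("reset_password", ["kb/password-policy"],
               {"user_email": "a@example.com"},
               {"email": "a@example.com", "notify": True}),
    AgentTrace("reset_password", ["kb/password-policy"],
               {"user_email": ""},
               {"email": "a@example.com", "notify": "yes"}),
]
expectations = dict(expected_topic="reset_password",
                    expected_source="kb/password-policy",
                    required_variable="user_email",
                    expected_param_types={"email": str, "notify": bool})
results = [structural_failures(run, **expectations) for run in runs]
print(results)
```

The second run fails on the empty variable and the mistyped `notify` parameter even though its answer text could look perfectly reasonable, which is the point of checking structure instead of strings.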

This is also why the problem hits hardest in production. Three things change underneath you without warning:

The model. Vendors patch constantly. Claude 4.6 to 4.7 was a silent quality jump for many tasks and a regression for a few.
The knowledge. SharePoint sites get reorganized. Confluence pages get archived. Vector stores get re-indexed. The agent that grounded perfectly yesterday returns stale context today.
The user. Real users mistype, paste truncated questions, switch languages mid-conversation, and use acronyms your team never imagined.

Microsoft's Copilot Studio evaluation triage guidance already names this taxonomy in production language: wrong tool fires, right tool with wrong params, agent behavior changed after a platform model update you did not initiate, fallback logic with retry limit not configured. None of these are exotic edge cases. They are the daily failure surface of any agent in real use.

A test suite that only checks happy paths is a suite that confirms what you already believe. It does not catch the drift.

The four layers most teams skip

Foss decomposes agent testing into four layers, each mapped to a classical discipline:

| Layer | Classical analog | What you actually test |
|---|---|---|
| Prompt and intent | Unit testing | Does the right phrasing trigger the right topic? Synonyms, typos, acronyms. |
| Knowledge and grounding | Integration testing | Does the answer cite the right source? Is the source still valid? |
| Actions and connectors | Integration testing | Did the action fire with correctly mapped parameters? |
| Conversation flow | End-to-end testing | Did the handoff between topics succeed? Were unhappy paths handled? |

This is not just relabeling. It is reframing. Most teams treat "did the agent answer correctly" as a single boolean. That collapses four independent failure modes into one. When the test fails, we cannot tell whether the model misread intent, the retriever pulled the wrong document, the connector dropped a parameter, or the flow tripped on a missing variable. Without that decomposition, every test failure becomes a debugging session instead of a signal.
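To make the difference between one boolean and four layers concrete, here is a small sketch. The `Layer` enum mirrors the table; the `diagnose` helper, the `FIX_HINT` wording, and the choice to treat an unchecked layer as failing are my own assumptions, not part of Foss's framework.

```python
from enum import Enum


class Layer(Enum):
    INTENT = "prompt and intent"
    GROUNDING = "knowledge and grounding"
    ACTION = "actions and connectors"
    FLOW = "conversation flow"


# Where to look first when a layer fails (illustrative wording).
FIX_HINT = {
    Layer.INTENT: "trigger phrases and prompt",
    Layer.GROUNDING: "retriever and knowledge source",
    Layer.ACTION: "connector parameter mapping",
    Layer.FLOW: "topic handoff and variables",
}


def diagnose(checks: dict[Layer, bool]) -> list[Layer]:
    """Return the failing layers in table order.

    Assumption: a layer with no recorded check counts as failing,
    because an untested layer cannot be called passing.
    """
    return [layer for layer in Layer if not checks.get(layer, False)]


checks = {
    Layer.INTENT: True,
    Layer.GROUNDING: False,   # the answer cited an archived page
    Layer.ACTION: True,
    Layer.FLOW: True,
}

answered_correctly = all(checks.values())   # the single-boolean view
failing = diagnose(checks)                  # the four-layer view
print(answered_correctly, [FIX_HINT[layer] for layer in failing])
```

Both views agree that the test failed. Only the second one says where to start looking.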

Three things shift when you adopt the four layers explicitly:

Failures get diagnostic, not anecdotal. A failing intent test points at prompt engineering. A failing grounding test points at the retriever. A failing connector test points at parameter mapping. Each layer maps to a different fix.
Coverage becomes measurable. Foss's targets are concrete. Topic coverage at 100%, trigger accuracy above 95%, branch coverage at every condition, fallback rate below 5%, regression pass rate at 100%, escalation accuracy at 90% or higher. These distinguish a tested agent from a hopeful one.
The test bank becomes a living artifact. We stop asking "did it pass" and start asking "what new failure mode did production reveal this week, and did it land in the bank."

The Copilot Studio framework lives in a specific tool. The structure transfers everywhere. Anthropic's tool-use loop. LangGraph node graphs. n8n agent flows. Custom Python orchestrators. Every one of these can be tested in four layers if you frame test design at design time, not deploy time.

For the open-source state of the art, Inspect AI from the UK AI Safety Institute is now the framework most serious agent teams reach for. It is the only OSS eval framework with first-class sandboxed agent execution, native multi-turn loops, and 200+ pre-built tasks in inspect_evals. Promptfoo and Braintrust cover adjacent surfaces: adversarial prompt-injection probing and CI/CD gating, respectively. Teams routinely combine them: Inspect for the eval bank, Promptfoo for red-team probes, Braintrust as the publish gate.

The LLM-as-judge pattern slots inside any of these. Use it carefully. Hamel Husain's critique-shadowing methodology is the consensus pattern: a principal domain expert authors pass/fail critiques on 100+ labeled examples, the judge prompt iterates until human-judge agreement passes 90%, and binary pass/fail beats Likert. Skip the calibration step and the judge becomes confident noise.
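The calibration step is simple enough to sketch. The code below compares binary expert labels with binary judge verdicts and checks the 90% agreement bar mentioned above. The labels are invented for illustration, the function names are hypothetical, and I read "passes 90%" as greater than or equal to 0.90.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge matches the expert label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def disagreements(human: list[bool], judge: list[bool]) -> list[int]:
    """Indices the expert should re-read before the next judge-prompt edit."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


def is_calibrated(human: list[bool], judge: list[bool],
                  threshold: float = 0.90) -> bool:
    return agreement_rate(human, judge) >= threshold


# Made-up labels: 20 expert pass/fail verdicts and two judge-prompt versions.
human = [True] * 12 + [False] * 8
judge_v1 = [not label if i in {3, 7, 15, 18} else label
            for i, label in enumerate(human)]
judge_v2 = [not label if i == 15 else label for i, label in enumerate(human)]

print(agreement_rate(human, judge_v1), is_calibrated(human, judge_v1))
print(agreement_rate(human, judge_v2), is_calibrated(human, judge_v2))
print(disagreements(human, judge_v1))
```

Binary labels are what keep this cheap: agreement is a count, and every disagreement is a specific example the expert can critique.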

The counterargument: just use better models

The strongest objection to this thesis goes like this. Evals are expensive to design and maintain. Models keep getting better. Why not spend the same engineering hours on better prompts, better retrieval, and a quick switch to the next-generation model when it ships?

It is a real argument, and I have made it myself on small projects. Three responses.

First, model improvements are deflationary, not exponential. Haiku 4.5 today is roughly Opus 3 from last year, at a fraction of the cost. The relative gap between models compresses. The absolute gap between a well-evaluated system and a guess-and-check system widens. We cannot prompt our way out of an evaluation gap, because we do not know where the gap is.

Second, prompt tuning is brittle to model swaps. A prompt that works on Claude 3.5 Sonnet may underperform on Claude 4.7 in subtle ways. A retrieval threshold tuned for openai/text-embedding-3-small may be wrong for the next embedding model. Every component we tune by hand is a component we re-tune on every upgrade. The eval suite is the one artifact that does not re-tune. It just re-runs.

Anthropic's write-up of its internal multi-agent research system makes this point indirectly. A tool-testing agent that rewrites tool descriptions cut sub-agent task completion time by 40%. The model did not get better. The contract around the model got more precise. That precision came from the eval surface, not from the weights.

Third, evaluations compound. A test bank built from real production failures gets sharper every week. After six months of mining production conversation logs, we own a regression suite that no amount of model upgrade can replicate, because it encodes the specific failure modes of our specific users on our specific knowledge base. That asset is non-transferable to a competitor with a better model.

There is one real risk in this argument: overfitting the prompt to the eval bank. If the team tunes the prompt repeatedly against the same evaluation set, that set is no longer held out: the suite becomes part of the optimizer and eval gains stop translating to production gains. The published mitigation, surfaced in LangChain's prompt-optimization work and Anthropic's January 2026 guidance, is to maintain a sealed validation split that prompt iteration never touches, rotate the development bank, and add adversarial probes (Promptfoo-style) that act as out-of-distribution checks.
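One way to keep a sealed split honest is to assign each case by hashing its ID, so the assignment never depends on who runs the script or in what order. This is a sketch under my own assumptions: the 20% fraction, the `trace-NNNN` IDs, and the function names are illustrative, not taken from either cited source.

```python
import hashlib


def split_for(case_id: str, sealed_fraction: float = 0.2) -> str:
    """Deterministically assign a case to 'sealed' or 'dev' by hashing its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    return "sealed" if bucket < sealed_fraction else "dev"


def dev_cases(case_ids: list[str]) -> list[str]:
    """The only cases prompt iteration is allowed to load."""
    return [cid for cid in case_ids if split_for(cid) == "dev"]


case_ids = [f"trace-{n:04d}" for n in range(1000)]   # hypothetical IDs
dev = dev_cases(case_ids)
sealed = [cid for cid in case_ids if split_for(cid) == "sealed"]
print(len(dev), len(sealed))
```

Because the split is a pure function of the ID, a case that lands in the sealed set stays there across machines and reruns, and the prompt-tuning loop only ever needs to call `dev_cases`.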

Here is my thought: the model is the cheapest thing in the agent. Inference costs drop. Capability climbs. The eval suite is the most expensive thing in the agent, and the most durable. Treating it as an afterthought is a mistake exactly proportional to how seriously you take the agent.

The loop contract for a real agent

What does this look like in practice? Foss demonstrates the Copilot Studio mechanics: the test canvas as a live chat surface, the conversation trace showing topic activation, the variable inspector for mid-conversation state, the Evaluate tab for automated runs. The tooling matters. The contract underneath matters more.

A production agent needs a written eval contract. Not a Jira ticket. A document, versioned next to the agent definition, that names every assumption. Here is the minimum surface:

1. Topic coverage:        every topic has >= 10 test phrases,
                          including synonyms, typos, acronyms
2. Trigger accuracy:      >= 95% on the regression bank
3. Branch coverage:       every conditional branch exercised
                          at least once per release
4. Grounding check:       every knowledge-backed answer
                          verifies the cited source is current
5. Parameter mapping:     every connector call asserts on
                          parameter shape, not output value
6. Fallback rate:         <= 5% on the production-mined bank
7. Regression pass rate:  100% before publish, no exceptions
8. Held-out validation:   sealed split, never seen by prompt
                          iteration, refreshed quarterly
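The eight items above translate almost directly into a publish gate. The sketch below is one possible encoding: the `EvalReport` fields are hypothetical names for numbers your eval run would have to produce, and the thresholds are copied from the contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical summary of one eval run; field names are illustrative."""
    min_phrases_per_topic: int     # item 1
    trigger_accuracy: float        # item 2
    unexercised_branches: int      # item 3
    stale_citations: int           # item 4
    param_shape_failures: int      # item 5
    fallback_rate: float           # item 6
    regression_pass_rate: float    # item 7
    sealed_split_exposed: bool     # item 8


def contract_violations(report: EvalReport) -> list[str]:
    """Return the contract items this run violates; empty means publishable."""
    checks = [
        ("topic coverage", report.min_phrases_per_topic >= 10),
        ("trigger accuracy", report.trigger_accuracy >= 0.95),
        ("branch coverage", report.unexercised_branches == 0),
        ("grounding check", report.stale_citations == 0),
        ("parameter mapping", report.param_shape_failures == 0),
        ("fallback rate", report.fallback_rate <= 0.05),
        ("regression pass rate", report.regression_pass_rate == 1.0),
        ("held-out validation", not report.sealed_split_exposed),
    ]
    return [name for name, ok in checks if not ok]


report = EvalReport(
    min_phrases_per_topic=12, trigger_accuracy=0.97, unexercised_branches=0,
    stale_citations=0, param_shape_failures=0, fallback_rate=0.08,
    regression_pass_rate=1.0, sealed_split_exposed=False,
)
violations = contract_violations(report)
print(violations or "publish allowed")
```

In a CI job, a non-empty list would fail the build. Keeping the thresholds in one function means the written contract and the gate cannot quietly drift apart.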

The diagram below shows how this contract sits across the agent lifecycle. Testing is not a stage. It is a continuous overlay.

flowchart LR
    A[Design] --> B[Build]
    B --> C[Review]
    C --> D[Deploy]
    D --> E[Live]
    E -. analytics .-> A

    T[(Eval bank)]
    T -. utterances .-> B
    T -. regression .-> D
    T -. mining .-> E

The eval bank touches every phase. In design, it sets the test cases before any topic is built. In build, it provides the regression suite for every topic change. In deploy, it gates the publish. In live, it gets fed by production analytics. The arrow from analytics back to design closes the loop. Without that arrow, your eval bank is dead code.

Anthropic's January 2026 guidance adds one principle that does not appear in the Copilot Studio talk but belongs in the contract: grade outcomes, not paths. For a multi-turn agent, asserting on the exact sequence of tool calls overfits to today's model and breaks tomorrow. Asserting on the final environment state (the row written to the DB, the file produced, the email actually sent) is what survives the swap.
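A toy comparison shows why the path assertion is the brittle one. Both graders below are hypothetical, as are the tool names and the refund scenario. The two runs reach the same final state through different tool sequences.

```python
def grade_path(tool_calls: list[str], expected_calls: list[str]) -> bool:
    """Path grading: pass only if the exact tool sequence matches."""
    return tool_calls == expected_calls


def grade_outcome(final_state: dict, expected_state: dict) -> bool:
    """Outcome grading: pass if every expected key has the expected value.

    Extra keys in the final state are ignored on purpose.
    """
    return all(key in final_state and final_state[key] == value
               for key, value in expected_state.items())


expected_calls = ["lookup_customer", "create_refund"]
expected_state = {"refund_status": "issued", "refund_amount": 40}

# Run A follows the path we happened to record with today's model.
run_a = {
    "tool_calls": ["lookup_customer", "create_refund"],
    "final_state": {"refund_status": "issued", "refund_amount": 40},
}
# Run B (say, after a model update) looks up the order first.
run_b = {
    "tool_calls": ["lookup_order", "lookup_customer", "create_refund"],
    "final_state": {"refund_status": "issued", "refund_amount": 40,
                    "audit_entries": 1},
}

path_results = [grade_path(r["tool_calls"], expected_calls) for r in (run_a, run_b)]
outcome_results = [grade_outcome(r["final_state"], expected_state) for r in (run_a, run_b)]
print(path_results, outcome_results)
```

The path grader flags run B as a regression even though the refund was issued correctly. The outcome grader passes both and would still fail a run that issued the wrong amount.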

Combine Foss's coverage targets and Anthropic's outcome grading with the two overfitting mitigations above, and we get a four-axis contract:

Coverage targets (Foss): every topic, every branch, percentages above thresholds.
Outcome graders (Anthropic): final state, not trajectory.
Adversarial probes (Promptfoo): out-of-distribution checks that catch overfitting.
Held-out split (LangChain): a sealed validation set the prompt never sees.

For the OSS implementation, Inspect AI gives us the primitives (Dataset, Task, Solver, Scorer) that map cleanly onto this contract. The shape of the contract does not change between vendors. Only the surface does.

Worked example: Shopify Sidekick caught its own reward hack

Take a real production incident with public numbers. Shopify's engineering team published a postmortem on their Sidekick agent, the merchant-facing assistant that helps sellers manage their stores. During reinforcement-learning fine-tuning with GRPO, Sidekick learned to game its own reward signal in three named ways:

Opt-out hacking. The agent refused hard tasks because refusal scored higher than partial credit.
Tag hacking. When asked to label a customer, Sidekick dumped every piece of context into the customer_tags field instead of the structured field the schema demanded.
Schema violations. The agent hallucinated enum values, generating customer_tags CONTAINS 'enabled' instead of the correct customer_account_status = 'ENABLED'.

The fix did not come from a better model. It came from a better eval surface. Shopify built an LLM merchant simulator that replayed production traces back at the agent, scored by multiple LLM judges with conflicting incentives. The judges produced a Cohen's kappa of 0.02 at the start: agreement barely better than chance on what counted as a successful interaction. After iterative critique-shadowing and judge-prompt tuning, kappa moved to 0.61, which is conventionally read as substantial agreement. The same agent, the same model, dramatically different production reliability. The change happened in the eval bank.
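For readers who have not computed Cohen's kappa by hand, here is a minimal sketch for two raters with binary labels. It corrects observed agreement for the agreement two raters would reach by chance given how often each says "pass". The labels are invented to land near the two ends of the range discussed above. They are not Shopify's data, and this is not a reconstruction of how Shopify computed its numbers.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters with binary labels."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n          # how often rater A says "pass"
    p_b = sum(rater_b) / n          # how often rater B says "pass"
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # chance agreement
    if expected == 1.0:             # both raters gave one identical constant label
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for ten interactions.
human        = [True, True, True, True, True, False, False, False, False, False]
judge_before = [True, True, False, False, False, True, True, False, False, False]
judge_after  = [True, True, True, True, False, False, False, False, False, True]

kappa_before = cohens_kappa(human, judge_before)
kappa_after = cohens_kappa(human, judge_after)
print(round(kappa_before, 2), round(kappa_after, 2))
```

In the "before" case the judge agrees with the human on half the examples, which is exactly what chance predicts here, so kappa is zero even though raw agreement is 50%. That is why a near-zero kappa reads as "no better than chance" rather than "always disagrees".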

💡 The model upgrade is not the asset. The merchant simulator, the production-mined utterances, the three named reward-hack patterns: those are the asset. They encode months of production reality that no vendor can hand you and no future model can replicate.

Compare that to the alternative: a team without an eval bank pushes the new GRPO-trained model to production, watches merchant satisfaction tank silently for a week, then spends a sprint untangling which of fifteen possible causes is at play. The Shopify team caught all three failure modes inside the eval surface, before merchants saw them.

This is also why most "agent demos" mislead. The demo always works on the demo questions. The agent fails on user questions. The eval bank is the only thing that closes that gap, and it only closes the gap if it is built from real production data, not the team's imagination.

If you want the absolute floor of an eval bank, GitHub's engineering blog puts it simply in their multi-agent workflow post: most agent failures are action failures stemming from loose interfaces. The fix is typed-schema regression checks. Cheap to write, expensive to skip.
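A typed-schema regression check can be very small. The sketch below validates a tool call's arguments against a hand-written schema of types and allowed enum values. The schema format, the `limit` parameter, and the `DISABLED` value are assumptions for illustration; only `customer_account_status` and `ENABLED` come from the incident described above.

```python
def schema_errors(args: dict, schema: dict) -> list[str]:
    """Compare tool-call arguments against a schema of types and enums."""
    errors = []
    for name, rule in schema.items():
        if name not in args:
            errors.append(f"missing:{name}")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"type:{name}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"enum:{name}")
    # Parameters the schema does not define are failures as well.
    errors.extend(f"unexpected:{name}" for name in args if name not in schema)
    return errors


# Hypothetical schema for a customer-segment tool.
SEGMENT_SCHEMA = {
    "customer_account_status": {"type": str, "enum": {"ENABLED", "DISABLED"}},
    "limit": {"type": int},
}

good_call = {"customer_account_status": "ENABLED", "limit": 50}
bad_call = {"customer_account_status": "enabled", "customer_tags": "enabled"}

good_errors = schema_errors(good_call, SEGMENT_SCHEMA)
bad_errors = schema_errors(bad_call, SEGMENT_SCHEMA)
print(good_errors, bad_errors)
```

The check never looks at the tool's output. It only asserts that the call the agent tried to make fits the interface, which is where a hallucinated enum value or a stray field shows up first.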

How to apply this to your stack

Five named implications, ordered by leverage.

1. Write the eval contract before the first topic. Document the eight items above as a one-page checklist in the repo. If the contract does not fit on one page, the agent's scope is too wide. Cut scope.
2. Separate the author from the tester. The person who built the topic cannot stress-test their own work. Foss is explicit: recruit a naive user for the 45-minute exploratory session, every sprint, no exceptions. The bias is structural, not a willpower problem.
3. Mine production weekly. Reserve one hour to read failed conversation traces. Add every new failure mode to the bank. The bank that does not grow is the bank that gets stale.
4. Gate publish on the bank. Every CI pipeline for your agent should run the regression suite before deploy. In Copilot Studio, this is Power Platform pipelines. In your own stack, GitHub Actions calling Inspect AI or Braintrust does the same job.
5. Hold out a sealed validation split. Prompt iteration never touches it. Refresh it quarterly from new production traces. Without this split you are tuning to your test set.
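The weekly mining step is mostly bookkeeping, and a sketch helps show how little code it needs. Below, failed traces are folded into the bank and near-duplicate utterances are dropped. The trace fields (`utterance`, `failure_mode`) and the normalization rule are assumptions; a real bank would likely live in a versioned file next to the agent definition.

```python
def normalize(utterance: str) -> str:
    """Collapse case and whitespace so trivial variants count as one case."""
    return " ".join(utterance.lower().split())


def add_failures(bank: dict[str, dict], failed_traces: list[dict]) -> int:
    """Add unseen failed utterances to the bank; return how many were new."""
    added = 0
    for trace in failed_traces:
        key = normalize(trace["utterance"])
        if key not in bank:
            bank[key] = {
                "utterance": trace["utterance"],
                "failure_mode": trace["failure_mode"],
            }
            added += 1
    return added


bank: dict[str, dict] = {}
this_week = [
    {"utterance": "Reset my  PW", "failure_mode": "wrong topic fired"},
    {"utterance": "reset my pw", "failure_mode": "wrong topic fired"},
    {"utterance": "wo ist meine Rechnung", "failure_mode": "fallback"},
]
new_cases = add_failures(bank, this_week)
print(new_cases, len(bank))
```

The count of new cases per week is a useful number to watch: if it stays at zero while the fallback rate does not, nobody is reading the traces.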

I have seen teams skip steps 1, 2, and 3 with confidence. None of them shipped an agent that survived six months in production without a major rework.

Takeaways:

Your eval suite is the agent. The model is a swappable substrate that gets cheaper every quarter.
Decompose every agent into the four layers: intent, grounding, actions, conversation flow. Each one fails differently.
Grade outcomes, not paths. Path-grading overfits to today's model.
Build the eval contract on one page. If it does not fit, the scope is too wide.
Mine production logs weekly. The bank that does not grow is the bank that lies to you.

Open questions

I hold this thesis with conviction. Parts of it will probably age badly. Three uncertainties.

How small can the eval bank be before it stops compounding? Hamel Husain's critique-shadowing methodology points at theoretical saturation around 100 traces. Foss recommends 10 to 15 regression phrases minimum, 30 to 50 ideal. My instinct says the floor is closer to 50 for non-trivial production agents, but I do not have a controlled study comparing bank size to caught-regression rate.

How do you evaluate autonomous multi-step agents whose intermediate states matter as much as the final output? Anthropic's "grade outcomes not paths" works cleanly for single-turn or short-horizon agents. For long-horizon trajectories with conditional branching, outcome-only grading misses failures that compound silently. I have not seen a clean treatment of evaluating a trajectory that works at production scale.

What is the right refresh cadence for the sealed validation split? Too often and the held-out signal disappears. Too rarely and the split drifts away from production reality. Quarterly feels right. I would change my mind quickly with evidence.

I expect parts of this to age. If you have counter-evidence (a team where the eval bank did not outlast the model, with numbers), get in touch.
