The same question to the same agent on the same day gives a different answer. Your test framework has not caught up yet.

In 2025 a developer on a Microsoft community forum spent two days debugging a Copilot Studio agent that "randomly" stopped triggering the right topic. The agent had not changed. The prompt had not changed. The connected SharePoint document had been edited by a colleague to "clean up the wording," and the trigger phrase the agent had been matching on quietly disappeared. There was no failing test. There was no alert. There was a user typing the same question they had typed last week and getting a politely unhelpful answer. That is the shape of every agent failure I have seen in production. Silent, sourceless, and immune to any test that asserts on output strings.

An agent test is a hypothesis about a behavior distribution, scored on a statistical pass rate, not an assertion on a string. The same-input-same-output assumption that thirty years of software testing rested on is gone, and the pass condition has to change shape with it. I worked through the strategic case for the eval bank as the surviving artifact in "Your eval suite is the agent, not the model." This piece is the technical companion: why even temperature zero will not save you, and what the YAML actually looks like when you stop pretending it can.

Temperature zero will not save you

The default reflex is to set temperature to zero, write the assertion, and move on. I have written that exact code. It is wrong, and it is wrong in a way that is worth understanding because the misunderstanding is structural, not careless. Temperature is a sampling parameter on the output distribution. Setting it to zero collapses sampling to greedy decoding: at each step the highest-probability token wins. Two identical prompts at temperature zero should, in principle, produce identical token streams.

In practice they do not. Three mechanisms break the guarantee.

The first is that mixed-precision inference (BF16 or FP16 weights and activations) introduces rounding error at every matrix multiply, and the size of that error depends on the order in which partial sums are accumulated. Two runs of the same prompt can land on different "highest-probability" tokens at any position where the top two candidates are separated by a margin smaller than that rounding error. The paper "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference" measures how often this happens across the Llama, Qwen, and DeepSeek model families and finds it is not rare.

The second is floating-point non-associativity under batching. When an inference server batches your request with other tenants' requests for throughput, the order of floating-point additions inside each kernel changes with batch size, and because floating-point addition is not associative, the sums can differ in their last bits. The Thinking Machines Lab analysis traces this directly: same input, different batch composition at the server, different output. You have no control over the batch composition from inside your test. "Non-Determinism of 'Deterministic' LLM Settings" measures the resulting variance across providers in the "deterministic" setting and finds it is non-zero everywhere.

The third is silent vendor updates. You cannot assume that the weights and serving stack behind gpt-4o or claude-sonnet-4-7 or any other commercial endpoint are the same on Tuesday as they were on Monday. Even pinned-version endpoints can pick up fine-tuning patches, safety updates, and kernel-level optimizations that shift outputs at the margin. Your "deterministic" test is reproducing a snapshot of the vendor's infrastructure that no longer exists.
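The second mechanism is easy to reproduce without a GPU. The sketch below is a toy, not a model of any real inference kernel: it sums the same three numbers in two groupings, then shows how a one-bit difference in a logit can flip a greedy pick when two candidates are nearly tied. The `argmax` helper and the two-token logit lists are illustrative assumptions.

```python
# Floating-point addition is not associative: the grouping changes the rounded result.
left_to_right = (0.1 + 0.2) + 0.3   # one accumulation order
right_to_left = 0.1 + (0.2 + 0.3)   # same numbers, different order

print(left_to_right == right_to_left)  # False


def argmax(logits):
    """Greedy decoding step: index of the largest logit (first index wins a tie)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical near-tie between two tokens. Token 0 has a fixed logit;
# token 1's logit is the same sum accumulated in two different orders,
# standing in for two different batch compositions on the server.
logits_batch_a = [0.6, left_to_right]
logits_batch_b = [0.6, right_to_left]

print(argmax(logits_batch_a), argmax(logits_batch_b))  # different tokens
```

Once one token differs, every later token is conditioned on a different prefix, so a last-bit difference can grow into a visibly different answer.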

Stop fighting for determinism. Determinism is not the property you actually want. The property you want is a stable, measurable distribution of behavior over a test bank, with a known acceptance threshold and enough samples to make that threshold mean something. That is a different artifact, and it is the one that survives.

The four layers, with a different pass condition

The decomposition I use is Nivian Foss's four layers (prompt and intent, knowledge and grounding, actions and connectors, conversation flow), which the sibling piece lays out in full. The point worth restating here is the one specific to non-determinism: the categories survive the move from software testing, the pass condition does not. You no longer assert that agent.respond("how much does it cost?") equals a fixed string. You assert that the same phrase, across N samples, fires the pricing topic in a high enough fraction of runs that the suite would catch a regression you care about. The unit becomes statistical, and the threshold has to be sized against the variance of the metric, not picked by feel.

Small N is honest, but it is not a gate

Here is the part most eval write-ups skip. If trigger accuracy on a single case can bounce from 80% to 100% (4 of 5 to 5 of 5) across consecutive five-sample runs of the same case on the same day, then a 0.95 threshold on a five-sample run is statistical theater. With N=5, the achievable pass rates are 0, 0.2, 0.4, 0.6, 0.8, 1.0, so 0.95 simply means 5 of 5. With N=3, they are 0, 0.33, 0.67, 1.0. A pass_threshold: 0.66 on three samples is "2 of 3 must pass" with extra decimal places. The fraction makes it look principled. It is not.

There are two honest moves. The first is to bump N substantially — into the 20-to-50-samples-per-case range when a case actually gates a release, accepting the cost — and report a confidence interval, not a single fraction. The second is to keep small N for development-time signal and explicitly label those banks "directional, not gate-worthy." Both are fine. Pretending a five-sample pass rate of 0.95 is a release gate is not. The YAML below shows the small-N version because that is what most teams start with; treat the thresholds as a development signal until N grows.
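To make the small-N argument concrete, here is a minimal sketch using only the standard library. It lists the pass rates a given N can actually produce, converts a decimal threshold into the integer count it really means, and computes a Wilson score interval for an observed pass count. The function names are illustrative, and the Wilson interval is one reasonable choice of confidence interval, not the only one.

```python
import math


def achievable_rates(n):
    """Every pass rate that n samples can actually produce."""
    return [k / n for k in range(n + 1)]


def min_passes_for(threshold, n):
    """Smallest integer pass count whose rate meets a decimal threshold."""
    # The small epsilon guards against float noise when threshold * n is a whole number.
    return math.ceil(threshold * n - 1e-9)


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


print(achievable_rates(5))
print(min_passes_for(0.95, 5), min_passes_for(0.66, 3))
print(wilson_interval(5, 5))    # a perfect run at N=5
print(wilson_interval(50, 50))  # a perfect run at N=50
```

Under this interval, a perfect 5-of-5 run has a lower bound of about 0.57, while a perfect 50-of-50 run has a lower bound of about 0.93. Both runs report a pass rate of 1.0, but only the larger one says much about the true rate.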

What the eval loop looks like in practice

The smallest credible eval loop for a Copilot Studio agent or any equivalent — the same shape works for LangSmith, Promptfoo, or a hand-rolled Python script against the Anthropic SDK — starts with the bank.

Anthropic's Demystifying evals (January 2026) suggests 20 to 50 cases drawn from real failures as the right floor. Foss accepts 10 to 15 as a practical minimum for a new agent. I use 12 as a starter for a three-topic agent because it covers each topic three ways on the happy path plus a flow case, an adversarial case, and unhappy paths — but the bank should grow to 20-50 the week the agent ships. Build the bank from three sources: known production failures (highest-signal cases you will ever write), synonyms and acronyms of high-traffic phrases, and one adversarial set generated by a user who has never seen the agent.

# evals/prospect-agent.yaml
cases:
  - id: pricing-direct
    layer: prompt-intent
    input: "How much does the enterprise plan cost?"
    expected_topic: pricing-and-plans
    samples: 5
    min_passes: 5            # 5/5 — high-traffic phrase, zero tolerance
  - id: pricing-typo
    layer: prompt-intent
    input: "wht does the enterprize plan cost"
    expected_topic: pricing-and-plans
    samples: 5
    min_passes: 4            # 4/5 — typo path, tolerate one miss
  - id: grounding-fresh
    layer: knowledge-grounding
    input: "What is the current enterprise tier price?"
    expected_citation_url_contains: "/pricing-2026"
    samples: 3
    min_passes: 3            # 3/3 — citation correctness is binary
  - id: connector-demo
    layer: actions-connectors
    input: "I want to book a demo for next Tuesday"
    expected_action: book_demo
    expected_parameters:
      meeting_type: "demo"
      preferred_day_set: true
    samples: 3
    min_passes: 3            # 3/3 — action contract is binary
  - id: flow-unhappy
    layer: conversation-flow
    input_sequence:
      - "how much does it cost?"
      - "I don't like that answer"
      - "actually never mind"
    expected_outcome: graceful_exit
    samples: 3
    min_passes: 2            # 2/3 — directional only; raise N before gating

Each case names its layer, declares the structural property to check, and sets the pass condition as an integer count of successful samples. Integer counts because N is small. The runner samples the agent N times per case, scores each sample with the appropriate grader (exact match for topic name, URL substring for citations, schema check for connector parameters, LLM-as-judge for outcome-level flow checks), and reports per-case pass counts plus an aggregate. Once the bank passes roughly 50 cases, split the YAML by layer (evals/layer-1-prompt-intent.yaml and three siblings); single-file readability breaks down around that mark.
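The runner loop described above can be sketched in a few lines. This is a hand-rolled illustration, not the Promptfoo or Copilot Studio runner: `Case`, `run_bank`, and `make_stub_agent` are hypothetical names, the bank is written as Python objects rather than loaded from YAML so the example stays standard-library only, and only the exact-match topic grader is shown. The stub agent misroutes the typo'd phrase on a fixed pattern so the output is repeatable; a real agent call would go in its place.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Case:
    id: str
    input: str
    expected_topic: str
    samples: int
    min_passes: int


# Two cases mirroring the first two entries of the YAML bank.
CASES = [
    Case("pricing-direct", "How much does the enterprise plan cost?",
         "pricing-and-plans", samples=5, min_passes=5),
    Case("pricing-typo", "wht does the enterprize plan cost",
         "pricing-and-plans", samples=5, min_passes=4),
]


def make_stub_agent() -> Callable[[str], str]:
    """Hypothetical stand-in for the agent: returns the topic it would fire."""
    calls = {"n": 0}

    def agent(text: str) -> str:
        calls["n"] += 1
        # Misroute the typo'd phrase on every even-numbered call.
        if "enterprize" in text and calls["n"] % 2 == 0:
            return "product-info"
        return "pricing-and-plans"

    return agent


def run_bank(cases: List[Case],
             agent: Callable[[str], str]) -> Dict[str, Tuple[int, int, bool]]:
    """Sample each case N times; report (passes, samples, met_min_passes)."""
    results = {}
    for case in cases:
        passes = sum(agent(case.input) == case.expected_topic
                     for _ in range(case.samples))
        results[case.id] = (passes, case.samples, passes >= case.min_passes)
    return results


for case_id, (passes, samples, ok) in run_bank(CASES, make_stub_agent()).items():
    print(f"{case_id}: {passes}/{samples} {'PASS' if ok else 'FAIL'}")
```

The other graders (URL substring, parameter schema, judge) would slot in as alternative comparison functions selected per case; the sampling loop and the integer pass condition stay the same.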

LLM-as-judge deserves a clean caveat. The Zheng et al. NeurIPS paper is the citation that legitimizes the category: in the MT-Bench setup, GPT-4 reached roughly 80% agreement with human raters, comparable to inter-human agreement on the same tasks. That result is a useful proof of concept, not a domain-general guarantee — the agreement rate on your domain will differ, often substantially. Eugene Yan's follow-up is the citation that keeps it honest: a judge cannot compensate for process neglect. Use a judge for outcomes that resist programmatic checks (was the answer relevant? was the tone appropriate?), and pair it with a small calibrated human-review set on every run so the judge itself is being graded against your domain.
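One way to grade the judge against the human-review set is to compute raw agreement and a chance-corrected score such as Cohen's kappa over the cases both have labeled. The sketch below assumes binary pass/fail verdicts; the label lists are made-up illustration data, and kappa is a suggestion here rather than something either cited source prescribes.

```python
def agreement_rate(judge, human):
    """Fraction of cases where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for what two raters would match on by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_pass = sum(judge) / n
    human_pass = sum(human) / n
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined, report 1.0.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up verdicts on ten reviewed samples (True = acceptable answer).
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True, True, False]

print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

Tracking these two numbers per run gives the judge its own time series, so a drop in agreement shows up as a judge problem rather than being misread as an agent regression.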

Worked example: the prospect-engagement agent

The demo agent Foss walks through in her talk is a Copilot Studio prospect-engagement assistant for a fictional company called Lummetra — a small B2B SaaS with a pricing page, a product page, and a demo-booking flow. The agent does three things in scope: answer product questions, recommend a pricing tier based on team size, and book a demo by collecting required parameters. Three topics: product-info, pricing-and-plans, book-demo.

The starter bank is twelve cases. Three per topic on the happy path (one direct phrasing, one synonym, one typo). Two on the unhappy path across topics (user changes their mind, user asks something out of scope). One conversation-flow case across all three topics (pricing → products → book). One adversarial case (user tries to get the agent to discuss a competitor). Total: twelve cases, each sampled three to five times. Twelve is a floor, not a target — push it past 20 the week the agent ships.

On the first run against the agent, replaying Foss's walkthrough, two cases fail. The typo'd pricing phrase fires the product-info topic 3 times out of 5 instead of pricing-and-plans. The connector test for book_demo is missing the preferred_day_set parameter on one of three samples because the agent decided to confirm the day in a follow-up question. The first failure is a trigger-phrase coverage gap; the fix is to add the typo'd variant to the trigger phrases for pricing-and-plans. The second is a fixture problem in the test: the expected behavior is conversational, and the test was asserting too early.

That is what twelve cases buys you. Two real findings, one of them an agent bug, one of them a test bug. Both worth fixing. Neither would have surfaced from manually typing into the test canvas, because the test canvas catches the first sample and moves on. Foss's analytics dashboard would have surfaced them eventually, in production, after users hit them.

The same YAML bank, with no changes, runs from a Promptfoo CLI step in a GitHub Actions workflow and from a Power Automate flow inside the Copilot Studio Power Platform pipeline. I have driven both against the same agent; the per-case pass counts agree within the noise floor you would expect from N=5. The bank is the artifact. The runners are interchangeable, which matters because the runner you start with is rarely the one you end with.

How to apply this in your own stack

Five concrete moves, not generalities.

First, name the four layers explicitly in your repo. Create evals/layer-1-prompt-intent.yaml and the three siblings once the bank crosses ~50 cases. The layer of a failing test is the layer where the fix lands; mixing layers in one bank makes the failure-to-fix routing harder.

Second, set per-case sample counts and pass thresholds as integer counts, never as global decimals. A typo case for a low-traffic topic might pass at 3/5; a direct phrasing for your highest-traffic topic should pass at 5/5. A single global threshold collapses the difference.

Third, gate publish on the bank. Foss is right that "tests must pass before publishing" but underspecifies how to enforce it. The mechanism is the REST API or a Power Automate flow; the gate is a Power Platform pipeline step. Same idea as a Braintrust PR-gate or a LangSmith CI step. Salesforce Agentforce, ServiceNow's agent platform, and several LangSmith-wrapped no-code tools have shipped similar primitives; Copilot Studio is not unique in offering this, but it is the one I have driven end-to-end, which is why this piece leans on it. The platform name is a detail; the gate is the substance.

Fourth, store the per-run results. SharePoint works. A SQLite file in the repo works. The point is that a passing run today is only useful if you can compare it to last month's passing run and see whether the trigger-accuracy distribution has drifted. Without the time series, regression testing is wishful.

Fifth, schedule a forty-five-minute exploratory session per sprint with a user who has never seen the agent. Foss makes this a footnote; it should be a calendar event. Every adversarial case that ends up in your bank started as a free-form session that exposed a failure mode you would not have written down.
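The third and fourth moves can be sketched together: store each run's per-case counts, then turn the latest run into an exit code that a pipeline step can gate on. This is a minimal sketch using Python's built-in sqlite3 module. The table layout, function names, and dates are assumptions for illustration, and the Power Platform or REST wiring is deliberately left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the results store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               run_id     TEXT NOT NULL,
               run_date   TEXT NOT NULL,   -- ISO date, so text order is time order
               case_id    TEXT NOT NULL,
               passes     INTEGER NOT NULL,
               samples    INTEGER NOT NULL,
               min_passes INTEGER NOT NULL,
               PRIMARY KEY (run_id, case_id)
           )"""
    )
    return conn


def record_run(conn, run_id, run_date, outcomes):
    """outcomes maps case_id -> (passes, samples, min_passes)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
            [(run_id, run_date, cid, p, n, m) for cid, (p, n, m) in outcomes.items()],
        )


def case_history(conn, case_id):
    """Pass counts for one case over time, oldest first."""
    return conn.execute(
        "SELECT run_date, passes, samples FROM results "
        "WHERE case_id = ? ORDER BY run_date",
        (case_id,),
    ).fetchall()


def gate_exit_code(outcomes):
    """Return 0 if every case met its min_passes, 1 otherwise."""
    failures = [cid for cid, (p, _n, m) in outcomes.items() if p < m]
    for cid in failures:
        print(f"GATE FAIL: {cid}")
    return 1 if failures else 0
```

In a CI step the script would end with `sys.exit(gate_exit_code(outcomes))`, since a non-zero exit code is what fails the step. `case_history` is the query that makes drift visible: the same case at 5/5 last month and 3/5 today is a finding even if both runs were nominally "green" under a loose threshold.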

Takeaways

Treat the eval bank as a versioned artifact next to the agent definition; the runners are swappable.
Score per-case pass conditions as integer counts (4/5, 2/3), not decimal thresholds that mask the underlying N.
Bump N to 20-50 per case before a result is allowed to gate a release; below that, label the bank "directional, not gate-worthy."
Name the four layers (prompt, knowledge, action, flow) in the bank file structure so failures route themselves to a fix.
Pair LLM-as-judge with a small human-calibration set; judges drift and need to be graded too.

Open questions

I am not confident about three things and would rather hear from people who have shipped against them.

First, the right sample size per case. I write three to five for development and that feels low. The honest answer is probably that the sample count is a function of the variance of the metric and we should be choosing it from a confidence-interval calculation, not a habit — but I have not seen a clean published heuristic that maps domain to required N.

Second, the calibration cadence for LLM-as-judge. Once a quarter is what I do; I suspect once a week would catch judge drift faster but the human-review labor is real. I do not know the right tradeoff.

Third, the interaction between layer-one trigger accuracy and the underlying model's instruction-following improvements. When the model upgrades, trigger accuracy often goes up "for free" across the board. That is a regression risk in disguise: the threshold that caught failures last month may pass everything this month, including the failures. The bank needs to evolve with the floor, and I do not have a clean process for that yet.
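On the first open question, the textbook starting point is the normal-approximation sample-size formula for a proportion, n = z² · p(1 − p) / e², where p is the expected pass rate and e is the margin of error you can tolerate. The sketch below is only that formula; it is not a domain-specific heuristic, it needs a guess at p before you have data, and the approximation is weak when p is close to 0 or 1. The function name is illustrative.

```python
import math


def samples_for_margin(expected_rate, margin, z=1.96):
    """Samples needed so a ~95% interval is roughly +/- margin wide.

    Normal approximation: n = z^2 * p * (1 - p) / margin^2.
    """
    if not 0 <= expected_rate <= 1:
        raise ValueError("expected_rate must be between 0 and 1")
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return math.ceil(z * z * expected_rate * (1 - expected_rate) / (margin * margin))


print(samples_for_margin(0.9, 0.10))  # case expected to pass ~90% of the time
print(samples_for_margin(0.5, 0.10))  # worst case: maximum variance at p = 0.5
```

For a case expected to pass about 90% of the time with a tolerance of ten points either way, this gives 35 samples, which falls inside the 20-to-50 range suggested earlier. The worst case at p = 0.5 needs 97.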

I expect parts of this to age badly. If you are running an agent in production and your test bank looks nothing like the shape above, the fastest way to change my mind is a concrete counterexample: what does your bank catch that this shape does not, and what does the suite cost to maintain? Get in touch — I read everything.
