Production agents need runtime scorers, not just pre-ship evals

You shipped the agent. The CI suite was green. A week later somebody pastes a real user transcript into Slack and it is wrong in a way none of your test cases predicted. That is the prototype-to-production cliff, and it is not solved by adding more rows to the pre-ship suite. The suite that catches it has to run in production: sampling real traffic, writing scores back to traces, and gating rollouts before users feel the regression.

Key takeaways

  • Pre-ship eval suites freeze a distribution; production traffic drifts away from that distribution within days of launch.
  • Runtime evals are a distinct artifact: online scorers, shadow traffic, and regression gates wired into the deployment path.
  • An LLM judge weaker than the model it scores produces confident noise and decorates regressions instead of catching them.
  • Observability answers "did it run"; runtime evals answer "was it right"; conflating them is the most expensive mistake in the stack.
  • A 1 percent Haiku judge sample on a million traces per month is 10,000 judge calls, near $10 at roughly $0.001 per call: no economic excuse for shipping blind.

In this article

  • The prototype-to-production cliff in plain language
  • Why the CI suite stops working the moment users arrive
  • What a runtime eval actually is
  • The four-surface model production teams converge on
  • An SMB stack the platform lead can ship Monday
  • Where runtime evals lose: latency, cost, judge calibration
  • Wiring regression gates into the deploy path, not the PR path
  • Replaying the BRK241 walkthrough as one data point
  • What I would build first

The prototype-to-production cliff in plain language

Before deploy, you control the inputs. You wrote the test cases. You picked the prompts. The agent runs against a frozen distribution and a frozen rubric, and a green CI build means the agent passes the test you imagined.

After deploy, you do not control the inputs. Real users ask things your test bank never anticipated. The retrieval corpus drifts as documents change. The vendor patches the model behind the API and your prompts respond differently overnight. The failure distribution shifts faster than a static test suite can encode, and the test suite that shipped is now scoring an agent it no longer matches.

The fix is not "write more tests." The fix is a second artifact that runs in production: scorers that read live traces, score them on the same rubric your CI suite uses, and surface drift before the next release goes out. That is what I mean by runtime evals. My sibling post, "Models depreciate, eval suites compound," covered the pre-ship half of this contract: how to build the bank, calibrate the judge, and keep the CI suite worth running. This post is the post-ship half, the diptych's other panel.

Why the CI suite stops working the moment users arrive

A CI suite is a snapshot. You picked 200 inputs, labelled the right answers, and the suite tells you whether today's prompt beats yesterday's prompt against that snapshot. The snapshot is honest only as long as production traffic looks like it.

Three forces move production away from the snapshot:

  • User drift. Real users do not write the way PMs imagine. The top failure modes after a month live are almost always classes of input you never wrote a test for.
  • Corpus drift. RAG systems retrieve from a corpus that is being updated by other humans. The same prompt against the same model can change answers because the underlying documents changed.
  • Vendor drift. Frontier model snapshots get patched. The same API endpoint behaves differently month over month, sometimes silently, sometimes documented in a release note nobody on your team subscribed to.

💡 The eval suite that catches real drift is built from real production failures, not from prompts a PM imagined last quarter. The CI bank ages out the moment production starts emitting failures the bank never saw.

The Foundry concept page on agent observability names the production half directly: "Continuous evaluation: Quality and safety evaluation of production traffic at a sampled rate" and "Scheduled evaluation: Scheduled quality and safety evaluation using test datasets to detect system drift." Those are two different surfaces from CI. See Microsoft's observability concept page for the verbatim framing.

What a runtime eval actually is

A runtime eval is a scorer that runs against live production traces, on a sampled subset, and writes the score back into the trace store so you can query it the same way you query latency and token counts.

Three concrete shapes show up:

  1. Inline scorer. The judge runs in the request path, blocks on the score, and the agent can branch on it (retry, escalate to human, return a fallback). High signal, high cost, only worth it on small fractions of traffic.
  2. Async scorer. The trace ships to a queue, the judge scores it after the user response is sent, and the score writes back to the trace. The user sees nothing extra; the team sees drift dashboards. Most teams should start here.
  3. Scheduled re-run. A golden set runs against the current production model and prompt on a cron, and any regression beyond a threshold pages the on-call. This is the production extension of your CI bank.

The async pattern is what Langfuse model-based evaluations and Braintrust online evals ship by default. Both let you point a judge prompt at a sampled stream of production traces and write scores back. The CI vendor and the runtime vendor have collapsed into the same product for most teams.
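
To make the async shape concrete, here is a minimal sketch in plain Python. It is not the Langfuse or Braintrust API: the in-memory queue, the `scores` dict standing in for the trace store, and `toy_judge` are all hypothetical placeholders for whatever your stack provides. The only ideas it shows are deterministic sampling by trace ID and scoring off the request path.

```python
import hashlib
import queue
import threading

def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < rate * 2**64

def run_scorer(traces: queue.Queue, judge, score_store: dict, rate: float) -> None:
    """Drain the queue off the request path and write scores back by trace ID."""
    while True:
        trace = traces.get()
        if trace is None:                        # sentinel: stop the worker
            break
        if is_sampled(trace["id"], rate):
            score_store[trace["id"]] = judge(trace["input"], trace["output"])

def toy_judge(user_input: str, output: str) -> float:
    """Stand-in for the LLM judge call (assumption: scores live in [0, 1])."""
    return 1.0 if output.strip() else 0.0

traces: queue.Queue = queue.Queue()
scores: dict = {}                                # stands in for the trace store
worker = threading.Thread(target=run_scorer, args=(traces, toy_judge, scores, 1.0))
worker.start()

# The request handler only enqueues; the user response is already sent.
traces.put({"id": "t-1", "input": "reset my password", "output": "Use the reset link."})
traces.put({"id": "t-2", "input": "cancel my plan", "output": ""})
traces.put(None)
worker.join()
print(scores)
```

Hashing the trace ID rather than calling a random number generator means a re-processed trace gets the same sampling decision, which keeps dashboards stable across retries.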

Monday move: pick one trace store (Langfuse or Braintrust if you want managed, Arize Phoenix if you need MIT-licensed self-host), enable trace export on your agent, and write a single judge prompt for the failure mode that bit you most recently in prod.

The four-surface model production teams converge on

I have watched enough teams cross the cliff to see the stack converge on four surfaces. None replaces the others.

  • Offline / CI evals. Inspect AI for the agent-execution harness, OpenAI Evals for the canonical task format, Promptfoo for adversarial probing. The pre-flight checklist.
  • Online evals. Sampled LLM-as-judge on live traces. Langfuse, Braintrust, Arize Phoenix, LangSmith online evaluators, Humanloop. The cockpit instruments.
  • Observability. OpenTelemetry traces, latency, token cost, error rate. Tells you the agent ran. Does not tell you it was right.
  • Drift detection. Scheduled re-runs of the golden set against the current production stack. Catches vendor patches before users do.

Conflating observability with evals is the single most expensive mistake I watch teams make. A clean latency dashboard with a 99 percent uptime number tells you nothing about whether the answers were any good. The trace store is necessary, not sufficient.

An SMB stack the platform lead can ship Monday

Concrete scenario, because abstractions do not survive a Monday morning. You are the platform lead at a 90-person SaaS shop. You shipped a support-triage agent six weeks ago. CI is green. Customer success is forwarding screenshots of wrong answers. You have no observability budget approval.

Here is the stack I would stand up before lunch:

  • Trace store. Self-host Langfuse on the cheapest VPS you already own. Free, OSS, Langfuse self-host docs cover docker-compose in under an hour.
  • Judge model. Claude Haiku 4.5 at roughly $0.001 per judge call at typical eval-prompt sizes. At 1 percent sampling on a million traces per month, that is 10,000 judge calls, or near $10 in judge spend per month. Scoring every trace would land near $1,000 per month, the SMB threshold above which sampling becomes mandatory.
  • Golden set. A 50-row Google Sheet. The platform lead and one customer-success rep label inputs and ideal outputs from last month's transcripts. This is the regression set.
  • CI gate. GitHub Actions running Inspect AI against the 50-row set on every PR that touches the prompt directory. Block merge on regression beyond two rows.
  • Drift cron. A nightly GitHub Actions cron re-runs the same 50-row set against the live production stack. If the score drops by more than 5 percent compared to a 7-day baseline, post to the team Slack.
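
The "block merge on regression beyond two rows" rule in the CI gate can be sketched as a small comparison between two runs of the golden set. This is an illustration, not Inspect AI's API: the `regressed_rows` and `gate_passes` helpers and the row-ID-to-pass/fail dicts are hypothetical shapes you would fill from your own eval logs.

```python
def regressed_rows(baseline: dict, candidate: dict) -> list:
    """Row IDs that passed on the baseline run but fail (or are missing) on the candidate."""
    return sorted(
        row for row, passed in baseline.items()
        if passed and not candidate.get(row, False)
    )

def gate_passes(baseline: dict, candidate: dict, max_regressions: int = 2) -> bool:
    """Allow the merge only if at most max_regressions rows regressed."""
    return len(regressed_rows(baseline, candidate)) <= max_regressions

# 50-row golden set, all green on main.
main_run = {f"row-{i:02d}": True for i in range(50)}

# PR run: three rows that used to pass now fail.
pr_run = dict(main_run)
for row in ("row-03", "row-17", "row-41"):
    pr_run[row] = False

print(regressed_rows(main_run, pr_run))
print("merge allowed" if gate_passes(main_run, pr_run) else "merge blocked")
```

Counting rows that flipped from pass to fail, rather than comparing aggregate pass rates, keeps a PR from hiding three new failures behind three unrelated fixes.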

Monday output: a Slack channel that pings when the agent regresses. Measurable outcome: time-to-detect a regression drops from "a customer-success forward, days later" to "tomorrow morning, before standup." That is the only metric that matters for the first iteration.

# Bootstrap the Monday stack
git clone https://github.com/langfuse/langfuse
cd langfuse && docker compose up -d              # trace store, port 3000

# CI gate, .github/workflows/evals.yml
# Note: --max-samples sets how many samples run in parallel, not the row count
inspect eval golden_set.py --model anthropic/claude-haiku-4-5 \
  --max-samples 50 --log-dir ./eval-logs

# Drift cron, runs at 02:00 UTC (a crontab entry must stay on a single line)
0 2 * * * inspect eval golden_set.py --model anthropic/claude-haiku-4-5 --tags drift,prod --fail-on-error
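
The cron line above only runs the golden set; the alerting rule from the drift-cron bullet still needs a comparison step. A minimal sketch, assuming scores in [0, 1] and reading "5 percent" as a relative drop against the mean of the last seven nightly scores (the post does not say relative or absolute, so treat that as my assumption). `drift_alert` is a hypothetical helper, and posting to Slack is left out.

```python
from statistics import mean

def drift_alert(last_seven_nights: list, tonight: float, max_rel_drop: float = 0.05) -> bool:
    """True when tonight's golden-set score fell more than max_rel_drop below the baseline."""
    baseline = mean(last_seven_nights)
    if baseline <= 0:
        return False          # nothing meaningful to compare against
    return (baseline - tonight) / baseline > max_rel_drop

history = [0.90, 0.90, 0.90, 0.90, 0.90, 0.90, 0.90]
print(drift_alert(history, tonight=0.87))   # about a 3.3 percent drop
print(drift_alert(history, tonight=0.84))   # about a 6.7 percent drop
```

On a 50-row set a single row is 2 points of score, so a 5 percent threshold fires at roughly three flipped rows; tighten or loosen it with that granularity in mind.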

Where runtime evals lose: latency, cost, judge calibration

I will not pretend runtime evals are free. Three honest tradeoffs decide whether they pay back on your stack.

Latency. An inline judge adds the judge model's full round-trip to your user-facing path. On a Haiku-class judge that is a few hundred milliseconds, on a frontier judge it can be several seconds. My rule of thumb: at sub-second SLAs or chat streaming, inline judges are a non-starter; async scoring is the only viable shape.
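
One way to keep an inline judge from blowing the SLA is to give it a hard time budget and fall through when it misses. The sketch below is my own illustration, not a vendor pattern: the 300 ms budget, the 0.7 score floor, the fallback copy, and the `answer_with_inline_judge` helper are all assumed values and names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as JudgeTimeout

_POOL = ThreadPoolExecutor(max_workers=4)
FALLBACK = "Let me hand this to a human teammate."   # hypothetical fallback copy

def answer_with_inline_judge(draft, judge, budget_s=0.3, min_score=0.7):
    """Return (text_to_send, score); score is None when the judge missed the budget."""
    future = _POOL.submit(judge, draft)
    try:
        score = future.result(timeout=budget_s)
    except JudgeTimeout:
        # Budget exceeded: ship the draft unscored and leave scoring to the async path.
        return draft, None
    if score < min_score:
        return FALLBACK, score                       # branch on the score
    return draft, score

def slow_judge(draft):
    time.sleep(0.5)                                  # simulates a slow judge round-trip
    return 0.9

print(answer_with_inline_judge("Your refund is on its way.", lambda d: 0.9))
print(answer_with_inline_judge("I am not sure.", lambda d: 0.2))
print(answer_with_inline_judge("Your refund is on its way.", slow_judge))
```

The third call shows why inline judging degrades under latency pressure: once the budget is missed, the request is effectively unjudged, which is the async shape with extra steps.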

Cost. Sampling fixes the unit cost but not the audit cost. At low QPS the math is trivial. The number I keep in my head: above roughly 10 QPS sustained with a frontier-class judge, the judge bill can exceed the model bill it scores. The lever is sampling rate, but the lower you sample, the longer it takes to detect a drift event with statistical confidence. Below 100 sampled traces per day per failure mode, the noise floor swallows real signal.
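
The sampling tradeoff is plain arithmetic, so it is worth writing down. The sketch below uses the post's figures ($0.001 per judge call, a million traces per month, a floor of 100 sampled traces per day per failure mode); `mode_share`, the fraction of traffic that exercises a given failure mode, is a hypothetical parameter I am adding for illustration.

```python
def judge_plan(traces_per_month, sample_rate, cost_per_call, mode_share=1.0, days=30):
    """Monthly judge spend plus sampled traces per day for one failure mode."""
    judge_calls = traces_per_month * sample_rate
    return {
        "judge_calls": judge_calls,
        "monthly_cost_usd": judge_calls * cost_per_call,
        "samples_per_day": judge_calls * mode_share / days,
    }

def min_sample_rate(traces_per_month, mode_share, floor_per_day=100, days=30):
    """Lowest sampling rate that still yields floor_per_day samples of one failure mode."""
    return min(1.0, floor_per_day * days / (traces_per_month * mode_share))

# 1 percent sampling, failure mode present in an assumed 10 percent of traffic.
plan = judge_plan(1_000_000, 0.01, 0.001, mode_share=0.10)
print(plan)

# Sampling rate needed to clear the 100-per-day floor for that failure mode.
print(min_sample_rate(1_000_000, mode_share=0.10))
```

Under these assumptions 1 percent sampling costs about $10 per month but yields only around 33 relevant traces per day, below the floor; clearing it takes roughly 3 percent sampling, or about $30 per month.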

Judge calibration. An LLM judge weaker than the model it judges produces confident noise. Hamel Husain's critique-shadowing methodology puts the bar at iterating the judge prompt until human-judge agreement passes roughly 90 percent. The position-bias paper Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge documents that "position bias is not due to random chance and varies significantly across judges and tasks." Below model parity the judge decorates regressions instead of catching them, which is worse than no judge at all because it inspires false confidence.
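
The calibration check itself is small. A minimal sketch, assuming binary pass/fail labels and raw percent agreement as the metric (the cited methodology may also look at precision and recall; raw agreement is my simplification), with a hypothetical `agreement_rate` helper:

```python
def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of traces where the judge's verdict matches the human label."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# 100 labelled traces: humans marked 30 as failures; the judge misses 12 of them.
human = ["fail"] * 30 + ["pass"] * 70
judge = ["pass"] * 12 + ["fail"] * 18 + ["pass"] * 70

rate = agreement_rate(human, judge)
print(f"agreement: {rate:.0%}")
print("point at live traffic" if rate >= 0.90 else "keep iterating the judge prompt")
```

Note that the disagreements here are all missed failures, which is the dangerous direction: a judge like this would report the agent as healthier than it is.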

The honest answer: runtime evals do not pay back at every QPS and every margin. Below maybe 1 QPS and a small user base, a daily scheduled re-run of the golden set is enough; you do not need online sampling. Above that, async sampling is the right shape, and inline is reserved for the one critical hop where you can afford to branch on the score.

Wiring regression gates into the deploy path, not the PR path

The CI eval gate blocks pull requests. The runtime regression gate blocks rollouts. These are not the same gate, and most teams only ship the first.

A rollout gate looks like this:

  1. New prompt or model variant deploys to a canary slice. Standard SRE canary ramp: 1 percent, 5 percent, 25 percent, 100 percent.
  2. At each ramp step, async scorers sample the canary's traces and the baseline's traces on the same judge prompt.
  3. If the canary's mean score drops by more than a threshold (mine starts at 3 percent absolute, tuned per surface), the ramp halts and rolls back automatically.
  4. If the canary holds for the ramp dwell time, the promoter moves to the next ramp step.

# rollout-gate.yml
canary:
  ramp: [1, 5, 25, 100]
  dwell_per_step: 30m
gate:
  scorer: haiku-4.5-judge-v3
  sample_rate: 0.05
  metric: mean_score
  threshold_abs_drop: 0.03
  baseline_window: 24h
  on_fail: rollback
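
The decision logic that config implies can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's promoter: scores are assumed to live in [0, 1] so a 3 percent absolute drop is 0.03, `min_samples` is a guard I am adding so a handful of canary traces cannot trigger a verdict, and `sample_scores` is a hypothetical callable that returns the baseline and canary scores collected during one dwell window.

```python
from statistics import mean

def gate_decision(baseline_scores, canary_scores, threshold_abs_drop=0.03, min_samples=30):
    """Return 'promote', 'rollback', or 'hold' for one ramp step."""
    if len(baseline_scores) < min_samples or len(canary_scores) < min_samples:
        return "hold"                     # not enough scored traces to judge yet
    drop = mean(baseline_scores) - mean(canary_scores)
    return "rollback" if drop > threshold_abs_drop else "promote"

def run_ramp(ramp, sample_scores):
    """Walk the ramp; stop at the first step that does not promote."""
    for percent in ramp:
        baseline, canary = sample_scores(percent)
        decision = gate_decision(baseline, canary)
        if decision != "promote":
            return percent, decision
    return ramp[-1], "promote"

def fake_scores(percent):
    """Toy data: the canary looks fine at 1 and 5 percent, then regresses at 25."""
    canary_mean = 0.89 if percent < 25 else 0.85
    return [0.90] * 100, [canary_mean] * 40

print(run_ramp([1, 5, 25, 100], fake_scores))
```

A production version would compare distributions rather than bare means, but the control flow (score both slices on the same judge, compare, halt on breach) is the part that matters.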

This is the pattern Microsoft's Foundry monitor agents how-to gestures at with its alerting and continuous-evaluation surfaces, and the pattern Braintrust documents under online evals. The shape is the same regardless of vendor; what differs is whether the rollout gate is in the same product as the trace store or a glue layer you write.

Monday move: pick the smallest surface in your agent stack (one tool call, one retrieval step) and put a sampled scorer on it before any other instrumentation. You do not need to score the whole agent on day one to learn whether the gate concept fits your team.

Replaying the BRK241 walkthrough as one data point

The Microsoft Build BRK241 session on the Microsoft Developer YouTube channel is one corroborating data point for the thesis; this section is not a recap. Two moves in the walkthrough are relevant: azd ai agent eval init bootstraps a benchmark dataset from historic production traces when no golden set exists, and azd ai agent optimize runs an automated loop that ranks candidate variants across system prompts, tool descriptions, skills, and target models. In the walkthrough the optimizer surfaced an 11 percent evaluator-score gain on a voice-enabled agent variant, which is what motivated the talk's framing.

The number itself is not the point. The pattern is: the same trace store that gives you observability is what bootstraps the eval set, and the eval set is what feeds the optimizer. The runtime trace, the runtime score, and the rollout gate are one closed loop. Microsoft is pitching that loop inside Foundry; the same loop exists inside Langfuse plus Inspect AI plus a custom rollout script if your stack is OSS. The vendor changes; the loop does not.

The sibling Build sessions are both worth a skim: BRK231 on reinforcement learning for production agents and BRK250 on open-source observability across frameworks. Together they triangulate where the vendor consensus is heading: runtime evals plus rollout gates plus closed-loop tuning, not a static CI suite.

What I would build first

If I were dropped into a team that has CI evals and zero runtime evals, here is the order I would build in. Each step has a measurable outcome before the next starts.

  1. Trace store. Enable OpenTelemetry export from the agent to Langfuse or Phoenix. Outcome: every production request is queryable by ID with full inputs, outputs, latency, and cost. Time: half a day.
  2. Manual labelling sprint. Two engineers spend two days labelling 100 production traces for the top failure mode. Outcome: a labelled set sized at Hamel Husain's "100+ labeled examples" calibration floor.
  3. Judge prompt v1. Write a single rubric for that failure mode. Iterate against the labelled set until human-judge agreement clears 90 percent. Outcome: a calibrated judge worth pointing at live traffic.
  4. Async runtime scoring. Run the judge on 5 percent of live traces. Write scores back to the trace store. Outcome: a daily drift dashboard for one failure mode, end-to-end.
  5. Rollout gate. Wire the same judge into a canary rollout script. Outcome: prompts and model swaps cannot ship without holding the score threshold on the canary slice.
  6. Scheduled drift cron. Nightly re-run of the labelled set against the live model. Outcome: vendor patches and corpus drift get caught the morning after they happen.

Steps 1 through 3 are the foundation; the team measurably moves from blind to instrumented. Steps 4 through 6 are the closed loop; the team measurably moves from instrumented to self-correcting. The whole sequence fits inside a two-week sprint for a four-person platform team. I have run it inside three weeks for a team of two; the calibration step is the part that compresses worst.

If you have run a version of this loop and disagree on the ordering, or if the SMB judge math broke for you above a certain QPS, I want to compare notes. The runtime-evals stack is the youngest part of the production-agent toolchain, and every team's first build looks slightly different from every other team's first build. That is where the lessons are.