Stop routing every agent step through a frontier LLM

Most teams shipping AI agents in 2025 sent every single step, every tool call, every tiny classification, to the biggest model they could afford. That worked when there were three models and the price gap between them looked like rounding error. In 2026 the gap is two orders of magnitude, the small models can actually reason, and the bill at the end of the month is the thing your CFO reads first.

The position I am defending here is unfashionable in vendor decks, and it is exactly how our pipelines are being rewritten in private. Orchestrating a planner LLM plus a fleet of 4B to 8B specialists beats one frontier model per step on cost, latency, and steerability, in every workload that clears a routing-accuracy floor. The pattern is no longer just a Microsoft or an NVIDIA talking point. Microsoft Foundry, AWS Bedrock, Anthropic, OpenRouter, and the academic ancestors all converged on it within an 18-month window. The hard part now is not picking the pattern. It is hitting the routing-accuracy floor where the math actually works.

Key takeaways

  • Specialist orchestration is now cross-vendor convergent: Foundry, Bedrock, Anthropic, OpenRouter all ship variations of planner-plus-workers.
  • A planner LLM that delegates tool calls to a 4B specialist costs cents where a single frontier call costs dollars, when routing accuracy holds.
  • Below roughly 0.85 routing accuracy, the savings vanish; you pay the small model and then pay the frontier anyway.
  • Nemotron Nano 4B, Haiku 4.5, Mistral Small, and Phi-4 are now purpose-trained for tool use, not generic chat shrunk down.
  • The bottleneck is no longer model quality; it is eval coverage of the router, and most teams have not budgeted for it.

The economics that broke the one-frontier-per-step pattern

A frontier model on every step is a habit, not an architecture. It is what we all wired up because Sonnet was cheap, Opus was special, and the latency was tolerable when the agent only made three calls per task. None of those conditions hold in a 2026 production agent. A real agent loop now fires 30 to 80 model calls per task: read email, classify intent, pick a tool, format a JSON arg, parse a response, summarize, decide next step, repeat. Most of those calls are not reasoning. They are routing decisions, format conversions, and small classifications that an 8B model handles at parity.

The current price spread tells you everything. Claude Haiku 4.5 lists at $1 input / $5 output per million tokens; Sonnet 4.6 at $3 / $15; Opus 4.7 at $5 / $25, per BenchLM's April 2026 pricing breakdown. Llama 3.3 70B on Groq lists at $0.59 / $0.79 per the Groq pricing page. DeepSeek V3.2 is another order of magnitude down, per the AI Pricing Guru 2026 comparison. A planner that routes 80% of steps to the $1 / $5 model and reserves 20% for the $5 / $25 model is not a cost optimization. It is the difference between a feature you can ship and a feature your finance lead kills in the quarterly review.
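
To make the spread concrete, here is a minimal sketch of the blended-cost arithmetic for an 80/20 split, using the Haiku 4.5 and Opus 4.7 list prices quoted above. The token counts per step are hypothetical, and the sketch ignores the cost of the routing call itself, so treat the result as an upper bound on the saving rather than a forecast.

```python
# List prices in USD per million tokens (input, output), as quoted in the text.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.7": (5.00, 25.00),
}


def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost(small: str, large: str, small_share: float,
                 input_tokens: int, output_tokens: int) -> float:
    """Expected USD cost per call when small_share of calls go to the small model."""
    small_cost = cost_per_call(small, input_tokens, output_tokens)
    large_cost = cost_per_call(large, input_tokens, output_tokens)
    return small_share * small_cost + (1 - small_share) * large_cost


# Hypothetical agent step: 2,000 input tokens and 300 output tokens.
all_opus = cost_per_call("opus-4.7", 2000, 300)
mixed = blended_cost("haiku-4.5", "opus-4.7", 0.80, 2000, 300)
saving = 1 - mixed / all_opus
print(f"all Opus: ${all_opus:.4f}  80/20 split: ${mixed:.4f}  saving: {saving:.0%}")
```

Because Haiku lists at one fifth of Opus on both input and output, the saving in this sketch does not depend on the token mix; with other model pairs it will.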

AWS published numbers on this in production. The Bedrock team reports that in one shipped deployment, "average 63.6% cost savings because of a higher percentage (87%) of prompts being routed to Claude 3.5 Haiku while still maintaining the baseline accuracy with the larger / more expensive model (Sonnet 3.5 v2)", per the AWS ML blog. That is not a vendor brochure number from a hypothetical agent. That is an 87/13 split where the customer paid the small model on almost every step and the big model only when the router refused to commit.

Anthropic's own field testing reports a more conservative but more telling number: "using Opus as an advisor with Sonnet or Haiku as executors achieves an 11% cost reduction and a 2% improvement on benchmark scores", per the advisor-pattern writeup quoting Anthropic. Cost down, quality up. That is the shape of an architecture that is mispriced relative to what it returns, which is the precise condition that produces a market-wide migration.

What a specialist actually is in 2026

Specialist does not mean "small model fine-tuned for your domain" any more. In 2024 that was the only thing it could mean, because the open small models were generic chat shrunk down. In 2026 there is a class of 4B to 8B models that were trained from scratch for tool use, structured output, and reasoning traces, not as a downsized assistant.

NVIDIA's Nemotron Nano 4B is the cleanest example. It is a hybrid Mamba-Transformer trained for "agentic reasoning" with BFCL V4, TerminalBench, and SWE-Bench as the headline evals per the Nemotron 3 Nano paper. The tool-use lineage goes back to a dedicated RL line described in Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. The model card targets Jetson Thor, GeForce RTX, and DGX Spark as edge platforms, which tells you what NVIDIA expects you to do with it. You run it where the data lives, not where the audit log lives.

Anthropic positions Haiku 4.5 as the executor tier in the advisor pattern: Opus does the planning, Haiku does the work. Mistral has done the same with Mistral Small. Microsoft has its own Phi family. The point is not vendor allegiance. The point is that the bottom rung of the model ladder is no longer a generic chat model that pretends to be useful. It is a workhorse with a tool-calling head and a reasoning trace it was actually rewarded for producing.

What you do Monday: open your usage dashboard, sort agent calls by per-call cost, identify the cluster that is "format JSON" or "classify intent" or "extract these three fields", and replace those calls with Haiku 4.5 or a hosted small model behind a thin adapter. Do not architect; just move the cheapest 30% of calls down a tier. The savings show up in the next billing cycle and prove the rest of the migration to whoever signs the cheque.
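
A minimal sketch of that audit, assuming you can export your call log as rows with a step label and a per-call cost. The field names (`step`, `cost_usd`) and the sample rows are hypothetical; your dashboard export will look different.

```python
from collections import Counter

# Hypothetical export of agent calls: one row per model call.
calls = [
    {"step": "plan", "cost_usd": 0.050},
    {"step": "classify_intent", "cost_usd": 0.004},
    {"step": "format_json", "cost_usd": 0.003},
    {"step": "summarize", "cost_usd": 0.020},
    {"step": "classify_intent", "cost_usd": 0.005},
    {"step": "extract_fields", "cost_usd": 0.006},
    {"step": "plan", "cost_usd": 0.045},
    {"step": "format_json", "cost_usd": 0.002},
    {"step": "summarize", "cost_usd": 0.018},
    {"step": "hard_reasoning", "cost_usd": 0.080},
]


def cheapest_calls(rows, share_pct=30):
    """Return the cheapest share_pct percent of calls, sorted by per-call cost."""
    ordered = sorted(rows, key=lambda row: row["cost_usd"])
    cutoff = len(ordered) * share_pct // 100  # integer math avoids float rounding
    return ordered[:cutoff]


def steps_to_move(rows, share_pct=30):
    """Count which step labels dominate the cheapest slice of calls."""
    return Counter(row["step"] for row in cheapest_calls(rows, share_pct))


print(steps_to_move(calls))
```

The step labels that dominate the cheapest slice are the candidates to move down a tier first.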

Replaying the BRKSP94 walkthrough as a pattern, not a product

The Microsoft Build 2026 session BRKSP94, "Orchestrate special agents with NVIDIA Nemotron models on Foundry", available on the MicrosoftDeveloper YouTube channel, is worth replaying because it shows the orchestrator-and-workers pattern in production-shaped infrastructure, not in a research notebook. The setup: a Hermes agent harness on Microsoft Foundry hosted agents, with Nemotron Super doing the heavy reasoning, Nemotron Nano doing the latency-sensitive tool calls, and the Foundry managed toolbox brokering Outlook, GitHub, Teams, MongoDB, and Cosmos DB through a single permission surface.

The demo task was a four-person product launch on a two-week deadline. A senior lead delegated a feature-request email to the agent. The agent read the inbox, opened the right repo, made the code change, and opened a PR. The lead corrected the PR (add reviewer, add docstrings), and the correction was persisted as a reusable skill via an internal skill_manage function, which other team members then inherited. None of that is novel as an idea. The Microsoft Research paper Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks shipped the same orchestrator-WebSurfer-FileSurfer-Coder pattern 18 months earlier. What is new is that the planner-plus-specialist arrangement now lives behind enterprise identity (Entra ID), enterprise audit traces, and a single tool gateway. That is the part that was missing in 2024 and that determined whether the architecture left the demo stage.

Cross-vendor: AWS Bedrock ships Intelligent Prompt Routing where you "configure your own router by selecting any two models from the same model family" with "no additional charge for the routing feature itself". Microsoft Foundry ships model router concepts with model-subset selection so you cap the router's choice set. Anthropic's advisor pattern is the same idea phrased as a system prompt convention. Three vendor accents, one architectural decision.

💡 The talk is one citation. The pattern is the citation graph. If three vendors and the academic paper all picked the same shape inside 18 months, the question is not whether to adopt it. The question is which routing-accuracy floor your workload sits on.

A minimal router config sketch, in pseudo-Foundry YAML so you can read it without context. This is the shape, not a copy-paste; the field names live in the Foundry Agent Service overview docs:

router:
  planner:
    model: gpt-4.1
    role: "Decide which worker handles this step. Output JSON {worker, args}."
  workers:
    - name: classify_intent
      model: claude-haiku-4-5
      ceiling_tokens: 200
    - name: format_tool_call
      model: nemotron-nano-4b
      ceiling_tokens: 400
    - name: hard_reasoning
      model: claude-sonnet-4-6
      ceiling_tokens: 4000
  fallback:
    when: router_confidence < 0.85
    model: claude-sonnet-4-6

What you do Monday: in your existing LangGraph or Bedrock agent, add a single classifier node before the model call. Have it output {worker, confidence}. If confidence drops below 0.85, fall through to the frontier model and log the fallback. After a week of logs you have a real distribution to argue with.
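
Once the log exists, a rough way to read the distribution is a text histogram of router confidences plus the share of calls that fell through. This is a sketch on made-up numbers; the `logged` list is hypothetical and stands in for whatever your classifier node wrote out.

```python
from collections import Counter

FALLBACK_THRESHOLD = 0.85  # same cut-off the classifier node uses


def confidence_histogram(confidences):
    """Count confidences per 0.1-wide bin; key 8 means the range [0.8, 0.9)."""
    bins = Counter(min(int(c * 10), 9) for c in confidences)
    return dict(sorted(bins.items()))


def fallback_rate(confidences, threshold=FALLBACK_THRESHOLD):
    """Share of routing decisions that fell through to the frontier model."""
    return sum(c < threshold for c in confidences) / len(confidences)


# Hypothetical week of logged router confidences.
logged = [0.95, 0.91, 0.88, 0.62, 0.97, 0.84, 0.99, 0.73, 0.90, 0.93]

for bin_index, count in confidence_histogram(logged).items():
    print(f"{bin_index / 10:.1f}-{(bin_index + 1) / 10:.1f} {'#' * count}")
print(f"fallback rate: {fallback_rate(logged):.0%}")
```

A histogram piled up just under the threshold suggests the cut-off, not the router, is what needs tuning; a long low-confidence tail suggests the router prompt or the worker taxonomy needs work.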

Where this thesis loses

Here is my read. This is the paragraph the vendor decks skip. The whole architecture relies on the router being right. When the router is wrong, you pay the small model, get a bad answer, retry on the frontier, and pay both. The math inverts. The paper Evaluating Small Language Models for Front-Door Routing bounds this directly: "the accuracy prerequisite (>= 0.85) is not yet met for small language models, bounding the gap at 6 to 8 percentage points". Until your router consistently hits that 0.85 floor, specialist orchestration costs more than just calling Sonnet every step, because every routing miss is a double-billed retry plus the latency of two round trips.

Three honest failure modes worth naming. First, long-context tasks: when the prompt is 40k tokens of context and the answer depends on a synthesis the 4B model cannot hold, the planner concedes and you are back on the frontier model anyway. Specialist orchestration silently loses to the monolith on these. Second, eval explosion: the router itself becomes a new component you have to evaluate, version, and regression-test. Most teams do not have a router eval set, they have a model eval set. The two are not the same and the gap is where production accuracy quietly drifts. Third, debugging: when a multi-worker chain produces a bad answer, you have to localize the failure across three or four model boundaries with different temperature settings and different prompts. The observability traces Foundry and Bedrock ship are a real help here, but the cognitive load on the engineer goes up, not down.

If your workload is below the 0.85 floor, my honest read is: stay with the monolith until your router eval set is real. Specialist orchestration on a bad router is worse than no orchestration. This is also why the sibling post "models depreciate, eval suites compound" argues that the eval set is the asset; the router-eval set is now another row in that ledger. The router eval is the cost most cost-savings posts forget to subtract.

The SME rollout pattern for a 90-person company

Specialist orchestration is a Mittelstand-friendly architecture, not a hyperscaler-only one. Imagine the data and AI lead at a 90-person logistics company in Hamburg, running Microsoft 365 across the office, a small Azure tenant for analytics, and a single AI-feature project on the roadmap: an inbox triage agent that classifies inbound shipping notifications and either auto-confirms, flags for the dispatcher, or escalates to a human. Today that agent runs entirely on Sonnet 4.6 because the team copied a tutorial. Spend is roughly 800 EUR a month on a workload that processes 4000 emails a week. Fine for a pilot. Bad for a roadmap that wants to ship two more agents this year.

The Monday plan: keep Sonnet as the planner and the escalation handler. Move the classification step (label one of 12 shipment types) and the JSON formatting step (write the dispatcher row) to Haiku 4.5. Add a router node that fires Sonnet on confidence < 0.85. Wire the whole thing into a Logic Apps connector for the existing Outlook flow, because that is the only piece of infrastructure the IT team already owns and accepts. Log every routing decision into a single SQL table for the next four weeks. Compute three numbers at the end of the month: percentage of steps that went to Haiku, percentage of routing fallbacks, and total spend. The expected outcome based on the cross-vendor evidence above is a 40 to 60% spend reduction, no measurable accuracy loss on the classification step, and a routing-fallback rate that tells the lead whether agent #2 and agent #3 can run the same pattern or whether the router needs more work first.
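
The three end-of-month numbers fall out of one query over that log table. Below is a minimal sketch using an in-memory SQLite database; the table name, column names, and sample rows are all assumptions made for illustration, not a schema from any of the products mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE routing_log (
           step TEXT, model_used TEXT, confidence REAL,
           fell_back INTEGER, cost_eur REAL)"""
)

# Hypothetical logged routing decisions.
rows = [
    ("classify", "haiku", 0.97, 0, 0.002),
    ("format", "haiku", 0.93, 0, 0.002),
    ("classify", "sonnet", 0.61, 1, 0.015),
    ("format", "haiku", 0.90, 0, 0.003),
    ("classify", "haiku", 0.95, 0, 0.002),
]
conn.executemany("INSERT INTO routing_log VALUES (?, ?, ?, ?, ?)", rows)


def monthly_numbers(connection):
    """Return (percent of steps on Haiku, percent of fallbacks, total spend in EUR)."""
    return connection.execute(
        """SELECT 100.0 * SUM(model_used = 'haiku') / COUNT(*),
                  100.0 * SUM(fell_back) / COUNT(*),
                  SUM(cost_eur)
           FROM routing_log"""
    ).fetchone()


haiku_pct, fallback_pct, total_eur = monthly_numbers(conn)
print(f"Haiku: {haiku_pct:.0f}%  fallbacks: {fallback_pct:.0f}%  spend: {total_eur:.3f} EUR")
```

In a real rollout you would add a timestamp column and filter by month; it is left out here to keep the sketch short.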

The role I care about here is not the AI lead. It is the dispatcher. If the dispatcher's day gets quieter without obvious mistakes hitting customers, the architecture earned its place. If the dispatcher starts seeing weird routing decisions land in the queue, the routing fallback rate will already show it in the logs, and the planner ceiling is the immediate knob. The sibling post "Copilot Studio Workflows is the spine LLM agents needed" makes the complementary point: a deterministic workflow holds the control flow, and inside each leaf you choose the model. Specialist orchestration is what happens inside the leaf; the deterministic spine is what catches the failures.

What you do Monday

Five concrete steps, all real for a 20 to 200 person team this week. Each pairs a how-it-works claim with the move that proves or kills it.

  1. Audit your highest-volume agent steps. Pull the last 30 days of model calls, sort by call count, identify the cluster that is "format", "classify", or "extract". That is the 30% you can move down a tier today.
  2. Add a router node, not a router system. A single Haiku call that outputs {worker, confidence} is enough to start. Do not buy a routing platform until you can read your own confidence histogram.
  3. Pin a model-subset. On Foundry, configure the model router to a two-model subset (Haiku + Sonnet, or Nano + Super). On Bedrock, configure intelligent routing for a single model family. Subset selection is the only reliable cost cap.
  4. Build the router eval set first. Carve out 200 labelled routing decisions before you ship. The model eval and the router eval are not the same artifact. Without the router eval, you cannot defend the 0.85 floor and you cannot detect drift.
  5. Set a fallback that escalates, not silences. When router confidence < 0.85, call the frontier model and log the fallback. Silently passing low-confidence calls to the small model is how a cost win becomes a quality regression six weeks later.
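
Step 4 can start very small. The sketch below scores a router against labelled decisions and lists which confusions account for the misses. The `keyword_router` is a deliberately crude stand-in so the example runs without an API key, and the six examples are invented; in practice you would pass your real router call and your 200 labelled rows. Note that routing accuracy measured this way is a different quantity from the per-call confidence the router reports, so a 0.85 confidence threshold does not by itself tell you whether you clear a 0.85 accuracy floor.

```python
from collections import Counter

ACCURACY_FLOOR = 0.85

# Hypothetical labelled routing decisions: input text and the worker a human chose.
eval_set = [
    {"input": "Format this as JSON: name=Ann", "expected": "format"},
    {"input": "Is this email a complaint or a question?", "expected": "classify"},
    {"input": "Plan the migration of our billing service", "expected": "hard"},
    {"input": "Extract the invoice number as JSON", "expected": "format"},
    {"input": "Classify this ticket by urgency", "expected": "classify"},
    {"input": "Decide whether this JSON schema change breaks clients", "expected": "hard"},
]


def keyword_router(text: str) -> str:
    """Stand-in router for the sketch; replace with your real routing call."""
    lowered = text.lower()
    if "json" in lowered:
        return "format"
    if "classify" in lowered or "is this" in lowered:
        return "classify"
    return "hard"


def evaluate_router(router, examples):
    """Return (accuracy, Counter of (expected, predicted) pairs for the misses)."""
    misses = Counter()
    for example in examples:
        predicted = router(example["input"])
        if predicted != example["expected"]:
            misses[(example["expected"], predicted)] += 1
    accuracy = 1 - sum(misses.values()) / len(examples)
    return accuracy, misses


accuracy, misses = evaluate_router(keyword_router, eval_set)
print(f"routing accuracy: {accuracy:.2f}  clears floor: {accuracy >= ACCURACY_FLOOR}")
print("misses (expected, predicted):", dict(misses))
```

Rerunning the same script whenever the router prompt or model changes is the cheapest form of drift detection.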

A short LangGraph stub that captures the pattern, written against the Anthropic SDK with two model identifiers and a confidence parser:

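import json
from typing import TypedDict

import anthropic
from langgraph.graph import StateGraph, END

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SMALL = "claude-haiku-4-5"
FRONTIER = "claude-sonnet-4-6"
CONFIDENCE_FLOOR = 0.85

class AgentState(TypedDict, total=False):
    input: str
    worker: str
    confidence: float
    output: str
    model_used: str

def parse_decision(text):
    # Malformed router output counts as zero confidence, which forces the fallback.
    try:
        data = json.loads(text)
        return {"worker": data["worker"], "confidence": float(data["confidence"])}
    except (ValueError, KeyError, TypeError):
        return {"worker": "hard", "confidence": 0.0}

def route(state):
    # The router is a single small-model call that names a worker and a confidence.
    msg = client.messages.create(
        model=SMALL,
        max_tokens=200,
        system='Pick worker in {classify, format, hard}. '
               'Output only JSON: {"worker": "...", "confidence": <0.0 to 1.0>}.',
        messages=[{"role": "user", "content": state["input"]}],
    )
    return parse_decision(msg.content[0].text)

def worker(state):
    # Escalate to the frontier model on hard steps or low router confidence.
    escalate = state["worker"] == "hard" or state["confidence"] < CONFIDENCE_FLOOR
    model = FRONTIER if escalate else SMALL
    msg = client.messages.create(model=model, max_tokens=2000,
        messages=[{"role": "user", "content": state["input"]}])
    # model_used is the field to log; it is what makes the fallback rate measurable.
    return {"output": msg.content[0].text, "model_used": model}

g = StateGraph(AgentState)
g.add_node("route", route)
g.add_node("work", worker)
g.set_entry_point("route")
g.add_edge("route", "work")
g.add_edge("work", END)
agent = g.compile()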

That is not a finished system. It is the smallest thing that emits the data you need to defend or kill the architecture in your context. The hard work is the labelled router-eval set you build alongside it.

Open questions and how to get in touch

I am still watching three things and expect this post to update inside six months. First, whether the 0.85 routing-accuracy floor moves. The front-door routing benchmark is a 2025 number on a specific eval; a 2026 retraining on tool-call traces could push the floor down to 0.75 and change the math for marginal workloads. Second, whether the Nemotron-on-Foundry per-token pricing actually lands cheaper than Haiku on equivalent tasks. I have not seen a clean primary source on Foundry's Nemotron line items yet and refuse to guess the number; if the per-call price comes in above Haiku, the specialist pattern keeps the latency edge but loses the cost edge for non-edge workloads. Third, whether the routing-eval discipline becomes a paid product or stays a build-your-own artifact. If a vendor ships a real router eval harness, the migration accelerates for the Mittelstand teams that do not have ML engineers on staff. If not, the eval cost stays a hidden tax on this architecture and the savings projections in every vendor deck need a footnote.

If you are sitting on a multi-thousand-euro monthly Sonnet bill and trying to decide whether to migrate to a planner-plus-workers shape, I would happily compare notes. The interesting variable is your router-eval set, not the model lineup. The teams that have one are running specialist orchestration in production and emitting the cost-savings numbers above. The teams that do not are right to wait. If you want to think through which side of that line your workload sits on, send a note via the contact section with one paragraph on your current agent shape and your current monthly spend. I will reply with the two or three numbers I would measure first.