Production RL is finally cheap enough to close the agent loop
For two years, fixing a broken agent meant editing its prompt and praying. The cheaper move has just landed: let the agent watch itself in production, score what it did, and learn from the trace. Production reinforcement learning has crossed from research demo to a line item next to your CI bill.
Key takeaways
- Reinforcement fine-tuning on a small model now costs around a hundred dollars an hour of training, the cleanest public anchor for "finally affordable" so far.
- Your observability dashboard already contains the training data your agent needs; the new bottleneck is whether the reward signal you can extract from it is honest.
- Reward shaping is the moat. Algorithms are commodity; the verifier that scores what "good" means for one specific tenant is not.
- A 60-person SMB can run this loop on production traces inside its own tenant without shipping data to a third-party trainer.
- The thesis loses cleanly on tasks that are not gradable and on workflows where a sharper prompt evolution loop beats RL by an order of magnitude on sample efficiency.
In this article
- Why this loop just got cheap
- Observability is the new training data
- Reward shaping is where the moat moved to
- The 90-person logistics scenario
- Where the thesis loses
- How to set this up Monday
- Open questions and where to compare notes
When an agent ships to production and then degrades, the standard 2024 move was to open the prompt and edit it. The standard 2025 move was to add an eval and keep editing the prompt. The standard 2026 move, the one that landed quietly at Microsoft Build and inside OpenAI's billing pages, is to let the production traces themselves rewrite the model. Three things had to be true at once for that to work: training had to be cheap, traces had to be structured, and the reward signal had to be expressive enough to teach a model something its teacher did not already know. As of this quarter, all three are.
This post argues that the deploy, observe, learn loop for agents has crossed a threshold. It is not a future promise. The infrastructure is shipped, the prices are public, and the early production results are measurable. What is still up for grabs is the question of who owns the reward function, because that is where the durable advantage has moved.
Why this loop just got cheap
The honest anchor for "finally affordable" is OpenAI's public price for reinforcement fine-tuning. Per the OpenAI RFT billing guide, training costs "$100 per hour of wall-clock time spent in the core training loop for o4-mini-2025-04-16," prorated to the second. That is the first public, fully itemized RL fine-tuning price tag in the market. It is also the number that lets a finance lead at a 60-person company sign off without escalation.
Microsoft made the same bet from the platform side. Its Frontier Tuning announcement on the Microsoft 365 Developer Blog describes a managed Reinforcement Learning Environment that runs inside the customer's compliance boundary, and reports an internal deployment where "task completion jump from 13% to 87% after Frontier Tuning" while running "more than 10x more cost-efficient than GPT-5.5 on tasks like producing technical Microsoft documentation." Both numbers come from a single internal Microsoft case with no public reproduction; quote them as the vendor's own claim, not as benchmark truth.
The algorithmic side got cheaper independently. The DeepSeek-R1 paper, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, trains with GRPO, which drops PPO's critic network and computes group-relative advantages over a set of sampled outputs instead. Because the critic is typically a model of similar size to the policy, that single design choice removes a large share of the memory and compute bill for an RL run. Microsoft's BRK231 session showed the same family of algorithms (GRPO, PPO, DPO, curriculum learning) sitting behind a managed UI, so a practitioner picks the recipe and the platform owns the GPU orchestration.
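To make "group-relative advantages" concrete, here is a minimal sketch of the normalization step as I understand it from the GRPO description: each rollout's reward is compared with the mean and spread of the other rollouts sampled for the same prompt, so no learned critic is needed. The function name, the `eps` guard, and the use of the population standard deviation are my assumptions, not a reproduction of any trainer's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group of rollouts for one prompt.

    Assumption: population standard deviation plus a small eps so that a
    group with identical rewards yields zero advantages instead of dividing
    by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled for the same prompt, scored by a grader in [0, 1].
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Rollouts that beat their siblings get a positive advantage and the rest get a negative one, which is why a grader with real spread inside each group matters more than the absolute score level.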
💡 The "finally cheap enough" claim does not rest on Microsoft's internal numbers. It rests on a single line in OpenAI's billing docs. One hundred dollars an hour of training is the cleanest public anchor we have for the affordability story, and it is independently verifiable.
What you do Monday: open the OpenAI RFT developer guide, read the supported models and grader shape, and estimate the cost of one training hour against o4-mini for your highest-volume agent. If the number is smaller than the monthly inference savings from cutting your frontier model usage in half, you have a business case.
Observability is the new training data
The interesting part is not the price. It is that the same trace stream you already pay your observability vendor to collect is structurally identical to what an RL trainer wants as input. State, action, reward. Every agent trace platform now emits that triple, even if the marketing copy calls it something else.
Microsoft Learn's guide to continuously evaluating AI agents describes built-in evaluators for tool-call accuracy, task completion, groundedness, and relevance, sampled at configurable rates and surfaced in Azure Monitor. The Foundry observability concepts page defines a trace schema that maps cleanly to RL tuples. The vendor convergence is real: LangSmith, Langfuse, Arize Phoenix, and W&B Weave all ship structured trace exports that double as RL data sources.
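As an illustration of how a trace maps onto state, action, reward, here is a small sketch. The trace shape (`trace_id`, `steps`, `observation`, `action`, `score`) is a hypothetical schema I made up for the example; it is not the Foundry, Langfuse, or LangSmith export format, so expect to write an adapter for whichever platform you run. The sketch also assumes an outcome reward, where only the final step carries the evaluator's score.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Transition:
    trace_id: str
    state: str
    action: str
    reward: float


def trace_to_transitions(trace: dict[str, Any]) -> list[Transition]:
    """Flatten one graded trace into per-step (state, action, reward) tuples."""
    steps = trace["steps"]
    last_index = len(steps) - 1
    return [
        Transition(
            trace_id=trace["trace_id"],
            state=step["observation"],
            action=step["action"],
            # Outcome reward: intermediate steps get 0.0, the last step
            # gets the score the evaluator assigned to the whole trace.
            reward=float(trace["score"]) if i == last_index else 0.0,
        )
        for i, step in enumerate(steps)
    ]


example_trace = {
    "trace_id": "t-001",
    "score": 0.8,
    "steps": [
        {"observation": "Where is my parcel?", "action": "call:lookup_tracking"},
        {"observation": "tracking: in transit", "action": "reply:arrives Friday"},
    ],
}
transitions = trace_to_transitions(example_trace)
```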
This is the move that my earlier post on eval suites compounding flagged from one side. That post stopped at "grade the outcome, ship the next model." The next step, the one that closes the loop, is to feed those graded outcomes back as reward signal so the same model improves in place. The eval suite is the precursor to the reward function. If you have not built the first, you cannot build the second.
The post on Microsoft 365 shipping agent inventory but not observability named the gap from the admin surface. That gap is still real on the M365 admin side. The Foundry developer surface is a different SKU with different guarantees, and the Foundry side is what just closed.
What you do Monday: pull a week of agent traces from whatever observability platform you already run, and ask one question. For each trace, can a reviewer (human or LLM) assign a score with a real gradient between zero and one? If yes, you have an RL training set. If no, the work to do first is grader design, not model selection.
Reward shaping is where the moat moved to
Algorithms are commodity. GRPO is in the DeepSeek-R1 paper, DPO is open, PPO has been a textbook chapter for a decade. The Foundry low-level training API exposes them all as primitives. The grader is not commodity. The grader is your business logic.
Microsoft's BRK231 session walked through a retail customer-service agent with a weighted Python grader: 50% on decision accuracy (refund vs reject), 30% on dollar accuracy (the refund amount), and 20% on output format. The same task, scored differently, produces different models. A grader that only checks "did the answer match the expected string" cannot teach a model anything beyond surface match. A grader that scores partial credit on tool coverage, dollar accuracy, and downstream contract fit teaches a model how to reason about the business.
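A grader with that 50/30/20 split could look roughly like the sketch below. It is my reconstruction of the idea, not the code shown in the session: the JSON output contract, the field names `decision` and `refund_amount`, and the linear partial-credit rule for the amount are all assumptions.

```python
import json
from typing import Any

WEIGHTS = {"decision": 0.5, "amount": 0.3, "format": 0.2}
REQUIRED_KEYS = {"decision", "refund_amount"}


def amount_score(predicted: Any, expected: float) -> float:
    """Partial credit that decays linearly with relative error, floored at 0."""
    if isinstance(predicted, bool) or not isinstance(predicted, (int, float)):
        return 0.0
    scale = max(abs(expected), 1.0)
    return max(0.0, 1.0 - abs(predicted - expected) / scale)


def grade_refund(raw_output: str, expected: dict[str, Any]) -> float:
    """Weighted score in [0, 1] for one customer-service decision."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns nothing
    if not isinstance(parsed, dict):
        return 0.0
    scores = {
        "format": float(REQUIRED_KEYS.issubset(parsed)),
        "decision": float(parsed.get("decision") == expected["decision"]),
        "amount": amount_score(parsed.get("refund_amount"), expected["refund_amount"]),
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


score = grade_refund(
    '{"decision": "refund", "refund_amount": 30.0}',
    {"decision": "refund", "refund_amount": 40.0},
)
```

In the example the decision and format are right and the amount is off by a quarter, so the rollout lands between a clean hit and a miss. That in-between score is the signal a string-match grader cannot produce.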
This is also where the loop most often breaks. The BRK231 walkthrough names the failure mode explicitly: the model learned to stop calling tools because the grader penalized wrong tool calls more harshly than it rewarded right ones. The fix was to add explicit tool-coverage scoring to the grader and to monitor tool-call frequency as a first-class telemetry signal. Reward hacking is the dominant failure mode of production RL, and it is downstream of grader design.
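Watching tool-call frequency can start as a very small check. The sketch below compares the average number of tool calls per rollout before and after a training run and flags a sharp drop; the rollout shape (`tool_calls` as a list) and the 50% threshold are hypothetical and should be tuned to your own agent.

```python
from typing import Any


def tool_call_rate(rollouts: list[dict[str, Any]]) -> float:
    """Average number of tool calls per rollout (0.0 for an empty batch)."""
    if not rollouts:
        return 0.0
    return sum(len(r["tool_calls"]) for r in rollouts) / len(rollouts)


def tool_use_collapsed(
    baseline: list[dict[str, Any]],
    candidate: list[dict[str, Any]],
    max_relative_drop: float = 0.5,
) -> bool:
    """True when the candidate calls tools far less often than the baseline."""
    base_rate = tool_call_rate(baseline)
    if base_rate == 0.0:
        return False  # nothing to collapse from
    return tool_call_rate(candidate) < base_rate * (1.0 - max_relative_drop)


baseline_rollouts = [{"tool_calls": ["a", "b"]}, {"tool_calls": ["a"]}]
candidate_rollouts = [{"tool_calls": []}, {"tool_calls": ["a"]}]
alarm = tool_use_collapsed(baseline_rollouts, candidate_rollouts)
```

A check like this does not say why the rate dropped, only that the model changed its behavior in a direction the grader may be rewarding by accident.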
The contrarian thread is worth taking seriously. The GEPA paper, Reflective Prompt Evolution Can Outperform Reinforcement Learning, an ICLR 2026 Oral, reports "up to 35 times greater sample efficiency compared to reinforcement learning methods" for adapting modular LLM workflows. If GEPA generalizes, the moat is prompt evolution, not reward shaping. My read: GEPA wins decisively on workflows that are mostly modular and mostly prompt-shaped. Production agents that span tool calls, durable workflow steps, and verifiable outcomes are not that shape, and that is where the RL loop pays off. Both can be true.
Constitutional AI is the second partial substitute. The Constitutional AI: Harmlessness from AI Feedback paper argues RLAIF "is better than using Reinforcement Learning from Human Feedback" for harmlessness training. The open-weights replication Constitution or Collapse found cases where the constitutional loop collapses on Llama 3-8B. Reward sourced from an AI judge is cheaper than human labels and sometimes better; it is also a known failure surface. Pick one, instrument it, and watch the collapse modes.
What you do Monday: sit with one product owner for 30 minutes and write a single weighted grader in Python for the agent that costs you the most in failures. Weight the business outcome at 50%, the tool coverage at 30%, the format at 20%. Run it offline against last week's traces. If the score distribution is binary (everything is 0 or 1), the grader is broken before any training starts.
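The binary-distribution check is easy to automate once the offline scores exist. This is a minimal sketch under my own assumptions: scores live in [0, 1], and "binary" means fewer than 5% of them fall strictly between the two ends. Both the function name and the threshold are illustrative.

```python
def is_effectively_binary(
    scores: list[float],
    min_interior_share: float = 0.05,
    tol: float = 1e-9,
) -> bool:
    """True when almost every score sits at 0 or 1, so there is no gradient."""
    if not scores:
        raise ValueError("need at least one score")
    interior = [s for s in scores if tol < s < 1.0 - tol]
    return len(interior) / len(scores) < min_interior_share


# A grader that only ever returns 0 or 1 fails the check...
broken = is_effectively_binary([0.0, 1.0, 1.0, 0.0])
# ...while one that hands out partial credit passes it.
healthy = is_effectively_binary([0.0, 0.4, 0.85, 1.0])
```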
The 90-person logistics scenario
Here is the scenario sized for a German Mittelstand reader, because the F500 framing buries the practical move. The data lead at a 90-person logistics company runs a customer-service agent that handles roughly 800 to 1,200 shipment-status inquiries a day. The agent runs on a frontier model at roughly 20 euro cents per resolved ticket. Over a 30-day month that comes to around 4,800 to 7,200 euros, climbing as agent adoption inside the company spreads to ops and finance.
The data lead's tool stack is realistic for that headcount. Logs in Postgres. Traces in Langfuse self-hosted, because the DPA was easier than a cloud trace vendor. A small Temporal cluster running the durable workflow steps around the LLM calls, because durable execution emits a clean RL trace tagged by workflow_id. Fine-tuning via OpenAI RFT against the Langfuse trace export, scored by a Python grader checked into the same repo as the agent code.
Monday decision: replace the frontier model on the customer-service path with o4-mini, fine-tuned via RFT on the last 30 days of traces. The grader scores three things: was the shipment status answered correctly (60%), did the agent call the tracking and ETA tools in the right order (25%), did the response fit the company's tone guide (15%). One training hour at one hundred dollars, plus grader-model token cost, plus 30 minutes of the data lead's time to set up the dataset export.
Measurable outcome targets, not promised numbers: hold resolution accuracy constant on the grader's 60% slice, aim for an inference cost drop in the 60 to 80% band as traffic shifts to the smaller fine-tuned model, and aim for p95 latency on the answered turn to land near two seconds. The reference case for that shape is Discovery Bank: Microsoft's Frontier Tuning blog reports that the bank cut its agent latency from six seconds to one and a half on its banking app. The data lead does not need to match those numbers. They need to beat the current spend by enough to justify a quarterly RL maintenance run.
The reason a 90-person company can do this in 2026 and not in 2025 is that the trace stream is already structured, the grader is a Python file, the platform manages the GPUs, and the bill is small enough to approve without a steering committee.
# grader.py: weighted Python grader for the logistics RFT run
# Drop this into your RFT job config; the trainer will call it per rollout.
from typing import Any


def grade(rollout: dict[str, Any], ground_truth: dict[str, Any]) -> float:
    # 1) business outcome: did we answer the shipment status correctly?
    answer_ok = float(rollout["final_status"] == ground_truth["status"])

    # 2) tool coverage: ETA + tracking lookup in the right order
    tools = [c["name"] for c in rollout["tool_calls"]]
    expected = ["lookup_tracking", "lookup_eta"]
    tool_ok = float(tools[:2] == expected)

    # 3) tone fit, graded by a cheap LLM judge with a yes/no rubric
    tone_ok = float(rollout["llm_judge"]["tone_pass"])

    # Weighted sum. Leave headroom for partial credit; no binary 0/1 trap.
    return 0.60 * answer_ok + 0.25 * tool_ok + 0.15 * tone_ok
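The tool check in the grader above is all-or-nothing: the first two calls either match the expected pair or they do not. If you want partial credit on that 25% slice, one option is to score the fraction of expected tools that appear in order. This is a sketch of that idea and an assumption on my part, not part of the grader as written; note that it is also more lenient, because it tolerates unrelated calls in between.

```python
def tool_order_score(called: list[str], expected: list[str]) -> float:
    """Fraction of expected tools found in `called`, in the expected order."""
    if not expected:
        return 1.0
    remaining = iter(called)
    matched = 0
    for name in expected:
        # Scan forward through the calls not yet consumed; stop at the
        # first expected tool that never shows up after the previous match.
        if any(call == name for call in remaining):
            matched += 1
        else:
            break
    return matched / len(expected)


expected_tools = ["lookup_tracking", "lookup_eta"]
full = tool_order_score(["lookup_tracking", "lookup_eta"], expected_tools)
half = tool_order_score(["lookup_eta", "lookup_tracking"], expected_tools)
```

Swapping this in for the binary check gives a model that called the tracking tool but skipped the ETA lookup a different score from one that called nothing at all.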
Where the thesis loses
The thesis loses cleanly in three places, and I would rather name them than pretend otherwise.
First, on tasks that are not gradable. RL needs a verifiable outcome with headroom for improvement. If you cannot write a grader that produces a real gradient between zero and one, you cannot train. Creative writing under a vague brand voice, open-ended brainstorming, and most "be helpful" chat tasks fall in this bucket. For those, supervised fine-tuning on curated examples and prompt evolution are the right tools.
Second, on modular prompt-shaped workflows. The GEPA result is real. If the agent is mostly a chain of prompt-driven steps without tool side effects, reflective prompt evolution can match or beat RL with up to 35x fewer samples. I do not see GEPA as a refutation of the production RL thesis; I see it as a sharper tool for a narrower shape of problem. Use it where it fits.
Third, on reward hacking. Every team I have seen run a real RL loop has hit a reward-hacking incident by run three. The model finds a path that scores well on the grader and is obviously wrong to a human reader. Independent reviews of OpenAI's RFT walk through the realistic cost-benefit envelope, including the engineering time spent on grader iteration after a hacking incident. The platform makes the training cheap; the grader iteration is the labor cost.
There is also a tenant-scoping risk that is easy to underestimate. If the reward model is trained on observability traces from a single tenant, the agent will optimize for that tenant's idiosyncratic noise rather than the general goal. Microsoft's pitch that Frontier Tuning runs inside the tenant boundary is also where this failure mode comes from: the tighter the boundary, the easier it is for the loop to overfit. Sample beyond your loudest customers.
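One simple way to sample beyond the loudest customers is to cap how many traces any single customer can contribute to the training set. The sketch below assumes each trace carries a `customer_id` field, which is a hypothetical name; the cap and the seed are placeholders.

```python
import random
from collections import defaultdict
from typing import Any


def cap_per_customer(
    traces: list[dict[str, Any]],
    max_per_customer: int,
    seed: int = 0,
) -> list[dict[str, Any]]:
    """Keep at most `max_per_customer` randomly chosen traces per customer."""
    rng = random.Random(seed)  # fixed seed keeps the training set reproducible
    by_customer: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        by_customer[trace["customer_id"]].append(trace)

    sample: list[dict[str, Any]] = []
    for customer_id in sorted(by_customer):
        group = by_customer[customer_id]
        if len(group) > max_per_customer:
            group = rng.sample(group, max_per_customer)
        sample.extend(group)
    return sample


traces = [{"customer_id": "a", "n": i} for i in range(10)]
traces += [{"customer_id": "b", "n": i} for i in range(2)]
balanced = cap_per_customer(traces, max_per_customer=3)
```

Capping is a blunt instrument. It limits how much one noisy account can dominate the set, but it does nothing about a whole tenant that is unrepresentative, which is the harder version of the problem.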
How to set this up Monday
Five concrete moves, in order. None of them require a data scientist on staff.
- Pick one agent. The one whose failures cost you the most in money or trust. Not three agents, one.
- Pull the last 30 days of traces from your observability platform. Strip PII, deduplicate by trace hash, keep at least 1,000 rows. The Foundry continuous evaluation docs describe the schema if you are on Azure; Langfuse, LangSmith, and Arize Phoenix expose equivalents.
- Write a weighted Python grader. 50% business outcome, 30% tool coverage, 20% format or downstream contract fit. Run it offline against the trace set before you spend a euro on training. If the score distribution is binary, fix the grader before you train.
- Launch one RFT or SFT run on the smallest viable target model. OpenAI's RFT on o4-mini for verifiable tasks; supervised fine-tuning of GPT-4.1 mini for distillation when the teacher is already correct often enough. The OpenAI RFT developer guide has the dataset shape; Foundry has the equivalent UI per BRK231.
- Compare on the same grader. Promote only if the fine-tuned model beats the teacher on the weighted score and does not regress on the default evaluators (intent resolution, task completion). Keep trace capture on so the next run has data.
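Two of the moves above, deduplicating by trace hash and gating promotion on the same grader, can be sketched in a few lines. Everything specific here is an assumption: the `input` and `output` field names, hashing only those two fields, and a zero-tolerance rule for the default evaluators. PII stripping is deliberately left out because it depends on your data.

```python
import hashlib
import json
from statistics import mean
from typing import Any


def trace_hash(trace: dict[str, Any], keys: tuple[str, ...] = ("input", "output")) -> str:
    """Stable SHA-256 over the chosen fields, independent of key order."""
    payload = {key: trace.get(key) for key in keys}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(traces: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first trace seen for each hash, preserving order."""
    seen: set[str] = set()
    unique = []
    for trace in traces:
        digest = trace_hash(trace)
        if digest not in seen:
            seen.add(digest)
            unique.append(trace)
    return unique


def should_promote(
    candidate_scores: list[float],
    teacher_scores: list[float],
    candidate_evals: dict[str, float],
    teacher_evals: dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Promote only if the weighted score improves and no evaluator regresses."""
    beats_teacher = mean(candidate_scores) > mean(teacher_scores)
    no_regression = all(
        candidate_evals.get(name, 0.0) >= value - tolerance
        for name, value in teacher_evals.items()
    )
    return beats_teacher and no_regression
```

Both score lists should come from the same held-out traces and the same grader version; otherwise the comparison measures the grader change rather than the model change.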
If you want to skip ahead on durable execution, Maxim Fateev's WorkOS interview on Temporal is the cleanest argument for why replayable workflows are the substrate that makes the RL trace honest. The infrastructure layer matters more than the model choice in week one.
Open questions and where to compare notes
Three threads I am still watching, and I expect at least one of these to update in the next quarter.
The first is whether the RLAIF substitute holds up at small scale. The Constitution or Collapse paper shows the constitutional loop collapsing on Llama 3-8B in some cases. If you cannot afford human reward labels and the AI judge collapses, the loop is stuck. I expect more open replications to land here.
The second is durable execution as the trace standard. Temporal raised a $300M Series D at a $5B valuation in February 2026, and Inngest and Restate are moving in the same direction. If durable execution becomes the default substrate, the trace schema becomes a de facto standard and the observability vendors converge with the durable execution vendors. That convergence is partly visible today; it is not done.
The third is whether Frontier Tuning's in-tenant Reinforcement Learning Environment hits public GA on time, and whether the GA pricing matches the OpenAI anchor. If Microsoft prices below the per-hour anchor at GA, the affordability story sharpens further.
If you are running a production agent at a 20 to 200 person company and you are sizing the RL loop for the first time, I would compare notes. The grader design is the part where my opinions are strongest and the part where every team I talk to underestimates the engineering. Drop a line via the contact page with the shape of the agent, the grader you are considering, and the failure mode that hurts most. The frame I keep coming back to, from my earlier post on Copilot Studio Workflows as the deterministic spine, is that the LLM step belongs inside a harness that owns the rules. The RL loop is the harness's next move. That is the part worth getting right.