Pick the durable runtime before your second agent ships
Most production agents I see at twenty-to-two-hundred-person companies are a Python script that calls an LLM in a loop, kicked off by cron, with results written to a database row. I have shipped that exact shape myself, more than once, and it works the day you ship it. The next time the server reboots mid-call, the next time two of them run at once, or the next time a tool call times out and silently drops the work, you find out you have been running a distributed system without owning one. The fix is not a smarter prompt. It is the boring runtime layer underneath the agent, the one that journals every step, replays on crash, and tells you which call lost the work.
Key takeaways
- The picking decision for production agents is the durable execution runtime, not the model, not the framework, not the prompt.
- SMB teams hit classic distributed-systems failures the moment a second instance ships: partial failure, lost state, duplicate side effects, no retries.
- Six runtimes now solve the same problem with the same journal-and-replay pattern, priced from free Postgres to ~$100 per month managed.
- DBOS first for Postgres-native teams, or Restate if you would rather run one self-hosted binary; Temporal Cloud Essentials at ~$100 per month if a managed contract is cheaper than ops.
- The thesis loses on true single-shot stateless calls and at pure POC stage, where a cron loop is honestly enough.
In this article
- What actually breaks in your while-True loop
- The runtime layer Microsoft and Anyscale were pointing at
- Six runtimes, one pattern, very different bills
- A 60-person logistics firm picks on Monday
- Where the thesis loses
- How this fits the framework and observability decisions
What actually breaks in your while-True loop
The first agent in production is almost always a thin script. Open the OpenRouter SDK, call a model, parse the response, hand the result to the next tool, write a row to Postgres, sleep, repeat. There is no journal of what step the agent was on when the EC2 instance got recycled at 03:00. There is no idempotency key on the outbound email the agent sent twice. There is no retry budget on the third-party API that 502'd in the middle of a tool call. The agent is not broken. The runtime around the agent does not exist.
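To make the missing idempotency key concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `send_once`, the in-memory `_sent_keys` set, and the `outbox` list stand in for a real email client and a unique-constrained database column. It only illustrates the idea of deriving a stable key per side effect and refusing to repeat it.

```python
import hashlib

# Hypothetical stand-in for a durable table of keys that were already sent.
# In production this would be a unique-constrained column, not a Python set.
_sent_keys: set[str] = set()

# Hypothetical stand-in for the email provider: records what actually went out.
outbox: list[tuple[str, str]] = []


def idempotency_key(run_id: str, step: str, recipient: str) -> str:
    """Derive a stable key from the values that identify one side effect."""
    raw = f"{run_id}:{step}:{recipient}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def send_once(run_id: str, recipient: str, body: str) -> bool:
    """Send the email unless this exact side effect was already performed."""
    key = idempotency_key(run_id, "email_quote", recipient)
    if key in _sent_keys:
        return False  # a retry of the same run: skip the duplicate send
    outbox.append((recipient, body))  # placeholder for the real email call
    _sent_keys.add(key)
    return True
```

Note the gap this sketch leaves open: a crash between the send and the key write still produces a duplicate on retry. Closing that gap reliably is the kind of bookkeeping a durable runtime takes off your hands.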
Twenty years ago we already had the vocabulary for this. Partial failure: half a workflow ran, half did not. Lost state: the in-memory dict that held "step three of seven" is gone. Exactly-once side effects: did we charge the customer or not. The trick that durable execution runtimes use is a journal. Every step is recorded in a durable log as it runs, and on crash the runtime replays the log, skipping the steps that already completed, to bring the workflow back to the exact state it was in. Temporal, Restate, DBOS, Azure Durable Functions, AWS Step Functions, and Lambda Durable Functions all do this. They use different storage, different APIs, different bills, but the pattern is the same.
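The journal idea fits in a few lines of plain Python. This is a toy sketch, not how any of the named runtimes is implemented: `Journal`, `durable_step`, and `workflow` are hypothetical names, the log lives in memory instead of on disk, and it assumes each step's result is recorded once the step completes. It shows the one property that matters: on the second run, the completed step is read back from the journal instead of being executed again.

```python
import json
from typing import Any, Callable


class Journal:
    """Toy append-only journal. A real runtime persists this durably."""

    def __init__(self) -> None:
        self.entries: list[str] = []  # one serialized record per finished step

    def lookup(self, step: str) -> tuple[bool, Any]:
        """Return (True, result) if the step already completed."""
        for line in self.entries:
            record = json.loads(line)
            if record["step"] == step:
                return True, record["result"]
        return False, None

    def append(self, step: str, result: Any) -> None:
        self.entries.append(json.dumps({"step": step, "result": result}))


def durable_step(journal: Journal, step: str, fn: Callable[[], Any]) -> Any:
    """Run fn once per journal; on replay, return the recorded result."""
    done, result = journal.lookup(step)
    if done:
        return result
    result = fn()
    journal.append(step, result)
    return result


calls = {"plan": 0, "execute": 0}  # counts real executions, for illustration


def _plan() -> str:
    calls["plan"] += 1
    return "plan-v1"


def workflow(journal: Journal, crash_before_execute: bool) -> str:
    plan = durable_step(journal, "plan", _plan)
    if crash_before_execute:
        raise RuntimeError("simulated crash")

    def _execute() -> str:
        calls["execute"] += 1
        return f"done:{plan}"

    return durable_step(journal, "execute", _execute)


journal = Journal()
try:
    workflow(journal, crash_before_execute=True)
except RuntimeError:
    pass  # the process "died" after the plan step was journaled
result = workflow(journal, crash_before_execute=False)  # replay from the journal
```

After the replay, `calls["plan"]` is still 1: the plan step ran once, and the second run picked its result up from the journal.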
💡 The model is interchangeable behind an API call. The runtime your agent crashes inside of is not.
Monday move: open the codebase you ship agents from and grep for while True, time.sleep, and cron entries that invoke a Python file. Every match is a candidate workflow that should be a durable step, not a loop.
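If plain grep feels too blunt, the same audit can be sketched as a short Python script. The function names and the regexes are my own assumptions about what counts as a match; they will miss variants such as `while 1:` or an aliased `sleep` import, so treat the output as a starting list, not a complete inventory.

```python
import re
from pathlib import Path

# Assumed patterns for "this is probably a hand-rolled loop".
PATTERNS = {
    "while-true loop": re.compile(r"^\s*while\s+True\s*:"),
    "time.sleep call": re.compile(r"\btime\.sleep\s*\("),
}


def scan_source(text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for every suspicious line in one file."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under root and keep only files with hits."""
    results = {}
    for path in Path(root).rglob("*.py"):
        hits = scan_source(path.read_text(encoding="utf-8", errors="ignore"))
        if hits:
            results[str(path)] = hits
    return results


def cron_python_jobs(crontab_text: str) -> list[str]:
    """Return non-comment crontab lines that invoke a .py file."""
    jobs = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if re.search(r"\S+\.py\b", stripped):
            jobs.append(stripped)
    return jobs
```

Call `scan_tree(".")` from the repository root, and feed the output of `crontab -l` into `cron_python_jobs` to list the scheduled entry points.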
The runtime layer Microsoft and Anyscale were pointing at
At Microsoft Build 2026, session BRK227 was a fireside between Mark Russinovich (Azure CTO) and Ion Stoica (Anyscale co-founder, Berkeley professor, the person behind Spark, Ray, and vLLM). They spent forty-five minutes talking past most of the audience about KV caches and bulk-synchronous processing. The bit worth pulling out for SMB teams was buried in the middle: classical distributed systems were designed around human speed and human-scale reliability, and agentic workloads break those assumptions because they run at machine speed, hit non-deterministic failure modes, and need state across long-lived steps. Their fix at Azure is the Foundry runtime: hosted agents that survive container crashes and redeploys with state persistence across turns. The session abstract, the Foundry announcement, and the runtime detail post are all linked in Resources below.
Foundry is not the only answer. It is one row in a table of six. Anyscale ships its own runtime on top of Ray. Temporal ships Temporal Cloud. Restate ships a single binary. DBOS ships a Python library that uses the Postgres you already run. The interesting fact is the convergence: five vendors, one pattern. The interesting decision is which one fits a company of your size and stack.
Monday move: if you are already on Azure and your second agent is ten weeks away, put Foundry hosted agents on the evaluation list now rather than waiting for GA. If you are not on Azure, scratch it and move to the comparison table below.
Six runtimes, one pattern, very different bills
Strip out the marketing and these runtimes do the same job. The differences that matter to a twenty-to-two-hundred-person team are price floor, storage you already operate, language ergonomics, and how loud the ops bill gets.
| Runtime | What you operate | SMB price floor | Best fit |
|---|---|---|---|
| Temporal Cloud Essentials | Nothing; managed | ~$100 per month, 1M Actions | Teams that want a managed contract and SLA, willing to learn workflow + activity semantics |
| Restate | One self-hosted binary | Free OSS; cloud preview | Python or TypeScript teams who want HTTP-shaped durable RPC, not a workflow DSL |
| DBOS | Library on your existing Postgres | Free | Teams already on Postgres who do not want a second datastore |
| Azure Durable Functions | Azure Functions consumption plan | First 1M execs free per month | Teams already on Azure with sub-second steps and bursty load |
| AWS Lambda Durable Functions | Lambda only | Free tier covers most | Teams who want Step-Functions-shaped flows without Step Functions pricing |
| Foundry hosted agents | Nothing; managed | Azure billing; GA early July 2026 | Azure-native teams shipping multi-session agents with state across turns |
The pricing matters. Temporal Cloud Essentials is one hundred dollars per month for one million Actions, one gigabyte active storage, forty gigabytes retained, and a 99.9 percent SLA, per Temporal pricing. Azure Durable Functions on the consumption plan is twenty cents per million executions and includes the first million per month, per the Azure Functions pricing page, with the load-bearing caveat from the Durable Functions billing docs that every orchestrator replay counts as a separate billable invocation. AWS Step Functions Standard charges twenty-five dollars per million transitions. The article "Lambda Durable Functions, when you don't need Step Functions" makes the contrarian case that an eight-state approval workflow at ten thousand runs per month (eighty thousand transitions) costs $2.00 in Step Functions versus $0.00 of waiting-time charges in Lambda Durable Functions.
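The arithmetic behind those figures is short enough to check yourself. The sketch below uses only the list prices quoted above; the function names are hypothetical, the replay count is an input you have to measure on your own workload, and it deliberately ignores Step Functions free tiers and Azure execution-time charges, so read it as a back-of-envelope check, not a bill.

```python
from decimal import Decimal

# List prices as quoted in this article.
STEP_FUNCTIONS_PER_TRANSITION = Decimal("25") / Decimal("1000000")
AZURE_PER_EXECUTION = Decimal("0.20") / Decimal("1000000")
AZURE_FREE_EXECUTIONS = 1_000_000  # first million executions per month


def step_functions_cost(states_per_run: int, runs_per_month: int) -> Decimal:
    """Monthly Standard-workflow cost, counting one transition per state."""
    transitions = states_per_run * runs_per_month
    return transitions * STEP_FUNCTIONS_PER_TRANSITION


def azure_durable_cost(activity_executions: int, orchestrator_replays: int) -> Decimal:
    """Monthly execution charge, with every replay billed as an invocation."""
    total = activity_executions + orchestrator_replays
    billable = max(0, total - AZURE_FREE_EXECUTIONS)
    return billable * AZURE_PER_EXECUTION


approval_workflow = step_functions_cost(states_per_run=8, runs_per_month=10_000)
```

The replay term is the one to watch on Azure: an orchestrator that replays several times per run multiplies the invocation count without adding any visible work.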
The Ray side of the conversation pulls a different lever. The canonical scale claim from Ray: A Distributed Framework for Emerging AI Applications is that Ray schedules millions of tasks per second at millisecond-level latency. That number is real and load-bearing for frontier training, but it is not the number an SMB shipping its second agent should be optimizing for. Optimize for "the workflow does not silently lose state when the container reboots." That bar is satisfied by any of the six.
Here is what the same step looks like in three runtimes. Read it and notice that the model call is identical; only the runtime contract around it changes:
```python
# Temporal: a workflow + activity, deterministic replay on crash
from datetime import timedelta

from temporalio import activity, workflow

# `openrouter` is a placeholder for your LLM client in all three examples.


@activity.defn
async def call_llm(prompt: str) -> str:
    return await openrouter.complete(prompt)


@workflow.defn
class ResearchAgent:
    @workflow.run
    async def run(self, topic: str) -> str:
        # Each activity result is recorded; on replay it is not re-executed.
        plan = await workflow.execute_activity(
            call_llm,
            f"plan research for {topic}",
            start_to_close_timeout=timedelta(minutes=2),
        )
        return await workflow.execute_activity(
            call_llm,
            f"execute: {plan}",
            start_to_close_timeout=timedelta(minutes=10),
        )
```

```python
# DBOS: a Python decorator over the Postgres you already run
from dbos import DBOS


@DBOS.step()
def call_llm(prompt: str) -> str:
    return openrouter.complete(prompt)


@DBOS.workflow()
def research_agent(topic: str) -> str:
    # Each step's output is checkpointed in Postgres before the next one runs.
    plan = call_llm(f"plan research for {topic}")
    return call_llm(f"execute: {plan}")
```

```python
# Restate: a service over HTTP, durable RPC, no DSL
from restate import Workflow, WorkflowContext

agent = Workflow("ResearchAgent")


@agent.main()
async def run(ctx: WorkflowContext, topic: str) -> str:
    # ctx.run journals the result of each named step.
    plan = await ctx.run("plan", lambda: openrouter.complete(f"plan: {topic}"))
    return await ctx.run("execute", lambda: openrouter.complete(f"execute: {plan}"))
```
Monday move: open the runtime docs of the one whose storage you already operate. If you run Postgres in production, that is DBOS. If you run on Azure, that is Durable Functions or Foundry. If you run on AWS, that is Lambda Durable Functions. If you run on bare Python and want a managed contract, that is Temporal Cloud Essentials. Read for ninety minutes, then port one existing cron-driven agent over as a spike. Ninety minutes is the entire budget.
A 60-person logistics firm picks on Monday
Concrete example. A sixty-person logistics company in Hamburg ships freight quotes. The ops lead, an engineer-by-background who runs both the data team and the IT contractor, built an agent in April that scrapes carrier rates, queries the internal Postgres for current capacity, asks Claude via OpenRouter for a quote, and emails the customer. It runs every fifteen minutes from a cron entry on a single VPS. It mostly works. Then May: the VPS reboots for a kernel patch at 02:17, three quotes go missing, two get emailed twice because the cron tick that was mid-call retried from scratch when the box came back. The ops lead spends six hours on incident response and writes a Notion doc titled "we need observability."
What the ops lead actually needs is durable execution. The Monday decision is a forced choice between two free options and one paid option. DBOS is free, runs as a Python library on the existing Postgres the carrier-rates agent already writes to, and converts the cron loop into a workflow with one decorator. Restate is free, runs as a single self-hosted binary alongside the Python app, and shapes each step as durable RPC. Temporal Cloud Essentials is one hundred dollars per month and trades the ops cost for an SLA. The ops lead picks DBOS because the Postgres is already there and the team is one developer.
The measurable outcome the ops lead writes into the Notion doc the same day, as a hypothesis to test: percent of agent runs that complete end-to-end versus silently drop. May baseline observed in the incident review: ~97 percent (three lost, ninety-seven shipped of one hundred runs that week). Target two weeks after DBOS lands, written as the ops lead's bet, not an industry benchmark: ~99.9 percent, with the missing 0.1 percent being intentional aborts the runtime logs and surfaces. If the number does not move, the choice was wrong, and the next runtime to try is Restate. I would write the same numbers down in the same Notion doc.
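The metric itself is one division, which is part of its appeal. A minimal sketch, with hypothetical function names and the run counts from the incident review as the example input:

```python
def completion_rate(completed: int, total: int) -> float:
    """Percent of agent runs that finished end-to-end."""
    if total <= 0:
        raise ValueError("total runs must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return 100.0 * completed / total


def meets_target(completed: int, total: int, target_percent: float = 99.9) -> bool:
    """True when the observed rate reaches the target written in the doc."""
    return completion_rate(completed, total) >= target_percent


# The week from the incident review: ninety-seven shipped of one hundred runs.
baseline = completion_rate(completed=97, total=100)
baseline_ok = meets_target(completed=97, total=100)
```

In practice the two counts would come from whatever table the agent already writes to; the point is that the same query runs before and after the runtime lands.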
Monday move: write the metric down first. Pick the runtime second. The metric is the only thing that tells you which runtime to keep.
Where the thesis loses
This argument has a soft underbelly and it is worth naming. The "pick a durable runtime before your second agent" rule loses in two places.
First, true single-shot stateless calls. If the agent is a one-pass function that takes input, calls a model, returns output, and never writes anywhere else, a cron loop is honestly enough. The runtime layer adds operational cost (one more thing to upgrade, one more thing to monitor) without removing any failure modes that matter. The honest threshold is "the moment a workflow has two steps and a side effect between them," not "the moment you have an agent."
Second, pure POC stage. If you are still in week one of figuring out whether the agent is worth building, adding a runtime first is premature optimization. The right move is to keep the loop, ship to one user, learn the actual failure modes, then pick the runtime when the second user shows up. The cost of a wasted weekend on Restate is higher than the cost of a wasted weekend on a script that turns out to solve the wrong problem.
The rule reapplies the moment the agent ships to a second concurrent caller, or grows a second step with a side effect between them, or starts running on a schedule longer than the developer's attention span. That is the line. Cross it without a runtime and the rewrite cost compounds; stay behind it and the rewrite is a bill you never have to pay.
Monday move: name the threshold for your own agent in writing today. "We adopt a durable runtime when [condition]." A written threshold turns the runtime decision from a vibe into a trigger.
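One way to keep the written threshold honest is to encode it as a check you can run against each agent. The sketch below is an assumption about how to phrase the three conditions from this section; the `AgentProfile` fields and the function name are hypothetical, and your own threshold may well differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    steps: int
    side_effect_between_steps: bool
    concurrent_callers: int
    runs_unattended_on_schedule: bool


def reasons_to_adopt_runtime(agent: AgentProfile) -> list[str]:
    """Return every threshold condition the agent has crossed."""
    reasons = []
    if agent.steps >= 2 and agent.side_effect_between_steps:
        reasons.append("two or more steps with a side effect between them")
    if agent.concurrent_callers >= 2:
        reasons.append("a second concurrent caller")
    if agent.runs_unattended_on_schedule:
        reasons.append("an unattended schedule")
    return reasons


# A single-shot stateless call: no condition is met, the cron loop is enough.
one_shot = AgentProfile(1, False, 1, False)

# A multi-step agent on a cron tick that emails customers between steps.
quote_agent = AgentProfile(4, True, 1, True)
```

An empty list means the loop stays. Anything else is the trigger, with the reason already written down.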
How this fits the framework and observability decisions
A previous post here, Copilot Studio Workflows is the spine LLM agents needed, argued that five vendors converged on the same deterministic-spine pattern inside their respective products. This post is the layer below that one. The spine post says "the product you pick should have a deterministic spine." This post says "the runtime you pick is the deterministic spine, and you pick it on Monday with a price tag and a storage backend, not at architecture-review time."
The other sibling worth pointing at is Observability outlives your agent framework, which argued that the OpenTelemetry GenAI semconv is the five-year decision underneath the six-month framework choice. The runtime decision is the same shape: it outlives the framework. You will swap LangGraph for Microsoft Agent Framework, then swap that for whatever ships at Build 2027. You will not swap the Postgres your durable workflows journal to, or the Temporal namespace your Actions roll up against. Pick the spine, pick the observability schema, then let the framework above them be a swappable detail.
Monday move: open a one-page doc titled "irreversible layers." Write two rows: the durable runtime, the observability schema. Pick one option for each by Friday. Everything above them is reversible; treat it that way.
Compare notes
If you are at a twenty-to-two-hundred-person company shipping a second agent in the next quarter, I would like to hear which runtime you picked and what the storage backend looks like. Especially interested in DBOS-on-existing-Postgres stories, because that is the path I keep recommending and the one with the least published evidence. Get in touch if you want to compare notes on the picking decision before you sign the contract.
Resources
- BRK227: Distributed systems to AI platforms with Russinovich and Stoica - Microsoft Build 2026 session abstract.
- Foundry hosted agents Build 2026 post - sandbox-per-session, framework-agnostic runtime detail.
- Temporal Cloud pricing docs - canonical Actions billing definition, $0.00005 per Action.
- Restate durable agents docs - journal-and-replay pattern for agents over HTTP.
- DBOS: Why Postgres is enough for durable execution - single-transaction exactly-once argument.
- AWS Step Functions pricing - Standard $0.025 per 1k transitions; Express alternative.
- AWS Step Functions Standard vs Express explainer - 50-100ms transition latency claim.
- AWS Lambda Durable Functions launch (InfoQ) - re:Invent 2025 announcement, pause and resume up to one year.
- Anyscale Runtime announcement - 2x, 6x, 10x perf claims atop Ray.
- LangGraph Platform pricing - $0.001 per node managed; self-host free up to 100k nodes per month.
- Inngest pricing - 50k executions per month free; Pro from $75 per month.
- HN discussion: durable Python workflows - contrarian thread on Temporal self-host complexity.