
Copilot Studio Workflows is the spine LLM agents needed

Most agents shipped in 2025 are an LLM in a costume. A loop, a tool list, a prompt that says "you are helpful," and a prayer that the next token is good. The thing that has been missing, in every serious production deployment I have seen, is the boring layer underneath: a graph that remembers where it was, retries when the LLM blinks, and surfaces every state transition to an auditor. That layer just landed in Power Platform.

Key takeaways

  • Copilot Studio Workflows is not a new automation tool. It is Microsoft conceding that pure conversational agents are too non-deterministic to wire into a business process unchaperoned.
  • Five vendors now ship the same shape: Temporal, LangGraph, AWS Step Functions plus Bedrock AgentCore, Azure Durable Task for AI Agents, and now Copilot Studio Workflows. The deterministic spine around the LLM is now the consensus architecture, no longer a contested pattern.
  • The spine is also the audit surface. Every node transition is logged; an LLM-only agent that fans out tool calls is not. The governance argument is independent of the reliability one.
  • The thesis loses on emergent reasoning, on low-code version-control hygiene, and at the scale boundary where deterministic graphs start to feel like a coffin.
  • If you are designing agents in 2026 and the design has no named harness, you are building the unshipped 2024 version of someone else's already-retired prototype.

The launch of the redesigned Copilot Studio Workflows canvas is the most interesting governance event of the Power Platform 2026 wave 1, and almost nobody is reading it that way. The coverage I have seen treats it as another low-code automation surface: drag a node, connect a connector, watch the diagram fan out. That misses the load-bearing claim Microsoft is making underneath the pixels. The new canvas treats LLM calls and connector calls as peer primitives, and it places both inside a graph the platform owns, audits, and resumes. That is the entire move. The visual designer is a cover story for an architectural commitment.

This is not just a new editor. This is Microsoft, in writing, agreeing with the people who have spent two years arguing that conversational agents on their own do not survive contact with a real business process. Microsoft is now shipping the harness you built by hand in Temporal or LangGraph, because you had to, on the same surface where your finance team builds expense automation. It is the same lesson, paid in Copilot credits instead of Python. If the deterministic spine around an LLM was a contested pattern in 2024, the five-vendor convergence in 2026 is its consensus moment.

The pattern Microsoft just shipped

In Microsoft's own framing on the April 2026 Copilot blog, workflows are "step-by-step automation processes that complete actions or tasks in a deterministic, reliable way". Read that sentence the way an engineer reads a postmortem. The word "deterministic" is doing the work. It is the platform admitting, as plainly as Microsoft ever admits anything, that the conversational agent on its own is not. The same blog positions agents-inside-workflows as the answer to autonomous-agent unreliability. The fix is not a better model. The fix is a harness.

The redesigned canvas, now in public preview per Microsoft Learn, exposes a peer set of node types I have not seen Microsoft put on the same surface before: a Prompt node (a single LLM call), a Classify node (LLM-routed branching with few-shot examples), an Agent node (a full Copilot Studio agent as a sub-step), an M365 Copilot node (Graph-grounded generation), a Request for Information node (durable human-in-the-loop), and the traditional connector, loop, variable, and condition blocks. The community tutorial channel covering the public preview (full walkthrough on YouTube) shows the practical consequence: a single graph can route a shared mailbox by sentiment, delegate refunds to a policy-grounded agent, pause for a named approver, and fall through to a connector when the human says yes. None of those steps care what the others are. The graph is the contract.

Two structural details matter. First, the human review primitive is durable: the flow pauses, emails the approver, and resumes on a structured response. That used to be a custom Service Bus message and a queue trigger. Now it is a node. Second, the agent-as-node primitive means an agent embedded inside a workflow is just another step, with inputs, outputs, and a resumable boundary. The graph delegates reasoning to the agent at a prescribed step and reclaims control after. The agent is no longer the orchestrator. It is a tenant.

Why LLM-only agents drift

If you have shipped an LLM-only agent into production and watched it free-fall after week two, you already know the failure mode. The model is stochastic by construction. Temperature zero does not save you (I argued the long version of this in an earlier post): sampling determinism does not give you state determinism, and the second you put a tool loop around the call you have introduced an unbounded state machine with no backing store. The agent forgets where it was. The retry restarts from scratch. The auditor asks what happened and you hand them a chat log.

The deterministic spine pattern solves this with a single move: the orchestrator becomes deterministic code, the LLM becomes an activity, and the platform checkpoints every state transition. Temporal's own AI page argues this is necessary because naive implementations lose all progress and must restart from scratch on transient LLM errors. LangGraph models the same shape as a state graph of nodes and edges with persistent shared state and conditional routing. The graph is durable. The LLM is not. That asymmetry is the whole architecture.
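The asymmetry is easy to see in a toy version. The sketch below is not Temporal's or LangGraph's API; it is a hand-rolled runner, with hypothetical names throughout, that checkpoints after each step and skips completed steps on resume. The simulated LLM call fails once, and the second run picks up at the failed step instead of starting over.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], dict]]


def run_workflow(steps: list[Step], checkpoints: dict[str, dict], state: dict) -> dict:
    """Run steps in order, replaying any step that already has a checkpoint."""
    for name, fn in steps:
        if name in checkpoints:
            state = checkpoints[name]   # recorded result, the step is not re-executed
            continue
        state = fn(dict(state))         # may raise; earlier checkpoints survive
        checkpoints[name] = state
    return state


calls = {"fetch": 0, "llm": 0}


def fetch_ticket(state: dict) -> dict:
    calls["fetch"] += 1
    return {**state, "ticket": "My invoice is wrong"}


def classify_with_llm(state: dict) -> dict:
    calls["llm"] += 1
    if calls["llm"] == 1:
        raise TimeoutError("transient LLM error")  # simulated blink
    return {**state, "label": "billing"}


steps = [("fetch_ticket", fetch_ticket), ("classify", classify_with_llm)]
checkpoints: dict[str, dict] = {}

try:
    run_workflow(steps, checkpoints, {})
except TimeoutError:
    pass  # the worker "crashed"; the checkpoint dict stands in for durable storage

result = run_workflow(steps, checkpoints, {})  # resume from the last checkpoint
print(result["label"], calls)  # billing {'fetch': 1, 'llm': 2}
```

The fetch step ran once across both attempts. In a real engine the checkpoint store is durable and the replay rules are stricter, but the shape is the same.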

💡 The fix for non-deterministic LLM behavior is not a better model. It is a deterministic graph that treats the LLM as one node in a resumable state machine, with checkpoints on either side and an audit trail through the middle.

The case for the spine gets sharper when you add governance. An LLM-only agent that fans out tool calls is not auditable in any way that an enterprise compliance team will accept; the trace is a token stream, the state is in the prompt, and the next run will not produce the same shape. A workflow with the LLM as one node is auditable because the graph is the artifact: every transition has a timestamp, every input has a parent, every retry has a cause. The third-party recap on Help Net Security calls Copilot Studio "an AI agent control center", and the framing is right for the wrong reason. The spine is the control center. The governance UI is just the read view.
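Here is what "every transition has a timestamp, every input has a parent, every retry has a cause" could look like as a record shape. This is my own illustrative schema, not the format any of the platforms in this piece actually store; `TransitionEvent`, `EventLog`, and the node names are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TransitionEvent:
    event_id: int
    node: str
    parent_id: Optional[int]            # the event that produced this node's input
    attempt: int = 1
    retry_cause: Optional[str] = None   # required whenever attempt > 1
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EventLog:
    def __init__(self) -> None:
        self.events: list[TransitionEvent] = []

    def record(self, node: str, parent_id: Optional[int] = None,
               attempt: int = 1, retry_cause: Optional[str] = None) -> int:
        if attempt > 1 and retry_cause is None:
            raise ValueError("a retry must carry its cause")
        event = TransitionEvent(len(self.events), node, parent_id, attempt, retry_cause)
        self.events.append(event)
        return event.event_id

    def lineage(self, event_id: int) -> list[str]:
        """Walk parent links back to the root to answer 'what happened?'."""
        path = []
        current: Optional[int] = event_id
        while current is not None:
            event = self.events[current]
            path.append(event.node)
            current = event.parent_id
        return list(reversed(path))


log = EventLog()
received = log.record("receive_email")
log.record("classify_email", parent_id=received)  # first attempt, timed out
retried = log.record("classify_email", parent_id=received, attempt=2, retry_cause="timeout")
sent = log.record("send_reply", parent_id=retried)
print(log.lineage(sent))  # ['receive_email', 'classify_email', 'send_reply']
```

A chat log cannot answer the lineage question without someone reading it. A record like this answers it with a loop.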

The deterministic spine, by name

Once you see the pattern in one vendor, you cannot unsee it in the others. The fact that this is the same architecture under different SKUs is the load-bearing observation of this entire piece.

# The deterministic-spine pattern, schematically
spine: # owned by the platform, deterministic, checkpointed
  nodes:
    - id: classify_email
      kind: llm_call            # the only non-deterministic step in this slice
      retry: { max: 3, backoff: exponential }
      checkpoint_after: true
    - id: route_on_label
      kind: deterministic_branch
      branches: { billing: bill_path, technical: tech_path, sales: sales_path }
    - id: bill_path.draft_reply
      kind: agent_invocation    # sub-graph, still non-deterministic inside
      returns: { draft: string, requires_approval: bool, refund_amount: number }
    - id: bill_path.gate
      kind: human_review        # durable, may sleep for days
      cond: requires_approval == true
    - id: bill_path.send
      kind: connector_call      # deterministic, idempotent, side-effecting
trace: every_node_transition_to_event_store
resume: from_last_checkpoint_on_worker_crash
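The `retry: { max: 3, backoff: exponential }` line in the schematic can be sketched in a few lines of Python. I am assuming "max: 3" means three retries after the first attempt and a doubling delay; the schematic does not pin either down, and `call_with_retry` and `flaky_classifier` are hypothetical names. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], max_retries: int = 3, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying up to max_retries times with delays base, 2*base, 4*base."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the spine
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


attempts = []
delays = []


def flaky_classifier() -> str:
    """Simulated LLM call that fails twice, then answers."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("model did not answer")
    return "billing"


label = call_with_retry(flaky_classifier, sleep=delays.append)
print(label, delays)  # billing [0.5, 1.0]
```

In a real spine the retry policy belongs to the platform, not the node author, which is exactly why it shows up as configuration in the schematic rather than as code.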

The shape is the same in five places. Four of them came before Copilot Studio Workflows:

  • Temporal: workflow code is the spine, activities are the LLM and tool calls, and every state transition lives in the event history. Temporal measures roughly 10 to 50 milliseconds of overhead per activity dispatch for that persistence, which is negligible against LLM calls that take 1 to 30 seconds.
  • LangGraph: nodes and edges, persistent shared state, and durable execution that resumes on failure. LangChain's own product page frames it as the agent orchestration framework for reliable AI agents.
  • AWS: Step Functions for the rule-based spine, Bedrock AgentCore for the AI-native branch, recommended in combination per the AWS prescriptive guidance.
  • Azure: Durable Task for AI Agents on the Azure Functions runtime, with the same deterministic vs agent-directed taxonomy that Microsoft itself formalized in the agentic application patterns doc.

Copilot Studio Workflows is the fifth.

The Microsoft Tech Community write-up on building durable and deterministic multi-agent orchestrations is, in essence, the same essay as this one, written by Microsoft engineers about Microsoft's code-first stack. The pattern came from the durable-execution community first with Temporal, hit AWS and LangChain second, hit Azure third in code, and is now hitting Power Platform fourth in low-code. That puts five vendors on the same architecture. There is no obvious next move on the horizon. This is the consensus.

| Vendor | Spine | LLM node primitive | Durable resume | Audit surface |
|---|---|---|---|---|
| Temporal | Workflow code | Activity | Event history | Per activity |
| LangGraph | State graph | Node | Checkpointer | Per state transition |
| AWS Step Functions plus Bedrock | State machine | Task or agent invocation | Execution history | Per state |
| Azure Durable Task | Orchestrator function | Activity or agent | Checkpoint store | Per yield |
| Copilot Studio Workflows | Visual graph | Prompt, Classify, Agent, M365 node | Platform-managed | Per node, in MAC |

The visual designer is a hard interface decision and it has real cost (more on that below), but the architecture underneath it is the same architecture I have built in code three times in the last eighteen months. The Copilot Studio version trades flexibility for governance: you get fewer escape hatches, you get a managed event store, you get the Microsoft 365 admin surface for free, and you get the connector ecosystem of Power Platform as a side effect. The community comparison piece pegs agent flows and Power Automate cloud flows at roughly 98 percent overlap in capability, with the divergence concentrated in the AI-native node set and the Copilot-credit licensing. That two percent is the entire point.

There is one consequential migration caveat from the same write-up: cloud flow to agent flow conversion is one-way. If you commit to the spine, you commit to it. The reverse migration is not supported, which I read as Microsoft signalling the durable-task surface is the strategic destination and the connector-only cloud flow is the legacy path. The 2026 wave 1 release plan lists workflows, AI actions, and governance as the headline investments, in that order. The roadmap rhymes with the architecture.

The case study Microsoft is using to sell the move, Unifi (aviation ground handling), reduced contract processing from days to minutes by combining agents with deterministic workflows. I will note that the days-to-minutes number has no published baseline, no error rate, no per-run cost, and no end-to-end latency, which puts it in the same category as every other vendor case study from the last five years. Take it as directional, not quantitative. The numerical claim worth trusting in the primary-source set is the Temporal one: 10 to 50 milliseconds of overhead per activity dispatch, against LLM calls of 1 to 30 seconds. That ratio is the engineering case for the spine, in one number.
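The ratio is worth working out, because the two ranges bound it on both sides. The arithmetic below uses only the figures quoted above (10 to 50 milliseconds of dispatch overhead, 1 to 30 seconds per LLM call); the function name is mine.

```python
def overhead_share(overhead_ms: float, llm_call_s: float) -> float:
    """Persistence overhead as a fraction of the LLM call it wraps."""
    return (overhead_ms / 1000.0) / llm_call_s


worst = overhead_share(50, 1)   # slowest dispatch against the fastest LLM call
best = overhead_share(10, 30)   # fastest dispatch against the slowest LLM call
print(f"worst case: {worst:.1%}, best case: {best:.3%}")
# worst case: 5.0%, best case: 0.033%
```

Even in the least favorable pairing, durability costs about a twentieth of the call it protects.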

Where the thesis loses

I would not publish this argument without naming the places it falls down. The deterministic spine has three real failure modes, and pretending otherwise would be dishonest.

The first is the cap on emergent reasoning. The contrarian read, argued in the deepset blog, is that the value of an agent is its capacity for multi-hop reasoning that no graph author predicted. A pre-drawn spine restricts the decision space to the graph the author imagined. If your problem is the kind that benefits from emergent paths through the tool surface (open-ended research, novel debugging, creative composition), you want fewer rails, not more. The spine is the right answer for a customer-care mailbox. It is the wrong answer for a research agent, and I would not deploy a workflow graph against a problem whose shape I cannot enumerate up front. The deepset framing of "spectrum, not binary" is correct; the spine is one end of it.

The second is the low-code version-control story. A code-defined Temporal workflow lives in git, gets reviewed in a pull request, runs through your CI, deploys via your pipeline, and rolls back with a revert. A Copilot Studio workflow lives in the platform, gets reviewed in a dialog, runs through a publish button, deploys instantly, and rolls back through the version history UI. That last surface is real (Microsoft ships compare, preview, and restore on every workflow), but it is not the same primitive as git revert. Enterprise DevOps commentary has been consistent on this for a decade: low-code platforms sit in an awkward middle, too visual to handle real business logic, too code-heavy for the ops team to maintain, and the gap widens at scale. The audit surface is good. The change-control surface is weaker. Both are true at once.

The third is the scale boundary. The spine pattern is at its best when the graph fits on a screen and the team owns every node. Once a workflow grows past, conservatively, 30 to 40 nodes, the graph starts behaving like a coffin: every change risks an unintended branch, every new node argues for a new sub-graph, and the diagram becomes the documentation. Code does not have this problem because code has functions. Visual graphs solve it with sub-flow composition, but the discipline to compose well in a visual editor is not a discipline most teams have. The blueprint-first arXiv paper (2508.02721) is on the pro-spine side; the deepset critique is on the other. I think both are right within their range.

If you are picking between a code-defined spine (Temporal, LangGraph, Durable Task) and a visual spine (Copilot Studio Workflows), the deciding question is not the architecture. It is who owns the artifact: a platform team that lives in git, or a business team that lives in the Power Platform tenant. The architecture is the same. The owner is not.

What this means if you are building agents in 2026

If you are building agents this year and your design has no named harness, you are building the prototype version of something three vendors have already shipped a production version of. That sentence is sharper than it sounds.

  • Stop pitching pure conversational agents as the answer to multi-step business processes. They are a node in the answer, not the answer.
  • If you are on the Microsoft stack and your problem is enumerable, Copilot Studio Workflows is now the default. Agent-only is the exception you justify, not the assumption you start from.
  • If your problem is not enumerable, the spine is still the right shape, but you want the code-defined version (Temporal, LangGraph, Durable Task) so the graph can change at the speed of git.
  • Design every agent to return explicit boolean flags (requires_approval, needs_followup, is_terminal) so the spine can branch deterministically on agent output. Free-text responses to a control-flow question are a category error.
  • Treat the workflow event history as the system of record for agent behavior. The chat log is not. The trace is. This is the same lesson I wrote about for Copilot Studio integration patterns: the audit story has to live in the platform, not in the prompt.
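The boolean-flag advice can be sketched as a contract between the agent and the spine. This is a minimal illustration under my own assumptions: the agent is taken to return JSON, and the node names (`human_review`, `schedule_followup`, `send`, `continue_agent`) and the priority order between flags are hypothetical. The flag names are the ones from the list above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    draft: str
    requires_approval: bool
    needs_followup: bool
    is_terminal: bool


def parse_agent_output(raw: str) -> AgentResult:
    """Reject anything that is not the agreed structure; never guess at control flow."""
    data = json.loads(raw)
    for flag in ("requires_approval", "needs_followup", "is_terminal"):
        if not isinstance(data.get(flag), bool):
            raise ValueError(f"{flag} must be a JSON boolean")
    if not isinstance(data.get("draft"), str):
        raise ValueError("draft must be a string")
    return AgentResult(data["draft"], data["requires_approval"],
                       data["needs_followup"], data["is_terminal"])


def next_node(result: AgentResult) -> str:
    """Deterministic branch: the same flags always lead to the same node."""
    if result.requires_approval:
        return "human_review"
    if result.needs_followup:
        return "schedule_followup"
    return "send" if result.is_terminal else "continue_agent"


raw = ('{"draft": "Refund issued.", "requires_approval": true, '
       '"needs_followup": false, "is_terminal": true}')
print(next_node(parse_agent_output(raw)))  # human_review
```

A response like `"requires_approval": "probably"` fails at the parser instead of silently choosing a branch, which is the whole reason to insist on booleans.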

The deterministic spine is now table stakes. The argument for the next two years will not be whether to wrap the LLM in a graph. It will be which graph, owned by which team, with what escape hatches when the enumeration is wrong.

Open questions

I am still watching three things on this surface, and I expect at least one of them to change my framing.

The first is the boundary between Copilot Studio Workflows and the code-first Microsoft Agent Framework, which ships durable workflows on the same conceptual base. Microsoft is now running both a low-code and a code-first version of the same architecture. The interesting question is whether the low-code surface is a permanent product or a forcing function to get the durable-task pattern into the hands of the Power Platform install base. I expect the answer in the next two release waves.

The second is governance under fan-out. The MAC agent inventory and the workflow audit surface are both good on a per-agent and per-workflow basis. They are less obviously good when one workflow invokes ten agents that invoke ten workflows. The combinatorics matter; the audit story does not yet name them. I am watching the wave 2 governance docs for a hierarchical-trace primitive.

The third is the failure mode of the visual spine at scale. I have not yet seen a Copilot Studio workflow north of 50 nodes in production. The pattern works at the demo size; the question is whether the editor and the version-control story survive the size that real enterprise processes reach. If you are running one of those, I want to hear what you actually do with it.

I expect parts of this to age badly. If you think I am wrong about the boundary between Copilot Studio Workflows and the code-first agent frameworks, or if you are running a workflow graph past the 50-node boundary and the version-control story is holding up, the fastest way to change my mind is a concrete counterexample. Reach me at marcus-duwe.de/contact.