Replit Agent: what multi-step orchestration looks like when it ships (cases)

The thing most agent demos skip

Most public demonstrations of agentic AI systems show a single sweep: a model is given a task, it plans, it executes, the task is done. What the demo does not show is the failure path: the sub-task that returns an error halfway through a sequence, the external service that times out, the file-system state that is different from what the planner assumed. Orchestrating an agent system in production means deciding, in advance, what happens in each of those cases. Replit's Agent feature, shipped across 2024 and described in detail in public changelog posts, engineering blog entries, and founder commentary, offers one of the more honest accounts of those decisions, because the product ran in a live coding environment where mistakes had immediate, visible consequences.

This case is not a teardown of Replit's strategy or an assessment of whether the company will win the agentic coding market. It is an account of the specific orchestration problems that Replit's team had to solve when they moved from "AI suggests code" to "AI executes a plan across multiple steps," and what product choices they made at each one.

The setup: from autocomplete to agent

Replit had been in the AI coding-tool business since 2022, starting with Ghostwriter — a Copilot-adjacent autocomplete feature embedded in its online IDE. By early 2024, the competitive pressure from Cursor, GitHub Copilot, and the rapid proliferation of Claude-in-the-IDE integrations had shifted the market question from "does the AI complete lines well" to "does the AI do whole tasks." The unit of comparison was no longer the suggestion; it was the workflow.

Replit's position in that market had a specific shape. Unlike Cursor or Copilot, Replit owned the runtime. Users on Replit were not editing files that they would later run locally — they were building and running code inside Replit's own cloud environment. This meant Replit could give its agent something that Cursor or Copilot extensions running on a local machine could not: a sandboxed execution environment with known boundaries that the agent could actually use. The agent could install packages, run tests, query the filesystem, start a server, and read its output — all within a container Replit controlled.

The architectural choice was consequential. An agent with real execution capability is not the same product as an agent that suggests code. When the suggestion is wrong, the user revises it. When the execution is wrong, the user has to undo it — and depending on what the agent did, undoing it may not be straightforward. The product design question shifted from "how do we make suggestions accurate" to "how do we make execution safe enough to trust."

The planning layer: tasks as a product surface

Replit Agent's first user-visible product decision was to make the plan visible before execution started. Rather than accepting a natural-language prompt and immediately beginning to act, the agent would produce a numbered task list — a breakdown of what it intended to do — and present it to the user for review before taking any steps.

This is a small UX choice with a large functional consequence. A plan displayed to the user before execution serves three purposes simultaneously.

First, it is a trust primitive. A user who can read the plan before the agent acts has a check on the agent's interpretation. If the agent has misunderstood the prompt — if "set up a contact form" was understood as "install a full CRM integration" — the plan surfaces that misreading before any files change. The cost of correcting a misunderstanding at the plan stage is a revised prompt. The cost of correcting it after three files have changed is a debugging session.

Second, it is an accountability surface. When something goes wrong after the agent has started, the plan gives the user a reference point. "The agent was supposed to do step 3 next, but instead it did X" is a report that requires a plan as a precondition. Without a visible plan, failure reports are hard to route: was it the model's reasoning, the execution environment, the task breakdown, or the prompt? The plan creates a chain of evidence.

Third, it is a scope-control mechanism. The plan's task count is a rough proxy for how much the agent is about to do. A five-task plan that finishes as a five-task plan has stayed within the scope the user approved. A five-task plan that expands to twenty tasks mid-execution has gone off-script. Making the initial task count visible gives the user a baseline expectation to compare against.
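
The scope-control idea can be sketched in a few lines. This is an illustration only, not Replit's implementation: the `Plan` class, the `needs_recheck` helper, and the 1.5x threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """A plan as shown to the user; approved_count is the baseline they reviewed."""
    tasks: list[str]
    approved_count: int = field(init=False)

    def __post_init__(self) -> None:
        if not self.tasks:
            raise ValueError("a plan needs at least one task")
        # Freeze the task count at review time so later growth can be measured.
        self.approved_count = len(self.tasks)

    def add_task(self, description: str) -> None:
        # Tasks added mid-execution grow the plan past the approved baseline.
        self.tasks.append(description)

    def drift_ratio(self) -> float:
        """Current task count relative to the count the user approved."""
        return len(self.tasks) / self.approved_count


def needs_recheck(plan: Plan, max_ratio: float = 1.5) -> bool:
    # Hypothetical policy: growth past 1.5x the approved plan goes back to the user.
    return plan.drift_ratio() > max_ratio
```

The design point is that the baseline is recorded once, at review time, so "the plan has grown" becomes a measurable condition rather than an impression.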

Sequencing vs. parallelism: the orchestration judgment

Inside the agent's execution of a plan, Replit faced the core multi-agent scheduling question: which steps can run concurrently and which must run in order?

The naive answer is "run everything in order, one step at a time." This is safe but slow. For a web app setup task — install dependencies, scaffold the routing layer, write the database schema, create the test harness — some of these steps are genuinely parallel (schema and test scaffolding can proceed together if they do not share state), and some are strictly sequential (you cannot run the tests until the routing layer exists).

Replit's agent, based on the public changelog and external testing reported by engineering commentators through 2024, uses a dependency-aware execution model: steps that share no output dependencies run concurrently; steps that consume another step's output wait for it. The orchestrator tracks which steps are blocked and which are ready.

This is standard in workflow automation systems, but it is non-obvious in LLM-orchestrated agents, because LLMs are not naturally good at dependency reasoning. The model asked "which of these five tasks can run in parallel" will sometimes get it right and sometimes miss a hidden dependency. Replit's architecture does not rely on the model to make that call at runtime: the task list is parsed into a dependency graph by a deterministic planning layer before execution begins. The LLM provides the task decomposition; the deterministic layer provides the scheduling. These are separate concerns handled by separate components.
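
A minimal sketch of that split, assuming the model's task list has already been turned into a mapping from each task to the tasks it depends on. The task names and the `schedule_batches` helper are hypothetical; the scheduling itself uses Python's standard-library `graphlib.TopologicalSorter`, which is one deterministic way to do this and not necessarily what Replit uses.

```python
from graphlib import TopologicalSorter


def schedule_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within a batch share no dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    batches: list[list[str]] = []
    while sorter.is_active():
        # Every task whose dependencies are all finished is ready to run now.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch complete, unblocking successors
    return batches


# Hypothetical plan: each task maps to the set of tasks it must wait for.
plan = {
    "install_deps": set(),
    "scaffold_routes": {"install_deps"},
    "write_schema": {"install_deps"},
    "create_test_harness": {"install_deps"},
    "run_tests": {"scaffold_routes", "write_schema", "create_test_harness"},
}

batches = schedule_batches(plan)
```

Here the three middle tasks land in the same batch and could run concurrently, while `run_tests` waits for all of them. Running in whole batches is a simplification: a production scheduler would more likely start each task the moment its own dependencies finish.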

The PM lesson here is structural. Multi-agent orchestration that hands scheduling decisions to the LLM at runtime accumulates errors. Scheduling is a problem that has reliable algorithmic solutions; it does not benefit from model creativity. The split — LLM for reasoning, deterministic component for scheduling — is the right architecture because it applies the right tool to each sub-problem.

State passing: what agents need to know about each other

When a sub-task in a multi-step plan completes, the orchestrating agent needs to know two things: what was produced, and what state was changed. If task 2 installs a package and task 4 calls a function from that package, task 4 needs to know that the package is now available. If task 3 modifies a configuration file and task 6 reads that file, task 6 needs to know the modified state, not the original.

This is the state-passing problem. In a local coding session, the developer carries this context in their head. In an orchestrated agent system, it has to be passed explicitly — either through a shared representation of the project's current state, or through task outputs being fed as inputs to downstream tasks, or through the orchestrator maintaining a live model of what has changed.

Replit's approach, described in technical blog posts, uses the file system itself as the primary state-passing medium. When task 2 installs a package and updates package.json, the new package.json is the artifact that downstream tasks read. The orchestrator does not maintain a separate state model — the project directory IS the state. This is architecturally simple and avoids the synchronisation problem that arises when you maintain a separate state representation that can diverge from reality.

The tradeoff is visibility. If the file system is the state, a human inspecting the agent's progress during execution sees the current files — which tells them what has been done, but not necessarily why, or what the downstream tasks are expecting. Replit's agent shows a running log of tool calls and outputs alongside the file changes, so the user can trace the causal chain. The log is the narrative of the execution; the files are its result.
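
To make the file-system-as-state idea concrete, here is a small sketch in which a downstream step learns what is installed by reading `package.json` from disk rather than from an in-memory record. The helper names and the example package are assumptions for illustration, and `record_install` only stands in for a real install step.

```python
import json
import tempfile
from pathlib import Path


def declared_dependencies(project_dir: Path) -> dict[str, str]:
    """Read the dependency map straight from the manifest on disk."""
    manifest = project_dir / "package.json"
    if not manifest.exists():
        return {}
    data = json.loads(manifest.read_text(encoding="utf-8"))
    return data.get("dependencies", {})


def record_install(project_dir: Path, name: str, version: str) -> None:
    """Stand-in for an install step: the only record it leaves is the manifest."""
    manifest = project_dir / "package.json"
    data = json.loads(manifest.read_text(encoding="utf-8")) if manifest.exists() else {}
    data.setdefault("dependencies", {})[name] = version
    manifest.write_text(json.dumps(data, indent=2), encoding="utf-8")


# Usage: the later step sees the earlier step's effect only through the file.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    before = declared_dependencies(project)
    record_install(project, "express", "^4.18.0")
    after = declared_dependencies(project)
```

Because nothing is cached between steps, there is no second copy of the state that could drift from what is actually in the project directory.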

Failure handling: the kill-switch design

The most significant product decision in a live-execution agent system is what to do when something goes wrong in the middle of a sequence.

The naive failure mode in multi-step execution is a cascade: task 3 fails, the orchestrator retries, the retry fails, the orchestrator tries a different approach, the different approach introduces a new error, the new error causes task 4 to fail on a state it did not expect, and by the time the session ends the project is in an inconsistent state that neither the agent nor the user fully understands.

Replit's agent design, as documented and observed in live operation, applies three principles to prevent this cascade.

Bounded retry. Each sub-task has a retry ceiling — typically two to three attempts on the same approach before the orchestrator escalates rather than retries. Escalation here means: pause execution, describe the failure to the user in plain language, and ask for direction. The user can say "try a different approach," "skip this step," or "stop here." This keeps the human in the loop on failures without requiring them to be in the loop on successes.
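
A bounded retry with escalation can be sketched as follows. The names `Escalation` and `run_with_retry_ceiling` are hypothetical, and the default ceiling of three is simply taken from the "two to three attempts" range mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Escalation:
    """What the user is shown when the retry ceiling is reached."""
    task: str
    attempts: int
    last_error: str
    options: tuple[str, ...] = ("try a different approach", "skip this step", "stop here")


def run_with_retry_ceiling(
    task: str, action: Callable[[], str], max_attempts: int = 3
) -> str | Escalation:
    last_error = ""
    for _ in range(max_attempts):
        try:
            return action()  # success: no human involvement needed
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    # Ceiling reached: hand the decision to the user instead of retrying again.
    return Escalation(task=task, attempts=max_attempts, last_error=last_error)
```

The function never switches to a different approach on its own. Once the ceiling is hit, the next move belongs to the user.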

Reversibility check before irreversible actions. Before any action that cannot be undone — deleting a file, committing to an external service, modifying a configuration that affects other projects — the agent confirms with the user. This is not a general confirmation-on-everything policy (which would make the agent useless) but a targeted one: the agent is capable of distinguishing, at the action level, between writes that can be reverted and writes that cannot.
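
One way to express a targeted confirmation gate is a lookup from action type to reversibility, with confirmation required only for the irreversible class. The action names and their classification below are assumptions made for the example (for instance, treating a file write as revertible presumes some form of checkpointing), not a description of Replit's internals.

```python
from enum import Enum
from typing import Callable


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical classification of action types.
ACTION_CLASSES = {
    "write_file": Reversibility.REVERSIBLE,
    "install_package": Reversibility.REVERSIBLE,
    "delete_file": Reversibility.IRREVERSIBLE,
    "call_external_api": Reversibility.IRREVERSIBLE,
}


def execute(action: str, run: Callable[[], str], confirm: Callable[[str], bool]) -> str:
    # Unknown action types are treated as irreversible, so the gate fails closed.
    kind = ACTION_CLASSES.get(action, Reversibility.IRREVERSIBLE)
    if kind is Reversibility.IRREVERSIBLE and not confirm(action):
        return "skipped: user declined"
    return run()
```

Reversible actions never reach `confirm`, which is what keeps the gate from turning into confirmation-on-everything.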

Hard stop on state inconsistency. If the orchestrator detects that the project's actual state has diverged from what the plan assumes (the file it expected to exist does not exist, the package version it expected to be installed is not installed), it stops execution entirely and reports the inconsistency rather than continuing on wrong assumptions. A partial execution that the user can diagnose is better than a full execution that produces a project that silently does not work.
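
The hard stop can be sketched as a precondition check that runs before every step. In this hypothetical version each step declares the files it expects to find, and the runner halts with a report as soon as one is missing. The step names, file names, and `run_steps` helper are all invented for the example.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable

# (step name, files that must already exist, action to run)
Step = tuple[str, list[str], Callable[[Path], None]]


def missing_files(project_dir: Path, required: list[str]) -> list[str]:
    return [name for name in required if not (project_dir / name).exists()]


def run_steps(project_dir: Path, steps: list[Step]) -> tuple[list[str], str | None]:
    """Run steps in order; stop and report at the first unmet expectation."""
    completed: list[str] = []
    for name, required, action in steps:
        missing = missing_files(project_dir, required)
        if missing:
            # Halt before acting on a wrong assumption; report what was expected.
            return completed, f"stopped before '{name}': expected {', '.join(missing)} to exist"
        action(project_dir)
        completed.append(name)
    return completed, None


def write_schema(project_dir: Path) -> None:
    (project_dir / "schema.sql").write_text("CREATE TABLE contacts (id INTEGER);", encoding="utf-8")


def write_tests(project_dir: Path) -> None:
    (project_dir / "test_routes.py").write_text("# tests go here\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    completed, problem = run_steps(
        Path(tmp),
        [
            ("write_schema", [], write_schema),
            ("write_tests", ["schema.sql", "routes.py"], write_tests),
        ],
    )
```

In this run the second step is never executed because `routes.py` was never created, and the user gets both the list of completed steps and a plain-language reason for the stop.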

These three together constitute a production harness for multi-step execution. The goal is not zero failures — that is not achievable. The goal is failures that are detectable, localised, and recoverable.

What the agentic shift cost Replit

Shipping an agentic system inside a consumer-facing coding environment required Replit to make a bet most AI product teams avoid making explicitly: it decided to let the agent run the code it wrote, not just propose it.

The risk is asymmetric. When a code suggestion is wrong, the cost is the user's time to diagnose and fix it. When a code execution is wrong, the cost can include modified files, failed tests, broken dependencies, or — in the worst case — actions taken against external services that the code called. Replit's container sandbox mitigates most of these, but the sandbox is not hermetic: the agent can and does reach external services when the user's task requires it, and actions taken against external services are not sandboxed.

The mitigation Replit chose was not stronger sandboxing — it was clearer communication. Every tool call the agent makes is logged in real time with the parameters used and the result returned. Every irreversible action is gated on a confirmation step. The agent's plan is visible before execution. The user's ability to stop the agent at any point is a persistent UI element throughout the session, not a buried setting.
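
A per-call log of this kind can be sketched with a decorator that records each tool's name, parameters, and result. `TOOL_LOG`, `logged_tool`, and `install_package` are hypothetical names, and an in-memory list stands in for whatever the real product streams to its UI.

```python
import functools
from typing import Any, Callable

# In-memory stand-in for a log that would be streamed to the user.
TOOL_LOG: list[dict[str, Any]] = []


def logged_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record every call to a tool with its parameters and outcome."""
    @functools.wraps(func)
    def wrapper(**params: Any) -> Any:
        entry = {"tool": func.__name__, "params": params, "result": None, "error": None}
        TOOL_LOG.append(entry)  # the entry is visible as soon as the call starts
        try:
            entry["result"] = func(**params)
            return entry["result"]
        except Exception as exc:
            entry["error"] = f"{type(exc).__name__}: {exc}"
            raise  # logging must not swallow the failure
    return wrapper


@logged_tool
def install_package(name: str) -> str:
    # Placeholder body; a real tool would invoke the package manager here.
    return f"installed {name}"
```

Calling `install_package(name="express")` appends one entry holding the tool name, the parameters, and the returned string. Keyword-only parameters keep each log entry self-describing.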

This is a trust-design choice more than a security-design choice. The agent is not prevented from doing harmful things — it is required to be transparent about what it is about to do, and the user is given the controls to stop it. The bet is that users who understand what the agent is doing will make better decisions than users who are shielded from that information but occasionally surprised by its consequences.

What the case teaches

The Replit Agent case is most useful for two things.

First, it is a worked example of the multi-agent architecture decisions that most textbook discussions of orchestration treat as pure engineering questions. The visibility of the plan, the dependency-based scheduling, the file-system-as-state approach, the bounded retry, the reversibility check — each of these is an architecture decision that a product team makes, and each has consequences for the user experience that the product team is responsible for. The engineering implementation follows from the product choice; the product choice should not be driven by what the engineering team finds easier to build.

Second, it is an honest account of what a production harness for multi-step agentic execution looks like before the agent has been deployed to millions of users. The Air Canada case shows what happens when a production harness is missing. The Replit case shows what a minimal production harness looks like when the team has thought carefully about what can go wrong — and has built the tools for the user to catch it before it does.

The lesson is not "Replit got it right." The lesson is: these decisions have to be made explicitly, they are product decisions, and the teams that make them implicitly — by shipping the execution capability and adding the harness later — are learning the same things that Klarna and Air Canada learned, at a cost proportional to how long it takes the data to surface the gap.