The gap between an impressive agent demo and an agent that holds up in production is wider than most teams realise. The demo runs once on a curated input, the model picks a good path, and the screenshot ships. The production system runs thousands of times a day on inputs nobody anticipated, and the model picks paths the demo team never tested. Reliability is not a tuning problem — it is an architecture problem, and a small set of patterns consistently separate the agents that hold up from the ones that quietly degrade.
1. Plan-Execute-Reflect, Not Plan-Execute
The naive agent loop is plan, then execute. The robust one is plan, execute, reflect — and only commit when the reflection step passes. After the agent generates an action, a separate reflection step (often run by a smaller, cheaper model) asks "Does this action actually accomplish the stated goal? Are there obvious problems?" The reflection step catches a meaningful percentage of bad actions before they reach production, and it does so at a fraction of the cost of fixing them after.
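A minimal sketch of the loop, with stand-in functions (`propose_action` and `reflect` are hypothetical placeholders for the planner and reviewer model calls, not any particular library's API):

```python
from dataclasses import dataclass


@dataclass
class Action:
    tool: str
    args: dict


def propose_action(goal: str) -> Action:
    # Stand-in for the planner model call. Here it proposes an
    # email with an empty body, to show the reflection step working.
    return Action(tool="send_email", args={"to": "team@example.com", "body": ""})


def reflect(goal: str, action: Action) -> tuple[bool, str]:
    # Stand-in for the cheaper reviewer model: checks the proposed
    # action against the goal before anything is committed.
    if action.tool == "send_email" and not action.args.get("body"):
        return False, "email body is empty"
    return True, "ok"


def plan_execute_reflect(goal: str) -> str:
    action = propose_action(goal)
    approved, reason = reflect(goal, action)
    if not approved:
        # The bad action never reaches the execution stage.
        return f"rejected before commit: {reason}"
    return f"executing {action.tool}"
```

The key design point is that `reflect` runs on the *proposed* action, before any side effect, so a rejection costs one extra model call rather than a cleanup.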
2. Bounded Tool Surfaces
Give the agent only the tools it actually needs for the current task, not the union of all tools your platform supports. A planning agent does not need write access to the database. A summarisation agent does not need outbound email. Every tool you do not give the agent is a class of failure mode you do not have to defend against. The framework cost of dynamic tool surfaces is small. The reliability gain is large.
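One way to sketch this, assuming a simple registry keyed by task type (the tool and task names are illustrative, not from any specific framework):

```python
# Full platform tool registry. The lambdas are placeholders for
# real tool implementations.
ALL_TOOLS = {
    "db_read": lambda query: ...,
    "db_write": lambda query: ...,
    "send_email": lambda message: ...,
    "summarise": lambda text: ...,
}

# Least-privilege allowlist per task type, not per platform.
TASK_TOOL_SURFACES = {
    "planning": {"db_read"},
    "summarisation": {"summarise"},
}


def tools_for_task(task: str) -> dict:
    """Return only the tools this task is entitled to use."""
    allowed = TASK_TOOL_SURFACES.get(task, set())
    return {name: fn for name, fn in ALL_TOOLS.items() if name in allowed}
```

With this in place, the planning agent physically cannot call `db_write` or `send_email` — those failure modes are removed at construction time, not guarded against at prompt time.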
3. Bounded Iteration
A surprising number of production agent failures come from infinite or near-infinite loops — the agent gets stuck retrying the same failing tool call, or oscillating between two near-equivalent actions. Set a hard maximum iteration count for any agent loop. Set a maximum total token budget. When either limit is hit, fall through to a defined error state rather than continuing. "The agent is still working" should never be a possible answer at minute 30.
A pattern we see repeatedly: a dev team builds an agent with no iteration cap because "it usually finishes in three steps." Then production traffic includes an edge case that takes 47 steps. The agent burns $80 of token cost per request, the user gets a timeout, and the team finds out about it from the next month's bill.
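A bounded loop can be as simple as the following sketch (the step-function signature and limits are illustrative assumptions):

```python
class AgentBudgetExceeded(Exception):
    """Raised when the agent hits a hard step or token cap."""


def run_agent(step_fn, max_steps: int = 10, max_tokens: int = 50_000):
    """Run step_fn until it returns a result, a step cap, or a token cap.

    step_fn takes the step index and returns (result_or_None, tokens_used).
    """
    tokens_used = 0
    for step in range(max_steps):
        result, tokens = step_fn(step)
        tokens_used += tokens
        if tokens_used > max_tokens:
            raise AgentBudgetExceeded(
                f"token budget exhausted after {step + 1} steps"
            )
        if result is not None:
            return result
    # Defined error state: the caller decides how to degrade,
    # but "still working" is never a possible answer.
    raise AgentBudgetExceeded(f"no answer after {max_steps} steps")
```

Both limits fall through to the same typed exception, so callers handle one defined error state instead of an open-ended timeout.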
4. Structured Memory, Not Unstructured Context
Letting the agent dump arbitrary intermediate results into the context window does not scale. The context grows. Cost grows. Quality degrades because the model has to reason over an increasingly cluttered prompt. Structured memory — a defined schema for what gets remembered between turns — is the alternative. Tool call results, decisions, and intermediate artefacts go into typed storage; the prompt sees only what is relevant to the current decision. This is the single biggest determinant of long-running agent quality.
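A minimal version of the idea, assuming typed records and a crude truncation policy (a real system would summarise rather than truncate, and retrieve by relevance rather than by tool name):

```python
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    tool: str
    summary: str  # short digest, never the raw output


@dataclass
class Memory:
    decisions: list[str] = field(default_factory=list)
    tool_results: list[ToolResult] = field(default_factory=list)

    def remember_tool_result(self, tool: str, raw_output: str,
                             max_chars: int = 200) -> None:
        # Raw output is digested before storage, so the context
        # window never sees the full dump.
        self.tool_results.append(ToolResult(tool, raw_output[:max_chars]))

    def relevant_context(self, tool: str) -> str:
        # The prompt sees prior decisions plus only the results
        # relevant to the current decision, not everything.
        lines = [r.summary for r in self.tool_results if r.tool == tool]
        return "\n".join(self.decisions + lines)
```

The prompt assembled from `relevant_context` stays roughly constant in size as the run gets longer, which is exactly the property the unstructured context window lacks.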
5. Human Checkpoints at Risk-Relevant Boundaries
Not every action needs human review. Most should not — that defeats the point of automation. But irreversible or high-impact actions (sending external email, executing a payment, modifying production systems, deleting data) should pause for explicit confirmation. The pattern is not "human in the loop" everywhere; it is "human in the loop at the moments where being wrong is expensive." Identify those moments per task and design checkpoints into the agent flow.
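The checkpoint gate can be sketched as a wrapper around tool execution (the risk list and callback signature are illustrative assumptions):

```python
# Which tools pause for a human is decided per task, up front.
HIGH_RISK_TOOLS = {"send_external_email", "execute_payment", "delete_data"}


def execute_with_checkpoint(tool: str, args: dict, confirm, run) -> str:
    """Run a tool, pausing for human confirmation if it is high-risk.

    confirm(tool, args) -> bool is the human approval callback;
    run(tool, args) -> str performs the actual tool call.
    """
    if tool in HIGH_RISK_TOOLS:
        if not confirm(tool, args):
            return "aborted: human declined"
    return run(tool, args)
```

Low-risk tools pass through untouched, so automation is only interrupted at the moments where being wrong is expensive.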
The five patterns, in brief:
- Plan-Execute-Reflect — second-pass verification before commit
- Bounded tool surfaces — least privilege per task, not per platform
- Bounded iteration — hard caps on steps and token budget, defined error state
- Structured memory — typed storage for intermediate state, lean prompt
- Risk-aware human checkpoints — explicit confirmation for irreversible actions
The Pattern That Underlies the Patterns
Every reliable production agent we have seen treats the LLM as a fast but unreliable component, not as the whole system. The patterns above surround the model with the safety, structure, and oversight that the model alone does not provide. Teams that try to make the model itself more reliable pour effort into prompt tuning and get incremental improvements. Teams that build reliable systems around the model produce reliable agents.