Forty-five agents in production, across garment manufacturing, financial services, legal, healthcare administration, e-commerce, and internal tooling. Here is what that experience distills into.
What we got right
Human escalation paths from day one. Every agent we have built has a defined escalation path — a situation in which the agent stops and hands control to a human. The agents where we designed this in from the start have dramatically lower incident rates than the ones where we added it later.
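A minimal sketch of what a designed-in escalation path can look like: a hard gate inside the agent loop that stops and raises instead of guessing. The names here (`EscalationRequired`, `handle_turn`, the trigger set, the confidence threshold) are illustrative assumptions, not a real framework.

```python
class EscalationRequired(Exception):
    """Raised when the agent must stop and hand control to a human."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason

# Situations that are human-only by design, decided before the agent is built.
ESCALATION_TRIGGERS = {"refund_over_limit", "legal_threat", "account_closure"}

def handle_turn(intent, confidence, threshold=0.7):
    """Return an agent action, or escalate when a trigger fires."""
    if intent in ESCALATION_TRIGGERS:
        raise EscalationRequired(f"intent '{intent}' is human-only")
    if confidence < threshold:
        raise EscalationRequired(f"confidence {confidence:.2f} below {threshold}")
    return {"action": "respond", "intent": intent}
```

The point is that escalation is a first-class code path the caller must handle, not an afterthought bolted onto error handling.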
Structured outputs over free text. Agents that produce structured JSON responses — rather than free-text answers — are dramatically easier to integrate with existing systems and to test. The overhead of defining the schema pays off within the first week.
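One way the schema overhead can look in practice, using only the standard library: validate the agent's JSON before it touches downstream systems, so a malformed response fails loudly at the boundary. The field names and types are hypothetical.

```python
import json

# Assumed output contract for this sketch; a real agent would have its own.
REQUIRED_FIELDS = {"intent": str, "answer": str, "needs_human": bool}

def parse_agent_output(raw):
    """Parse and validate an agent response; raise ValueError if malformed."""
    data = json.loads(raw)
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise ValueError(f"field '{field}' must be {ftype.__name__}")
    return data
```

Because the contract is explicit, integration tests can assert on fields instead of pattern-matching free text.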
Tight tool scopes. Agents with fewer, more specific tools perform better than agents with broad general capabilities. A customer support agent that can query your CRM, your knowledge base, and your ticketing system — and nothing else — is more reliable than one that can also browse the web and write code.
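A sketch of how a tight tool scope can be enforced rather than merely intended: the agent dispatches through an explicit allowlist, so anything outside its three systems is refused. The tool functions here are stubs with made-up names.

```python
# Stub tools standing in for real integrations.
def query_crm(customer_id):
    return {"customer": customer_id}

def search_kb(query):
    return {"results": [query]}

def update_ticket(ticket_id):
    return {"ticket": ticket_id}

# The support agent's entire world: three tools, nothing else.
SUPPORT_AGENT_TOOLS = {
    "query_crm": query_crm,
    "search_kb": search_kb,
    "update_ticket": update_ticket,
}

def call_tool(name, **kwargs):
    """Dispatch only tools in this agent's scope; refuse everything else."""
    tool = SUPPORT_AGENT_TOOLS.get(name)
    if tool is None:
        raise PermissionError(f"tool '{name}' is outside this agent's scope")
    return tool(**kwargs)
```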
What we would change
Testing earlier and more systematically. We now run structured reliability tests before any agent goes to production. Early agents did not get this — and several of them failed in ways that systematic testing would have caught.
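A structured reliability test can be as simple as a fixed case set run against the agent with a pass-rate gate before launch. The harness below is a minimal sketch; the agent callable, the case format, and the 95% threshold are assumptions for illustration.

```python
def run_reliability_suite(agent, cases, min_pass_rate=0.95):
    """Run (input, expected) cases against the agent.

    Returns (passed, pass_rate): passed is True only if the observed
    pass rate meets the launch gate.
    """
    results = [agent(prompt) == expected for prompt, expected in cases]
    rate = sum(results) / len(results)
    return rate >= min_pass_rate, rate
```

Even a gate this crude catches the failure class the source describes: agents shipping with behavior nobody exercised before production.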
Better observability from the start. Knowing that an agent ran is not the same as knowing what it did and why. We now log every tool call, every decision, and every escalation. Early agents had minimal logging, which made debugging production issues slow and expensive.
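Logging every tool call can be retrofitted cheaply with a decorator that records the name, arguments, outcome, and duration of each call. This sketch keeps records in an in-memory list for illustration; in production they would flow to your logging pipeline.

```python
import time

AUDIT_LOG = []  # stand-in for a real log sink

def logged(tool):
    """Wrap a tool so every call is recorded, success or failure."""
    def wrapper(**kwargs):
        start = time.monotonic()
        try:
            result = tool(**kwargs)
            AUDIT_LOG.append({"tool": tool.__name__, "args": kwargs,
                              "ok": True,
                              "ms": (time.monotonic() - start) * 1000})
            return result
        except Exception as exc:
            AUDIT_LOG.append({"tool": tool.__name__, "args": kwargs,
                              "ok": False, "error": str(exc)})
            raise
    return wrapper

@logged
def search_kb(query=""):
    # Stub tool; a real one would hit the knowledge base.
    return [query]
```

With this in place, "what did the agent do and why" becomes a query over the log instead of an archaeology project.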
Smaller scope launches. Agents that launched handling 100% of a workflow had worse outcomes than agents that launched handling 20% — with humans covering the rest — and then scaled up as confidence grew. The gradual rollout is now standard.
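One common mechanism for the 20%-then-scale-up pattern is deterministic percentage routing: a stable hash of the request key decides agent versus human, so the same customer always gets the same path while the percentage is dialed up. This is a generic sketch, not the author's specific rollout system.

```python
import hashlib

def route_to_agent(request_id, rollout_pct):
    """True if this request goes to the agent at the current rollout %.

    SHA-256 of the id gives a stable bucket in 0..65535; requests below
    the rollout cutoff go to the agent, the rest to humans.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return bucket < rollout_pct / 100 * 65536
```

Raising `rollout_pct` from 20 to 50 to 100 widens the agent's share without reshuffling which customers it already serves.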
The three decisions that determine most outcomes
First: is the workflow actually suitable for automation? Second: are the escalation paths designed before the agent is built? Third: is there a human who owns this agent's performance and is accountable for its errors? Get all three right and the agent will succeed. Miss any of them and you will be fixing it in production.