Engineering Reflection

What Broke When We Scaled Our Agent Systems to Production

A former backend lead reflects on two years of building autonomous agent infrastructure — the assumptions that failed, the patterns that survived, and the architecture that eventually held.

Anonymous, former backend lead
Agent Systems Infrastructure Post-mortem Scaling

We started with a single agent handling document processing. Within eighteen months, we had forty-seven agents across six teams, and the system was failing in ways none of us anticipated. This is what I learned.


Part I — Four Lessons

Things I Wish Someone Had Told Me

Lesson 01
Agents lie gracefully
When an agent fails, it often doesn't throw an error; it confabulates. It produces plausible-looking output that passes structural validation but is semantically wrong. We lost three weeks to a billing agent that was confidently generating invoices with invented line items.
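One way to catch this class of failure is to check the agent's output against a system of record, not only against a schema. The following is a minimal sketch under stated assumptions: the `LineItem` shape, the `order_items` mapping (SKU to billable amount in cents), and the function name are all hypothetical and are not taken from the billing system described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    sku: str
    amount_cents: int


def find_ungrounded_items(invoice_items, order_items):
    """Return (item, reason) pairs for invoice lines the source order does not support.

    `order_items` is a hypothetical system of record: SKU -> expected amount in cents.
    """
    problems = []
    for item in invoice_items:
        expected = order_items.get(item.sku)
        if expected is None:
            problems.append((item, "sku not present in source order"))
        elif expected != item.amount_cents:
            problems.append((item, f"amount {item.amount_cents} != expected {expected}"))
    return problems


# Example: one invented line item and one wrong amount.
order = {"SKU-1": 1200, "SKU-2": 450}
invoice = [LineItem("SKU-1", 1200), LineItem("SKU-9", 9900), LineItem("SKU-2", 500)]
issues = find_ungrounded_items(invoice, order)
```

Both flagged lines here would pass a schema check, since each has a well-formed SKU and an integer amount. Only the comparison against the order exposes them.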
Lesson 02
State is the enemy
Stateful agents are fragile agents. Every piece of conversation history we carried forward was another surface for drift. The agents that survived production were the ones we made deliberately forgetful, with state externalized into structured stores.
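As an illustration of what "deliberately forgetful" could look like, here is a hedged sketch in which each agent step loads a small structured record, updates it, and writes it back, with no transcript carried between calls. `TicketState`, `StateStore`, and `handle_message` are hypothetical names, and the in-memory dictionary stands in for a real database.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TicketState:
    ticket_id: str
    status: str = "new"
    facts: tuple = ()  # explicit, structured facts instead of raw conversation history


class StateStore:
    """In-memory stand-in for a durable store keyed by ticket id."""

    def __init__(self):
        self._rows = {}

    def load(self, ticket_id):
        # Unknown tickets start from a fresh default state.
        return self._rows.get(ticket_id, TicketState(ticket_id))

    def save(self, state):
        self._rows[state.ticket_id] = state


def handle_message(store, ticket_id, extracted_fact):
    """One stateless step: load structured state, update it, persist it.

    `extracted_fact` is assumed to come from a model call made elsewhere.
    """
    state = store.load(ticket_id)
    new_state = replace(state, status="in_progress", facts=state.facts + (extracted_fact,))
    store.save(new_state)
    return new_state
```

Because the only thing that survives between calls is the explicit record, anything that drifts has to show up in a field that can be inspected and diffed.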
Lesson 03
Eval is the real product
We spent months building agent capabilities and weeks on evaluation. The ratio should have been reversed. The teams that invested heavily in eval frameworks shipped fewer features but broke production far less often.
Lesson 04
Human fallbacks aren't optional
Every agent workflow needs a clear escalation path to a human. Not as a nice-to-have, but as a core architectural requirement. The moment we treated human-in-the-loop as a first-class citizen, our error rate dropped by an order of magnitude.
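A rough sketch of escalation as an explicit routing decision follows. It assumes each agent result carries a confidence score from the agent or a separate checker; `AgentResult`, `Decision`, `route_result`, and the 0.8 threshold are illustrative choices, not values from our system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    output: Optional[str]
    confidence: float  # assumed to be supplied by the agent or a separate checker


@dataclass
class Decision:
    route: str  # "auto" or "human"
    reason: str


def route_result(
    result: AgentResult,
    validate: Callable[[str], bool],
    min_confidence: float = 0.8,  # arbitrary illustrative threshold
) -> Decision:
    """Decide whether a result ships automatically or goes to a human queue."""
    if result.output is None:
        return Decision("human", "agent produced no output")
    if not validate(result.output):
        return Decision("human", "output failed validation")
    if result.confidence < min_confidence:
        return Decision("human", "confidence below threshold")
    return Decision("auto", "passed all checks")


decision = route_result(AgentResult("refund approved", 0.55), validate=lambda s: bool(s.strip()))
```

In this sketch the human path is the default for every doubtful case, and the automatic path has to be earned by passing every check.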

Part II — Four Failure Patterns

Recurring Ways It Went Wrong

Cascade Hallucination
Agent A sends a slightly wrong output to Agent B, which treats it as ground truth and amplifies the error. By the time Agent D acts on it, the data is fiction. Multi-agent pipelines need validation gates between every handoff.
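A validation gate between handoffs can be sketched as a pipeline runner that refuses to pass a payload along once a check fails. The stage names, the `run_pipeline` helper, and the toy `extract` and `total` steps below are hypothetical stand-ins for real agents.

```python
class HandoffRejected(Exception):
    """Raised when a stage's output fails its gate before the next stage runs."""


def run_pipeline(stages, payload):
    """Run (name, step, gate) stages in order, stopping at the first failed gate.

    `step` transforms the payload; `gate` returns a list of problems (empty means ok).
    """
    for name, step, gate in stages:
        payload = step(payload)
        problems = gate(payload)
        if problems:
            raise HandoffRejected(f"{name}: {'; '.join(problems)}")
    return payload


def extract(doc):
    return {"quantity": doc["qty"], "unit_price": doc["price"]}


def extract_gate(out):
    problems = []
    if out["quantity"] <= 0:
        problems.append("quantity must be positive")
    if out["unit_price"] < 0:
        problems.append("unit_price must be non-negative")
    return problems


def total(out):
    return {**out, "total": out["quantity"] * out["unit_price"]}


def total_gate(out):
    ok = out["total"] == out["quantity"] * out["unit_price"]
    return [] if ok else ["total does not match quantity * unit_price"]


STAGES = [("extract", extract, extract_gate), ("total", total, total_gate)]
result = run_pipeline(STAGES, {"qty": 2, "price": 5})
```

A bad quantity from the first stage raises `HandoffRejected` before the second stage ever sees it, which is the containment property the gates are there to provide.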
Prompt Drift
System prompts that worked in dev quietly degraded in production as upstream data distributions shifted. We had no monitoring for semantic drift — only latency and error codes. The agent was getting worse for weeks before anyone noticed.
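One coarse proxy for semantic drift is to compare the distribution of an agent's output labels in a recent window against a baseline window. The sketch below uses total variation distance and assumes outputs can be bucketed into discrete labels; the function names and the 0.2 alert threshold are illustrative assumptions, not what we ran in production.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the window."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(baseline, recent):
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    p = label_distribution(baseline)
    q = label_distribution(recent)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drift_alert(baseline, recent, threshold=0.2):
    """Flag the window when the label mix has moved more than `threshold`."""
    return total_variation_distance(baseline, recent) > threshold


baseline = ["approve"] * 8 + ["reject"] * 2
recent = ["approve"] * 5 + ["reject"] * 5
alert = drift_alert(baseline, recent)
```

A check like this does not say why outputs changed, only that the mix moved. That is still more than latency and error codes can show.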
Context Window Bloat
Engineers kept stuffing more examples and instructions into the context window. Performance improved on benchmarks but collapsed under real-world variance. The longest prompts had the worst production reliability.
Retry Storms
When an agent fails, the naive response is to retry. But LLM calls are expensive, rate-limited, and non-deterministic, and a retry only helps when the failure is transient. Our retry logic once generated a $14,000 API bill in six hours because a downstream service was returning malformed JSON, so every attempt failed the same way.
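A hedged sketch of bounded retries follows: a per-call attempt limit with exponential backoff, a shared budget that caps total retries, and an error type that is never retried. `RetryBudget`, `NonRetryableError`, and `call_with_retries` are hypothetical names; a real system would also need to classify provider errors into retryable and non-retryable.

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when the shared retry budget has been used up."""


class NonRetryableError(Exception):
    """A failure retrying cannot fix, e.g. a dependency returning malformed data."""


class RetryBudget:
    """Shared cap on the total number of retries across all callers."""

    def __init__(self, max_retries):
        self.remaining = max_retries

    def spend(self):
        if self.remaining <= 0:
            raise RetryBudgetExceeded("retry budget exhausted")
        self.remaining -= 1


def call_with_retries(fn, budget, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `fn`, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except NonRetryableError:
            raise  # deterministic failure: retrying only burns money
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts for this call
            budget.spend()  # raises RetryBudgetExceeded once the shared cap is hit
            sleep(base_delay * 2 ** attempt)
```

The shared budget is the part that bounds cost. Per-call limits alone do not help when thousands of calls each retry a few times against the same broken dependency.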

Part III — The Architecture Response

What We Built After Everything Broke

After the third major incident, we stopped adding features and spent a full quarter rebuilding the foundation. The architecture that emerged wasn't elegant, but it was honest about the failure modes of agent systems.

Core Architectural Principles


Final Thought
"The hard part of agent systems isn't making them smart. It's making them fail in ways you can understand, contain, and recover from."
— Reflections after two years in production