Engineering Reflection

What Broke When We Scaled Our Agent Systems to Production

A former backend lead reflects on two years of building autonomous agent infrastructure — the assumptions that failed, the patterns that survived, and the architecture that eventually held.

Anonymous, former backend lead

Tags: Agent Systems, Infrastructure, Post-mortem, Scaling

We started with a single agent handling document processing. Within eighteen months, we had forty-seven agents across six teams, and the system was failing in ways none of us anticipated. This is what I learned.

Part I — Four Lessons

Things I Wish Someone Had Told Me

Lesson 01

Agents lie gracefully

When an agent fails, it doesn't throw an error; it confabulates. It produces plausible-looking output that passes structural validation but is semantically wrong. We lost three weeks to a billing agent that was confidently generating invoices with invented line items.

Lesson 02

State is the enemy

Stateful agents are fragile agents. Every piece of conversation history we carried forward was another surface for drift. The agents that survived production were the ones we made deliberately forgetful, with state externalized into structured stores.
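As a rough illustration of "deliberately forgetful" agents, the sketch below keeps job state in a typed record and rebuilds each prompt from that record instead of from conversation history. The names (DocumentJobState, InMemoryStateStore, build_prompt) are hypothetical, and the in-memory dictionary is only a stand-in for whatever structured store is actually in use.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DocumentJobState:
    """Hypothetical typed state for one document-processing job."""
    job_id: str
    stage: str     # e.g. "extracted", "classified"
    fields: dict   # structured results from completed stages


class InMemoryStateStore:
    """Stand-in for a real database or key-value store."""

    def __init__(self):
        self._rows = {}

    def save(self, state: DocumentJobState) -> None:
        # Serialize to JSON so only schema-shaped data survives a round trip.
        self._rows[state.job_id] = json.dumps(asdict(state))

    def load(self, job_id: str) -> DocumentJobState:
        return DocumentJobState(**json.loads(self._rows[job_id]))


def build_prompt(state: DocumentJobState, instruction: str) -> str:
    """Build a fresh prompt from stored state; no chat history is carried over."""
    return (
        f"{instruction}\n"
        f"Known fields: {json.dumps(state.fields, sort_keys=True)}\n"
        f"Current stage: {state.stage}"
    )


store = InMemoryStateStore()
store.save(DocumentJobState("job-1", "extracted", {"vendor": "Acme", "total": 120.0}))
prompt = build_prompt(store.load("job-1"), "Classify this document.")
print(prompt)
```

Because each call starts from the stored record, anything the agent said in an earlier turn can only influence later steps if it was written into the schema first.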

Lesson 03

Eval is the real product

We spent months building agent capabilities and weeks on evaluation. The ratio should have been reversed. The teams that invested heavily in eval frameworks shipped fewer features but broke production far less often.
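One way an eval framework can start small is a list of cases, each pairing an input with a semantic check, plus a pass-rate gate on releases. This is a minimal sketch under assumptions: EvalCase, run_eval, gate_release, the 0.95 threshold, and the fake agent are all illustrative and do not come from the original system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # semantic assertion on the agent output


def run_eval(agent: Callable[[str], str], cases: list) -> dict:
    """Run every case through the agent and report pass rate and failures."""
    if not cases:
        raise ValueError("at least one eval case is required")
    failures = [c.name for c in cases if not c.check(agent(c.input_text))]
    return {"pass_rate": (len(cases) - len(failures)) / len(cases),
            "failures": failures}


def gate_release(report: dict, min_pass_rate: float = 0.95) -> bool:
    """Block a prompt or model change when the pass rate falls below the bar."""
    return report["pass_rate"] >= min_pass_rate


def fake_agent(text: str) -> str:
    # Stand-in for a real model call: a brittle keyword classifier.
    return "refund" if "money back" in text else "other"


cases = [
    EvalCase("refund_intent", "I want my money back", lambda out: out == "refund"),
    EvalCase("greeting", "hello there", lambda out: out == "other"),
    EvalCase("refund_paraphrase", "please reimburse me", lambda out: out == "refund"),
]
report = run_eval(fake_agent, cases)
print(report["failures"], gate_release(report))
```

The paraphrase case fails here, which is the point: the suite catches a regression in meaning that a schema check alone would not.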

Lesson 04

Human fallbacks aren't optional

Every agent workflow needs a clear escalation path to a human. Not as a nice-to-have, but as a core architectural requirement. The moment we treated human-in-the-loop as a first-class citizen, our error rate dropped by an order of magnitude.
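Treating escalation as part of the architecture can be as simple as a router that every result must pass through. The sketch below assumes each result carries a confidence score and a validation flag; both fields, the 0.8 threshold, and the Router class are hypothetical, and the deque stands in for a real review queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    task_id: str
    output: str
    confidence: float    # assumed to come from the agent or a separate scorer
    validation_ok: bool  # outcome of a deterministic check on the output


@dataclass
class Router:
    min_confidence: float = 0.8
    human_queue: deque = field(default_factory=deque)

    def route(self, result: AgentResult) -> str:
        """Send anything invalid or low-confidence to a human reviewer."""
        if not result.validation_ok or result.confidence < self.min_confidence:
            self.human_queue.append(result)
            return "human"
        return "auto"


router = Router()
decisions = [
    router.route(AgentResult("t1", "approve", 0.95, True)),
    router.route(AgentResult("t2", "approve", 0.55, True)),
    router.route(AgentResult("t3", "reject", 0.99, False)),
]
print(decisions)
```

Note that the third result is escalated despite its high confidence: a failed validation overrides whatever the agent reports about itself.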

Part II — Four Failure Patterns

Recurring Ways It Went Wrong

Cascade Hallucination

Agent A sends a slightly wrong output to Agent B, which treats it as ground truth and amplifies the error. By the time Agent D acts on it, the data is fiction. Multi-agent pipelines need validation gates between every handoff.
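A validation gate between handoffs might look like the following sketch, where each stage is paired with a deterministic check and the pipeline stops at the first rejected handoff. The stage functions, the KNOWN_SKUS catalog, and the HandoffRejected exception are illustrative assumptions; real stages would wrap model calls.

```python
class HandoffRejected(Exception):
    """Raised when a stage output fails its gate."""


def run_pipeline(payload: dict, steps: list) -> dict:
    """Run (name, stage, gate) steps in order; a gate returns a list of problems."""
    for name, stage, gate in steps:
        payload = stage(payload)
        problems = gate(payload)
        if problems:
            raise HandoffRejected(f"{name}: {'; '.join(problems)}")
    return payload


KNOWN_SKUS = {"A-1", "B-2"}  # hypothetical product catalog


def extract(payload: dict) -> dict:
    # Stand-in for agent A; it invents a line item that is not in the catalog.
    items = [{"sku": "A-1", "price": 10.0}, {"sku": "Z-9", "price": 5.0}]
    return {**payload, "items": items}


def items_gate(payload: dict) -> list:
    return [f"unknown sku {i['sku']}" for i in payload["items"]
            if i["sku"] not in KNOWN_SKUS]


def total(payload: dict) -> dict:
    # Stand-in for agent B, which would otherwise trust the items as given.
    return {**payload, "total": sum(i["price"] for i in payload["items"])}


def total_gate(payload: dict) -> list:
    return [] if payload["total"] >= 0 else ["negative total"]


try:
    run_pipeline({"order_id": "o-1"},
                 [("extract", extract, items_gate), ("total", total, total_gate)])
    outcome = "accepted"
except HandoffRejected as exc:
    outcome = str(exc)
print(outcome)
```

The invented line item is rejected at the first handoff, so the second stage never gets the chance to compute a total from it.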

Prompt Drift

System prompts that worked in dev quietly degraded in production as upstream data distributions shifted. We had no monitoring for semantic drift — only latency and error codes. The agent was getting worse for weeks before anyone noticed.
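The post does not describe the monitoring that was eventually added, so the following is only one possible approach: if agent outputs can be reduced to a categorical label, compare the recent label distribution against a baseline using total variation distance and alert past a threshold. The function names and the 0.2 threshold are assumptions.

```python
from collections import Counter


def label_distribution(labels: list) -> dict:
    """Convert a list of output labels into relative frequencies."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Half the L1 distance between two distributions: 0 is identical, 1 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drift_alert(baseline: list, recent: list, threshold: float = 0.2) -> bool:
    distance = total_variation(label_distribution(baseline),
                               label_distribution(recent))
    return distance > threshold


baseline = ["invoice"] * 70 + ["receipt"] * 30
recent = ["invoice"] * 40 + ["receipt"] * 30 + ["unknown"] * 30
distance = total_variation(label_distribution(baseline), label_distribution(recent))
print(round(distance, 3), drift_alert(baseline, recent))
```

A check like this runs on outputs rather than on latency or error codes, so it can fire while every request is still returning HTTP 200.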

Context Window Bloat

Engineers kept stuffing more examples and instructions into the context window. Performance improved on benchmarks but collapsed under real-world variance. The longest prompts had the worst production reliability.

Retry Storms

When an agent fails, the naive response is to retry. But LLM calls are expensive, rate-limited, and non-deterministic. Our retry logic once generated a $14,000 API bill in six hours because a downstream service kept returning malformed JSON, and each failure triggered another round of retries.
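A retry wrapper that could not have produced that bill needs at least two properties: a hard spending cap and a distinction between transient and deterministic failures. The sketch below is an assumption about how that might be written, not the original fix. Costs are tracked in whole cents, timeouts are retried with exponential backoff, and malformed JSON is treated as non-retryable; call_with_budget and both exception classes are hypothetical names.

```python
import json
import time


class BudgetExceeded(Exception):
    """Raised when another attempt would exceed the cost cap."""


class NonRetryable(Exception):
    """Raised for failures that a retry is unlikely to fix."""


def call_with_budget(call, *, max_attempts=3, cost_cents=5, budget_cents=10,
                     base_delay=0.5, sleep=time.sleep):
    """Call `call()` until it returns parseable JSON, within attempt and cost limits."""
    spent = 0
    for attempt in range(max_attempts):
        if spent + cost_cents > budget_cents:
            raise BudgetExceeded(f"spent {spent} of {budget_cents} cents")
        spent += cost_cents
        try:
            raw = call()
        except TimeoutError:
            # Transient failure: wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
            continue
        try:
            return json.loads(raw), spent
        except json.JSONDecodeError as exc:
            # Deterministic failure: retrying would only add cost.
            raise NonRetryable("malformed JSON; not retrying") from exc
    raise RuntimeError("retries exhausted")


attempts = {"n": 0}


def flaky():
    # Stand-in for an LLM call that times out once, then succeeds.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("transient")
    return '{"ok": true}'


result, spent_cents = call_with_budget(flaky, sleep=lambda seconds: None)
print(result, spent_cents)
```

The injectable sleep parameter exists only so the example and its tests run instantly; production code would leave the default.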

Part III — The Architecture Response

What We Built After Everything Broke

After the third major incident, we stopped adding features and spent a full quarter rebuilding the foundation. The architecture that emerged wasn't elegant, but it was honest about the failure modes of agent systems.

Core Architectural Principles

Every agent output passes through a deterministic validator before reaching the next stage
State is stored externally in typed schemas, never in conversation history
All LLM calls sit behind circuit breakers with exponential backoff and cost caps
Semantic regression tests, not just functional tests, run on every prompt change
Every new agent runs in shadow mode, in parallel with human decisions, for at least two weeks
Observability covers meaning, not just metrics: we log what the agent decided and why
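Of the principles above, the circuit breaker is the one most easily shown in a few lines. The sketch below is a generic version of the pattern, not the implementation described in this post: after a set number of consecutive failures it rejects calls outright until a cool-down passes, then lets a single trial call through. Backoff and cost caps are left out to keep it short, and the class name, thresholds, and injectable clock are assumptions.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("skipping call while breaker is open")
            # Half-open: allow one trial call; a single failure reopens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


now = [0.0]  # fake clock so the example runs without waiting
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])


def failing():
    raise RuntimeError("model unavailable")


events = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpen:
        events.append("skipped")
    except RuntimeError:
        events.append("failed")

now[0] = 11.0  # cool-down has elapsed
events.append(breaker.call(lambda: "ok"))
print(events)
```

The third call is skipped without reaching the model at all, which is what keeps a persistent outage from turning into a stream of billable requests.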

Final Thought

"The hard part of agent systems isn't making them smart. It's making them fail in ways you can understand, contain, and recover from."

— Reflections after two years in production