Discussion about this post

Scenarica

The compound cascade section is the part that separates this from every other "agents in production" post. Most teams debug these five failure modes sequentially. Fix context rot, then fix tool reliability, then add evals, then add observability. That sequencing is itself a failure mode because fixing one in isolation changes the behaviour of the others.

Context rot is the clearest example. You add a summarisation step to compress the context window. Quality recovers. But summarisation changes what the tool layer receives as input, which changes tool behaviour, which shifts what your evals are actually measuring, which means your evaluation baseline is now stale. You fixed one problem and silently invalidated your ability to detect three others. Most teams discover this around week four when the metrics look fine and the users are complaining.
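A toy sketch of how that invalidation sneaks in, assuming a hypothetical pipeline where a summarisation step now sits in front of the tool layer (the helper names and fixture shape below are invented for illustration, not from the post):

```python
# Toy sketch: inserting a summarisation step changes what the tool layer
# receives, so eval fixtures recorded against raw context quietly go stale.
# All names (summarise, build_tool_input, EVAL_FIXTURES) are illustrative.

LONG_CONTEXT = "user: question\nassistant: answer\n" * 500  # stand-in transcript

def summarise(context: str, max_chars: int = 2_000) -> str:
    # Stand-in for a real summariser: keep only the most recent characters.
    return context[-max_chars:]

def build_tool_input(context: str, compress: bool) -> str:
    # The context-rot "fix" silently rewrites the tool layer's input contract.
    return summarise(context) if compress else context

# Fixtures captured before the change pair *raw* context with expected tool calls.
EVAL_FIXTURES = [
    {"context": LONG_CONTEXT, "expected_tool_call": "search('refund policy')"},
]

for fixture in EVAL_FIXTURES:
    old_input = build_tool_input(fixture["context"], compress=False)
    new_input = build_tool_input(fixture["context"], compress=True)
    if old_input != new_input:
        # The eval still grades against behaviour derived from old_input,
        # so its pass rate no longer reflects what the tools actually see.
        print("stale fixture:", fixture["expected_tool_call"])
```

The point of the sketch is only that the eval fixtures and the live pipeline drift apart without any test failing, which is exactly the week-four scenario.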

The 38% do-nothing agent stat deserves more attention than a footnote. If an agent that takes no action passes 38% of a benchmark, the benchmark is largely measuring how often the starting conditions already satisfy the success criteria, not capability. Every team running evals against public benchmarks should run the null-agent test against their own suite before trusting the numbers.
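A minimal sketch of that null-agent check, assuming a hypothetical harness where each case carries a task and you have a grade(case, output) callable (both are assumptions here, adapt to your own suite):

```python
# Null-agent check: run a do-nothing agent over your own eval suite and see
# how much of it "passes". The harness shape (cases with a "task" key and a
# grade(case, output) callable) is assumed for this sketch.

class NullAgent:
    """Takes no actions and returns an empty answer."""
    def run(self, task: str) -> str:
        return ""

def null_agent_pass_rate(cases, grade) -> float:
    agent = NullAgent()
    passed = sum(1 for case in cases if grade(case, agent.run(case["task"])))
    return passed / max(len(cases), 1)

# If this number lands anywhere near your real agent's score, the suite is
# rewarding favourable starting conditions rather than capability.
```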

The nondeterminism framing is the line I'd put in front of every product leader reading this. "The model will not get more consistent. The product needs to be designed for the model it already has." That sentence resolves about six months of engineering arguments in most organisations.
