Discussion about this post

H. Floyd

Infrastructure metrics looked normal the whole time Kiro was failing. That's the real problem. The evaluation surface didn't match what the agent was actually doing.

Benchmarks score single outputs against known answers, but production agents run decision chains against shifting contexts. Those are different measurement regimes, and instruments built for one miss the other. The discipline forming around this needs evaluation infrastructure that traces decision chains at the process level, not just output scores.
