I like the harness emphasis. "Self-improving" only means something useful if the loop has a target, a measurement surface and a stop rule. Otherwise the agent is just optimising inside whatever blind spots the harness failed to name.
Big agreement that the harness is the variable and the loop is the agent. The thing I'd press: the whole loop rests on the evaluator, which makes it both the highest-leverage layer and the single point of failure.
A self-improving loop optimizes against the evaluator, so it compounds whatever the evaluator measures, not quality, and those coincide only while the evaluator is a faithful proxy. Once it's a finite scorer or an AI judge inside the loop, the agent's pressure points at the evaluator, not the goal, and they diverge right where you can't see, the same green-dashboard failure you're escaping.
Self-improving and self-deceiving are one architecture with a good evaluator versus a gameable one. Your own line gives the tell: "tested against the trace history it came from" is in-sample validation, so it won't catch a fix that closes the cluster by gaming the evaluator.
The real requirement is a check outside the loop, one the agent isn't optimizing against. What do think is the best way to keep the evaluator honest once the agent is shipping changes scored by it?
The closed-loop pattern is the part that feels portable into clinical workflows. For an OR readiness agent, the question would be less whether the model improves in the abstract and more whether the harness can preserve a trace: what mismatch it saw, what human review point it reached, and whether the next case had fewer unresolved assumptions.
I like the harness emphasis. "Self-improving" only means something useful if the loop has a target, a measurement surface and a stop rule. Otherwise the agent is just optimising inside whatever blind spots the harness failed to name.
What does any of this mean for regular folks?
Big agreement that the harness is the variable and the loop is the agent. The thing I'd press: the whole loop rests on the evaluator, which makes it both the highest-leverage layer and the single point of failure.
A self-improving loop optimizes against the evaluator, so it compounds whatever the evaluator measures, not quality, and those coincide only while the evaluator is a faithful proxy. Once it's a finite scorer or an AI judge inside the loop, the agent's pressure points at the evaluator, not the goal, and they diverge right where you can't see, the same green-dashboard failure you're escaping.
Self-improving and self-deceiving are one architecture with a good evaluator versus a gameable one. Your own line gives the tell: "tested against the trace history it came from" is in-sample validation, so it won't catch a fix that closes the cluster by gaming the evaluator.
The real requirement is a check outside the loop, one the agent isn't optimizing against. What do think is the best way to keep the evaluator honest once the agent is shipping changes scored by it?
The closed-loop pattern is the part that feels portable into clinical workflows. For an OR readiness agent, the question would be less whether the model improves in the abstract and more whether the harness can preserve a trace: what mismatch it saw, what human review point it reached, and whether the next case had fewer unresolved assumptions.