Discussion about this post

Scenarica

The four-scope breakdown is the most useful framework in here, but honestly the part that landed hardest was something almost buried in the middle: stale memory produces "confidently wrong" outputs, because high relevance plus incorrect information doesn't signal uncertainty. That's the failure mode that will cause the first serious production incident in an enterprise agent deployment, and it'll happen precisely because the system looks like it's working perfectly right up until it isn't.

The memory vs context distinction is one that most teams are still getting wrong in practice. They see a million-token context window and assume the memory problem is solved. It isn't. It's masked. The window holds everything but weighs nothing. Memory is supposed to be the system's judgment about what matters, and judgment requires governance that a context window doesn't provide.

The Karpathy wiki framing is the one I keep coming back to. Ingest, query, lint. Three verbs that describe what most teams think they're doing with RAG but actually aren't, because RAG retrieves without evaluating whether what it retrieved is still true. The lint step is where the real work lives and it's the step almost nobody has built yet.
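To make the lint step concrete, here is a minimal sketch of what I mean, not a reference implementation. It assumes each memory entry carries a `last_verified` timestamp and that a 30-day window is a reasonable staleness cutoff; `MemoryEntry`, `lint`, and `max_age` are all names I made up for illustration. The point is that the check ignores relevance entirely, so a high-scoring but outdated entry still gets flagged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryEntry:
    text: str
    relevance: float          # retrieval similarity score, 0..1 (hypothetical field)
    last_verified: datetime   # when someone or something last confirmed it (hypothetical field)


def lint(entries, now, max_age=timedelta(days=30)):
    """Split entries into (fresh, stale) by verification age, ignoring relevance."""
    fresh, stale = [], []
    for entry in entries:
        if now - entry.last_verified > max_age:
            stale.append(entry)
        else:
            fresh.append(entry)
    return fresh, stale


now = datetime(2025, 1, 31)
entries = [
    # Highly relevant, but nobody has confirmed it in months.
    MemoryEntry("Deploys go through the staging cluster", 0.95, datetime(2024, 6, 1)),
    # Less relevant, but recently confirmed.
    MemoryEntry("On-call rotation changes on Mondays", 0.60, datetime(2025, 1, 20)),
]
fresh, stale = lint(entries, now)
for entry in stale:
    print(f"needs re-verification: {entry.text} (relevance {entry.relevance})")
```

An age cutoff is the crudest possible lint rule. It only stands in for whatever truth check a real system would run, but even this much is enough to stop a retriever from treating a high similarity score as evidence that the content is current.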

Pawel Jozefiak

Calling memory a product surface is the framing I wish I had at month one. The four scopes map almost exactly onto what I ended up with after running mine for half a year (https://thoughts.jock.pl/p/i-built-a-self-improving-ai-agent). The part that bit me hardest was governance on the write path. Without an explicit graduation step from the short-term scratchpad into the always-loaded rules, the agent learns lessons from one bad run and applies them forever. I track corrections per week as a health metric now. 22 in 30 days, 18 in the last 7. That is not the agent getting worse; that is me finally seeing things the bad runs were hiding.
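A rough sketch of what a graduation step could look like, under my own assumptions rather than as a description of the setup in the linked post: a lesson stays in the scratchpad until it has shown up in a minimum number of distinct runs, and only then moves into the always-loaded rules. `LessonGate`, `min_runs`, and the threshold of three are hypothetical choices for illustration.

```python
from collections import defaultdict


class LessonGate:
    """Hold lessons in a scratchpad until enough distinct runs support them."""

    def __init__(self, min_runs=3):
        self.min_runs = min_runs
        self.scratchpad = defaultdict(set)  # lesson -> ids of runs that produced it
        self.rules = set()                  # always-loaded rules

    def observe(self, lesson, run_id):
        # Record that a run suggested this lesson; repeats within one run count once.
        self.scratchpad[lesson].add(run_id)

    def graduate(self):
        # Promote only lessons seen in at least min_runs distinct runs.
        promoted = sorted(
            lesson
            for lesson, runs in self.scratchpad.items()
            if len(runs) >= self.min_runs and lesson not in self.rules
        )
        self.rules.update(promoted)
        return promoted


gate = LessonGate(min_runs=3)
gate.observe("never retry failed API calls", run_id=1)   # one bad run only
for run_id in (1, 2, 3):
    gate.observe("check the lockfile before installing", run_id=run_id)

promoted = gate.graduate()
print(promoted)
```

Counting distinct runs is only one possible gate. A human review step or a test that the rule actually improves outcomes would be stricter, but even a count keeps a single bad run from writing a permanent rule.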
