In the Age of Agentic Engineering, Context Is Your Real Product
What every product leader needs to understand about shipping AI that actually works
TLDR: AI products break in production not because the model fails, but because the context around it was never designed. This post is for product leaders and engineers building AI features who keep shipping demos that fall apart under real users. The takeaways are practical: a shared vocabulary for context failures, three mental models for designing around them, and a pre-launch stress-test ritual. The model is not your product. The context you give it is.
The Demo Always Works
A product team spends three weeks building an AI customer support agent. Internal testing goes well. The model handles edge cases, stays on topic, and generates responses that feel genuinely helpful.
The team ships it.
Two weeks later, the support queue fills with complaints. The agent is confidently answering questions users never fully asked. It assigns ownership to problems nobody claimed. Users stop trusting the product entirely.
What happened?
Nobody changed the model. What broke was something nobody had examined in the first place.
Marily Nika, a former AI Product Lead at Google and Meta, watched the same sequence repeat across teams: an AI feature that worked beautifully in controlled conditions broke in production.
Why?
Because the failure modes were visible before launch, but no one knew where to look for them.
Simon Willison describes the same gap from the engineering side: the bottleneck in AI development is no longer writing code. It is giving the agent the right environment to produce output that actually works.
That environment is called context. Everything that follows explains why it is your real product.
What Agentic Engineering Actually Is
Agentic engineering is the practice of building software using coding agents — tools like Claude Code, Cursor, and OpenAI Codex — where the agent generates code, executes it, runs tests, and iterates independently between turns. The human sets objectives and maintains oversight. The agent acts.
Simon Willison distinguishes this sharply from vibe coding, where you prompt, accept, and hope.
Addy Osmani puts the operational difference plainly: the single biggest differentiator is testing. A solid test suite lets an agent iterate until it passes. Without one, it declares broken code done.
That distinction reveals something structural.
The test is not just a quality check. It is a context mechanism — a precise description of what success looks like before the agent begins. Willison’s Red/Green TDD pattern makes this explicit:
Write the test first and confirm it fails.
Let the agent implement until the test passes.
The test defines the context. The agent operates within it.
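In code, the Red/Green loop above might look like the following sketch. The function `parse_price` and its spec are invented for illustration; the point is only the order of operations.

```python
# Step 1 (Red): write the test first and confirm it fails before any
# implementation exists. The test IS the context: a precise description
# of success, handed to the agent before it begins.
def test_parse_price():
    assert parse_price("$1,299.00") == 1299.00
    assert parse_price("free") is None

# Step 2 (Green): the agent implements until the test passes.
def parse_price(text: str):
    """Parse a display price like '$1,299.00' into a float, or None."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Run the test before the implementation exists and it fails; that failure is the confirmation step Willison insists on before letting the agent loose.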
Practitioners who work this way consistently arrive at the same conclusion: the model is rarely the bottleneck. The bottleneck is what the model is given to work with: the context.
The Context Problem: What Breaks AI Products
A model does not experience ambiguity the way a human does.
For instance, a human encountering a half-formed request pauses or asks for clarification.
An LLM, on the other hand, fills the gap.
It takes whatever is in its context window, finds the most plausible completion, and returns output that looks finished. The problem is not that the model is wrong. The problem is that it does not know it is wrong.
Marily Nika calls this the failure signature: the pattern of breakdowns a feature reliably falls into when real users arrive.
Every AI feature has one. The teams that find it before launch deliberately push the model into its failure modes during development. The teams that do not find it discover it through support tickets.
Either way, the failure signature takes three distinct shapes:
Context overload occurs when the model receives more information than it can usefully process. Noise crowds out the signal, and the model treats everything with equal weight. A meeting notes tool fed an entire unstructured transcript will summarize the loudest voices, not the most important decisions.
Context gaps occur when the model lacks the information it needs and fills the absence with inference, drawn from its probability distribution rather than from facts. The customer support agent that confidently answers “Is this good?” without asking what “this” refers to is not malfunctioning. It is doing exactly what a model does when the context does not tell it what it does not know.
Context misalignment occurs when the model has information, but the wrong framing for the task. Marily’s Slack thread demonstration is precise here. The model was not missing content; it was missing the framing that distinguished decisions from noise. It imposed its own structure and returned a fabricated roadmap that looked authoritative.
These are not model failures. They are design failures. Tal Raviv and Aman Khan note that support tickets show a pattern of the AI “forgetting” facts mid-session, an issue called context rot: the steady loss of reliable behavior as the context window fills and earlier instructions lose their grip. That is not a bug to file. It is a product experience to design around.
Context Engineering Is Product Design
Context engineering is the practice of deliberately shaping what an agent observes at every step: its entire information environment. The goal is that the agent has what it needs to think, act, and recover, instead of producing confident nonsense under pressure. It is not prompt writing. Prompt writing is a sentence. Context engineering is an architecture.
That architecture works in three layers. Product leaders are making choices about these layers, even if they don’t view them as context decisions.
System instructions are the rules, constraints, and behavioral boundaries. These tell the model how to operate before any user input arrives. Marily Nika describes adding a single instruction to a Slack summarization tool: only assign an owner if someone explicitly volunteers. That one line immediately eliminated the product’s biggest trust issue. The fix was not a different model. It was a missing context decision.
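As a sketch, that kind of context decision is just a few lines in the system instructions. The tool name, rules, and message format below are hypothetical, modeled loosely on the Slack example:

```python
# Hypothetical system instructions for a thread-summarization feature.
# The behavioral constraint lives in the context, not in the model choice.
SYSTEM_INSTRUCTIONS = """\
You summarize Slack threads into decisions and action items.
Rules:
- Only assign an owner if someone explicitly volunteers ("I'll take this").
- If no one volunteered, write owner: "unassigned". Never infer an owner.
- If a decision is ambiguous, list it under "Open questions" instead.
"""

def build_messages(thread_text: str) -> list[dict]:
    """Assemble the context the model will actually see for one request."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": f"Summarize this thread:\n{thread_text}"},
    ]
```

The design choice worth noticing: the constraint is stated before any user input arrives, so it applies no matter what the thread contains.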
Retrieved knowledge covers what relevant information is pulled into the model’s context at query time, how much, and how it is structured before the model sees it. Tal Raviv and Aman Khan observe that output quality improves not because the model improves but because the context improves. The model is constant. What changes is what it sees.
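A toy illustration of that layer, with keyword overlap standing in for a real retriever and a character budget standing in for the context window:

```python
# Minimal retrieval sketch: the model is constant; what changes is what it
# sees. Keyword-overlap scoring is a stand-in for a real retriever.
def assemble_context(query: str, docs: list[str], budget_chars: int = 800) -> str:
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    picked, used = [], 0
    for doc in scored:
        if used + len(doc) > budget_chars:
            break  # cap how much the model sees: most relevant first
        picked.append(doc)
        used += len(doc)
    return "\n---\n".join(picked)
```

Both decisions here — what to pull in, and how much — are product decisions made before the model ever runs.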
Memory and history determine what the agent retains across turns and between sessions. When an agent loses track of an earlier instruction mid-session, the user experiences it as the product breaking. It is a context design failure, not a model limitation.
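One common design response to that failure, sketched here with a placeholder `summarize` function: pin the system instructions and compress older turns instead of letting the window fill.

```python
# Sketch of history management, assuming messages[0] holds the system
# instructions. `summarize` is a placeholder for whatever summarizer
# (a model call, a heuristic) the product uses.
def manage_history(messages: list[dict], max_turns: int = 20, summarize=None):
    system, turns = messages[0], messages[1:]
    if len(turns) <= max_turns:
        return messages
    old, recent = turns[:-max_turns], turns[-max_turns:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    # System instructions stay pinned; old turns collapse into a summary.
    return [system, summary] + recent
```

The user never sees this machinery; what they experience is simply a product that still remembers what they said an hour ago.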
These three layers map directly onto decisions made during every AI feature build: data access scope, system prompt structure, and when to ask a clarifying question rather than let the model infer.
Addy Osmani captures the underlying principle: agentic engineering rewards people who know what good output looks like, because they can design the environment that produces it.
Agentic engineers call this context engineering. Product leaders have always called pieces of it feature scoping, guardrail definition, and UX constraints. The vocabulary has been different. The decisions have been the same.
Three Mental Models for Product Leaders
Understanding context as the primary determinant of AI product quality changes the questions you ask at every stage of development. These three mental models make that change practical.
Ask what the model sees before asking what it can do.
The right first question is not which model handles this task best. It is what the model will actually see when a real user triggers this feature in production:
A real query, not a curated one.
Partial context.
Unstated assumptions.
Intent the model will have to infer.
Tal Raviv and Aman Khan describe this as the core of AI product sense: anticipating what will be impactful and feasible requires understanding what the model sees at the moment it acts, not what it can do in a controlled demo.
Define Minimum Viable Quality before you define your feature.
Marily Nika establishes three thresholds every product leader should set before development begins:
Acceptable bar: the feature performs well enough for real users under typical conditions.
Delight bar: correction rates drop and the feature earns trust through consistency.
Do-not-ship bar: the failure rate at which the feature actively damages user trust.
MVQ also requires an honest cost envelope. For instance, a feature at $0.30 per user per month that drives retention is a straightforward decision. The same feature at $5 per user per month with unclear impact is a business problem that no amount of engineering will solve.
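Those thresholds can be made concrete as a simple gate over an offline eval. The numbers below are placeholders, not recommendations; set them per feature before development begins.

```python
# Illustrative MVQ gate, assuming an offline eval that yields a task
# success rate between 0 and 1. Thresholds are placeholders.
DO_NOT_SHIP = 0.80   # below this, the feature actively damages trust
ACCEPTABLE  = 0.90   # good enough for real users under typical conditions
DELIGHT     = 0.97   # correction rates drop; trust compounds

def mvq_verdict(success_rate: float) -> str:
    if success_rate < DO_NOT_SHIP:
        return "do not ship"
    if success_rate < ACCEPTABLE:
        return "not yet: below the acceptable bar"
    if success_rate < DELIGHT:
        return "ship"
    return "ship and expand"
```

The value of writing it down is that the do-not-ship bar becomes a commitment made before launch pressure arrives, not a debate held under it.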
Build the adversarial ritual into your launch process.
Before any AI feature ships, push it into the conditions that will break it. Marily runs three stress tests in under fifteen minutes:
Feed it chaotic input.
Give it an ambiguous request.
Assign it something deceptively hard.
What comes back is not a pass or fail. It is a product requirement — a missing constraint, an underspecified instruction, a clarifying question the UX should ask instead of letting the model infer.
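The ritual can even live in a small script so it runs before every launch. `ask_model` and the prompts below are placeholders for whatever client and domain your product uses; a human still reviews the output.

```python
# Hedged sketch of the fifteen-minute ritual as a repeatable harness.
# `ask_model` is a placeholder for your product's model client.
STRESS_PROMPTS = {
    "chaotic input": "notes frm standup!!! jake?? ship the thing, budget???",
    "ambiguous request": "Is this good?",
    "deceptively hard": "Summarize the decisions in this thread: (empty)",
}

def run_stress_tests(ask_model) -> dict:
    findings = {}
    for name, prompt in STRESS_PROMPTS.items():
        reply = ask_model(prompt)
        findings[name] = {
            "reply": reply,
            # Crude heuristic: did the model ask instead of inferring?
            "asked_clarifying_question": "?" in reply,
        }
    return findings
```

What the harness surfaces feeds straight back into the layers above: a missing system instruction, a retrieval gap, or a clarifying question the UX should ask.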
Closing
Return to the team whose AI broke in production. They were not asking the wrong questions about their model. They were asking the wrong question entirely.
The question was never “what can our model do?” It was always “what does our model see?”
That change, from capability to context, is what agentic engineering worked out through practice rather than theory. Practitioners hit the walls, inspected the tool calls, watched the context window fill, and arrived at the same conclusion repeatedly: the model was not the problem.
The environment the model was operating in was.
Simon Willison, Marily Nika, Tal Raviv, and Aman Khan each arrived here from different directions. The conclusion is the same.
The model is not your product. The context you give it is.