Adaline Labs

Reliable Tool-Using AI Agents In Production: MCP, State, Retries, Timeouts, and Recovery

Nilesh Barla — Sat, 25 Apr 2026 00:01:16 GMT

TLDR: Getting an agent to call a tool is the easy part. The hard part is what happens when that tool hangs, partially succeeds, or mutates external state in a way the model cannot recover from on its own. This article covers five runtime mechanisms that determine whether a tool-using agent survives production. You will learn how to classify tool risk by state type, how to retry safely using idempotency keys, how to set timeouts per tool rather than per system, where to place approval gates before irreversible writes, and how to design recovery into the workflow before the first failure occurs. If you are building or evaluating an agentic system, the reliability gap is not in the model. It is in the runtime layer around it.

Tool Calling Is Not the Hard Part

The hard part is not getting an agent to call a tool. Every agent that reaches a demo can do that. The hard part is what happens next, i.e., when a tool hangs, returns partial results, mutates state, or leaves the workflow in a condition the model cannot resolve on its own.

Tool calling is what moves agents from answering questions to taking actions. MCP sets the standard for how those tools are exposed and invoked. But neither addresses what production demands: a runtime that survives tools that fail partway, time out, or create side effects that a retry makes worse.

OpenAI’s sandbox documentation separates orchestration from execution because the two layers have different problems. Anthropic’s managed-agents essay frames the same split between the “brain” and the “hands.” Both point at the same fact: the model gets you to the first successful tool call; the runtime decides whether the workflow survives everything after it.

Anthropic's Managed Agents architecture: the Harness (Claude) is decoupled from the Session, Sandbox, and Tools. Each component can fail or be replaced independently. | Source: Anthropic Engineering

This article covers five things that determine reliability for agentic LLMs in production: state type, retries, timeouts, approvals, and recovery. None are model problems. All are runtime problems.

What Changes When an Agent Uses Tools in Production

A one-shot tool call is simple by design. The agent queries an API, gets a result, and generates a response. Failure resets to zero without damage.

Production workflows are built differently. Once an agent calls tools across a multi-step sequence, it touches mutable systems. For instance,

A call at step three changes the state that step four reads.
A timeout at step five leaves the system in a condition that the model cannot sort out on its own.
A partial failure at step seven may have already sent the email, updated the record, or triggered an external job that cannot be canceled.

OpenAI’s sandbox guide treats execution as a stateful workspace with persistence and tool artifacts.
Anthropic’s managed-agents writeup makes the same point: longer-lived work needs structured execution surfaces, not raw chat continuity.

In production agentic systems, what breaks is the boundary around each tool:

A write fails halfway.
A context break corrupts a later step.
Nondeterministic failures pile up across a workflow built only for the happy path.

Runtime design handles all of these. Model fluency does not.

MCP Sets the Interface; the Runtime Owns the Rest

MCP standardizes how AI applications connect to tools and data sources. It governs the interface — not what happens inside the execution once a tool is called. | Source: modelcontextprotocol.io

The MCP Product Playbook describes MCP as a standard interface between models and tool providers. That is exactly what the MCP specification does:

It defines how tools are exposed, described, and invoked.
It handles discovery, schema, and transport.
It does not handle what happens when a tool times out, when a write is retried in an unsafe way, or when the model must decide if a failed call means the action ran.

Standard access is the first step and not a guarantee of safe execution. The runtime still owns permissions, retry logic, timeout rules, approval gates, artifact storage, and recovery paths.

The tool-calling functions layer defines how tools are described to the model. The product control plane governs how they run and how state is tracked across steps. Prompt management controls what the model sees; the runtime controls what it does.

Both OpenAI and Anthropic treat standard access and safe execution as separate layers. Conflating them is how production reliability becomes an afterthought.

Stateful vs. Stateless Tools

Not every tool carries the same risk. The line that matters most in production is not what a tool can do — it is what a tool changes.

Stateless tools read or compute without touching anything outside the agent’s context. A web search, a CRM record lookup, a file read, or a database query all fit here. If they fail, retry them freely. The cost is latency, nothing more.

Stateful tools write to the world outside the agent. Sending an email, updating a CRM record, merging a pull request, creating an invoice, and publishing content all change external state in a way that reads never do. Once execution begins, a failure does not undo what has already run. The email may already be sent. The invoice may already exist.

This is the line the tool orchestration layer must hold. Different tools require different handling, such as retry rules, idempotency requirements, and fallback paths. Sub-agents that each own a distinct tool set make this boundary clear, rather than running all actions through one loop with no risk distinction.

The problem is the gap between tools you can retry freely and tools you cannot.
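One way to hold this line in code is a small registry that records each tool's state type and derives its handling from that. The sketch below is illustrative only: the tool names, the `ToolSpec` fields, and the policy values are assumptions made for the example, not part of MCP or any vendor SDK.

```python
from dataclasses import dataclass
from enum import Enum


class StateType(Enum):
    STATELESS = "stateless"  # reads or computes; safe to repeat
    STATEFUL = "stateful"    # writes to the world outside the agent


@dataclass(frozen=True)
class ToolSpec:
    name: str
    state_type: StateType
    reversible: bool = True  # only meaningful for stateful tools


# Hypothetical registry; a real one would list every tool the agent can call.
REGISTRY = {
    "web_search": ToolSpec("web_search", StateType.STATELESS),
    "crm_lookup": ToolSpec("crm_lookup", StateType.STATELESS),
    "crm_update": ToolSpec("crm_update", StateType.STATEFUL, reversible=True),
    "send_email": ToolSpec("send_email", StateType.STATEFUL, reversible=False),
}


def handling_for(tool_name: str) -> dict:
    """Derive retry and approval handling from the tool's state type."""
    spec = REGISTRY[tool_name]
    if spec.state_type is StateType.STATELESS:
        return {"retry": "free", "idempotency_key": False, "approval": False}
    return {
        "retry": "guarded",
        "idempotency_key": True,
        "approval": not spec.reversible,
    }
```

Keeping the classification in data rather than in prompt text means the runtime, not the model, decides how a failed call is handled.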

Retries and Timeouts Are Workflow Decisions, Not Infra Defaults

Retries look like infrastructure. In practice, they are workflow decisions with consequences that users see.

For stateless tools, retry logic is simple: if the call fails, try again with backoff and jitter. AWS’s Builders’ Library guidance on timeouts and retries applies directly. For stateful tools, the question is harder.

Was the action done before the failure, or not?

A network timeout after a write does not tell you whether the write went through. Retrying without a guard could run the same action twice.

Stripe’s idempotency model handles this with idempotency keys: each request carries a unique ID, so a retry returns the original result instead of creating a duplicate.

AWS’s guidance on making retries safe applies the same idea to distributed APIs. The pattern transfers directly: attach a unique operation ID to each stateful call, and let the downstream system deduplicate on that key.
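A minimal sketch of that pattern, under stated assumptions: the operation ID is derived from a run ID, step number, tool name, and arguments, and `FakeInvoiceAPI` is an in-memory stand-in for a downstream service that deduplicates by key. It is not Stripe's or AWS's API, and the field names are hypothetical.

```python
import hashlib
import json


def operation_id(run_id: str, step: int, tool: str, args: dict) -> str:
    """Stable ID: the same logical call always maps to the same key."""
    payload = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FakeInvoiceAPI:
    """In-memory stand-in for a service that deduplicates on the key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.invoices_created = 0

    def create_invoice(self, key: str, amount_cents: int) -> dict:
        if key in self._results:  # retry of a call that already ran
            return self._results[key]
        self.invoices_created += 1
        result = {"invoice_no": self.invoices_created, "amount_cents": amount_cents}
        self._results[key] = result
        return result


api = FakeInvoiceAPI()
key = operation_id("run-42", 3, "create_invoice", {"amount_cents": 1200})
first = api.create_invoice(key, 1200)
retry = api.create_invoice(key, 1200)  # e.g. after a timeout with unknown outcome
```

Because the key is derived from the run and step rather than generated fresh on each attempt, a retry after a crash or restart still presents the same key.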

Idempotency handles the retry problem. But retries only trigger when the system knows a call failed. Timeouts introduce a harder case: the call ended, but you do not know whether it succeeded. One timeout setting across all tools is not a policy; it is a default that creates failure modes the agent was not built to handle. The right cutoff depends entirely on what normal looks like for that tool:

A fast-read API should cut off after 2 seconds.
A code sandbox may need twenty.
A document pipeline may need two minutes.

Each tool needs its own timeout, matched to its own normal runtime.
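As a rough sketch, per-tool timeouts can live in a lookup table that the runtime consults on every call. The tool names, the budgets (taken from the examples above), and the `DEFAULT_TIMEOUT` fallback are assumptions; the only library call used is the standard `asyncio.wait_for`.

```python
import asyncio

# Hypothetical per-tool budgets in seconds.
TIMEOUTS = {"fast_read": 2.0, "code_sandbox": 20.0, "document_pipeline": 120.0}
DEFAULT_TIMEOUT = 5.0  # assumption: fallback for tools without an entry


async def call_with_timeout(tool_name, coro_fn, *args, timeouts=TIMEOUTS):
    """Run one async tool call under that tool's own timeout."""
    budget = timeouts.get(tool_name, DEFAULT_TIMEOUT)
    try:
        result = await asyncio.wait_for(coro_fn(*args), timeout=budget)
        return {"status": "success", "result": result}
    except asyncio.TimeoutError:
        # The call was cut off; whether the tool acted is not known here.
        return {"status": "timeout", "result": None}
```

`asyncio.wait_for` cancels the local task when the budget runs out, but that says nothing about what the remote side did, which is why the timeout is reported as its own status rather than as a failure.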

Four rules apply across both:

Retry reads freely; use idempotency keys for all stateful writes. Meaning: attach a unique operation ID so the downstream system can deduplicate rather than run it twice.
Track four outcomes: success, explicit failure, timeout, and unknown. Treat unknown as requiring review, not the same as failure.
Decide before launch which failures auto-retry, which escalate, and which stop the run.
Surface retry counts in your traces, because a tool that succeeds only on its third attempt is a sign that the product is breaking in production before users notice.
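The rules above can be sketched as a small decision function that is fixed before launch. The outcome names follow the four listed above; the action labels (`done`, `retry`, `review`, `stop`), the attempt limit, and the backoff parameters are assumptions for illustration.

```python
import random
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"  # the tool reported an explicit error
    TIMEOUT = "timeout"  # cut off before a response arrived
    UNKNOWN = "unknown"  # no evidence of whether the action ran


def next_action(outcome: Outcome, stateful: bool, has_idempotency_key: bool,
                attempt: int, max_attempts: int = 3) -> str:
    """Return 'done', 'retry', 'review', or 'stop' for one tool call."""
    if outcome is Outcome.SUCCESS:
        return "done"
    if outcome is Outcome.UNKNOWN:
        return "review"  # never treated as a plain failure
    if outcome is Outcome.TIMEOUT and stateful and not has_idempotency_key:
        return "review"  # the write may have gone through; do not repeat it
    if attempt >= max_attempts:
        return "stop"
    return "retry"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with jitter; attempt numbering starts at 1."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```

For example, `next_action(Outcome.TIMEOUT, stateful=True, has_idempotency_key=False, attempt=1)` escalates to review, while the same timeout on a read is simply retried after `backoff_delay(1)` seconds.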

Adaline’s Deploy overview and CI/CD integration connect here: pipelines that test agent behavior across environments need to know which tools are retry-prone before those patterns hit real traffic.

Recovery Requires Checkpoints, Artifacts, and a Clear Next Step

Retry logic prevents some failures from worsening. It does not cover the case where the workflow must stop, save its state, and either resume or hand off.

OpenAI’s sandbox model treats stateful workspaces as a core design element: the runtime holds files, outputs, and mid-step results so a failed run does not restart from scratch. Anthropic’s managed-agents essay makes the same point: execution surfaces must support checkpoint-and-resume rather than using raw chat context to rebuild what happened.

Recovery is not an error handler. It is a design decision made before the first run. Where to place checkpoints depends on which steps are costly to re-run and which are hard to undo. Persistent state across steps lets the system pick up at the right point without redoing completed writes.
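A minimal checkpoint-and-resume sketch, assuming steps are named functions whose results are JSON-serializable and that a local file is an acceptable store. The function names and file layout are hypothetical; a production runtime would more likely use a database or the workspace its sandbox provides.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, completed: list, artifacts: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"completed": completed, "artifacts": artifacts}, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return saved state, or an empty state when no checkpoint exists."""
    if not os.path.exists(path):
        return {"completed": [], "artifacts": {}}
    with open(path) as f:
        return json.load(f)


def run_workflow(steps, path: str) -> dict:
    """steps: ordered (name, fn) pairs; fn takes the artifacts dict."""
    state = load_checkpoint(path)
    for name, fn in steps:
        if name in state["completed"]:
            continue  # already ran; do not redo the write
        state["artifacts"][name] = fn(state["artifacts"])
        state["completed"].append(name)
        save_checkpoint(path, state["completed"], state["artifacts"])
    return state
```

A crash between a step's side effect and the checkpoint write still leaves that step's outcome unknown, so stateful steps need idempotency keys even with checkpoints in place.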

The choice between re-plan and hand-off matters. Review loops in coding agents show this clearly: some failures mean the plan needs to change; others mean the run should stop and surface its state to a human. Knowing which applies before the run starts is what keeps a failure recoverable. Deploying your prompt ties this to runtime snapshots, diffs, and rollback history.

Approvals Belong at High-Risk State Transitions

Not every tool call needs a human in the loop. But some should never run without one.

Google ADK’s human-input documentation treats human input as a workflow step for decision checks and permissions, not a safety net added after the fact. Approval gates are workflow boundaries, not general AI safety measures.

The tools that need approval share one trait: they create state changes that are hard to undo. Sending a customer email, merging a pull request, publishing content, creating an invoice, or deleting a record all belong here. Permissions and handoffs between agents, or between an agent and a human, are first-class concerns.

Sub-agents that handle delegated tasks need approval rules set before the task starts, not at runtime. Behavioral constraints in AI PRDs make the same point: failure limits and approval rules must be in the spec before a feature ships, not left as undefined behavior.
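An approval gate can be sketched as a wrapper around tool execution. The gated tool names mirror the examples above, and the `approve` callback stands in for whatever review surface a team uses; both are assumptions, not an API from Google ADK or any other framework.

```python
from typing import Callable

# Hypothetical list of tools whose effects are hard to undo.
REQUIRES_APPROVAL = {
    "send_customer_email",
    "merge_pull_request",
    "publish_content",
    "create_invoice",
    "delete_record",
}


def execute(tool_name: str, args: dict, tool_fn: Callable,
            approve: Callable[[str, dict], bool]) -> dict:
    """Run tool_fn, asking for approval first when the tool is gated."""
    if tool_name in REQUIRES_APPROVAL and not approve(tool_name, args):
        return {"status": "rejected", "result": None}
    return {"status": "executed", "result": tool_fn(**args)}
```

Because the gated set is plain data, it can be written into the spec and reviewed before the task starts instead of being decided by the model mid-run.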

Observability Makes Reliability Measurable

Adaline's trace view showing a complete agent execution: every span from RAG retrieval to tool calls to final response, with per-step timing and a total cost of $0.0017. This is what runtime visibility looks like in practice.

Retries, timeouts, checkpoints, and approval gates are mechanisms. Without visibility into what actually ran, in what order, with what inputs and outputs, those mechanisms operate on guesswork.

Observability vs monitoring for agentic systems is not the same problem as watching a stateless API. A stateless API either responded or it did not. A tool-using agent has a multi-step trace in which any step can fail, retry, time out, partially succeed, or pause for approval. The final output tells you almost nothing about what happened in the middle.

What needs to be visible is every tool call, with its inputs and outputs, retry counts, timeout events, approval triggers, state changes, and the recovery path taken. That trace is not debugging overhead. It is the layer that turns retry rules and timeout settings into something you can measure and improve.
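As an illustration of what one entry in such a trace might hold, the sketch below records a span per tool call. The `ToolSpan` fields and the `traced_call` helper are hypothetical and do not reflect Adaline's trace schema or any tracing library. The retry loop assumes a stateless tool, since it repeats the call on any error.

```python
import time
from dataclasses import asdict, dataclass


@dataclass
class ToolSpan:
    tool: str
    inputs: dict
    output: object = None
    status: str = "pending"
    attempts: int = 0
    timed_out: bool = False
    duration_s: float = 0.0


def traced_call(trace: list, tool: str, fn, inputs: dict, max_attempts: int = 3):
    """Call a stateless tool with retries and append one span to the trace."""
    span = ToolSpan(tool=tool, inputs=inputs)
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        span.attempts = attempt
        try:
            span.output = fn(**inputs)
            span.status = "success"
            break
        except TimeoutError:
            span.timed_out = True
            span.status = "timeout"
        except Exception as exc:
            span.status = "failure"
            span.output = repr(exc)
    span.duration_s = time.monotonic() - start
    trace.append(asdict(span))
    return span
```

With spans like these, a tool that keeps succeeding on attempt three shows up as a pattern in `attempts` long before it shows up as a user complaint.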

LLM observability at the production level includes distributed tracing, per-request visibility, and anomaly detection. AI agent evaluation connects pre-launch testing to production monitoring. Essentially, behaviors you test before release need to be tracked after it, because real traffic finds edge cases no test suite fully covers.

Reliable Tool-Using Agents Are Built at the Runtime Layer

Every agent that reaches a demo can call tools. What separates a solid system from a fragile one is what happens after that first call: can the runtime classify tool risk, retry safely, hold per-tool timeouts, preserve state through failure, gate irreversible writes, and keep the full trace visible?

PromptOps, Iterate, Deploy, and the full Adaline platform connect to exactly this: reliability is not a feature you add once the agent works. It is the layer you design first and build the agent on top of.

How To Evaluate Coding Agents In Production: Metrics, Failure Modes, And Review Loops

Nilesh Barla — Sat, 18 Apr 2026 00:01:42 GMT

TLDR: Benchmark scores don't reflect production reliability. To evaluate coding agents in real engineering environments, teams need four specific metrics: task completion rate, regression introduction rate, review loop count, and blast radius on failure. They also need a failure mode taxonomy to design tests around, a structured three-stage review loop, and a lightweight eval dataset built from real production tasks. The teams that build this early move faster later. They can swap models or change prompts with confidence.

Every coding agent demo looks impressive. The agent takes a feature request, navigates the codebase, writes a working diff, and the tests pass. If you're still choosing between agents, see our Claude Code vs OpenAI Codex comparison before building your eval framework around a specific tool.

What you don’t see is what happens weeks later. The same agent takes a production task and quietly introduces a regression in a module it was never asked to touch.

Teams evaluating coding agents in production are discovering something important. Demo performance and production reliability measure different things entirely.

Benchmark suites capture capability under controlled conditions.
Production work happens in messy, evolving codebases, with half-documented APIs, test suites that don’t cover everything, and context that no benchmark has ever encountered.

This blog covers the following:

The four metrics that matter.
The five failure modes worth designing tests around.
How to build a review loop that improves over time.
How to construct an eval dataset from real work.

Learn more about LLM and agent evaluation here.

Why Benchmark Scores Don’t Transfer to Production

SWE-bench is the most commonly cited benchmark for coding agents. It measures whether an agent can resolve real GitHub issues on open-source repositories. That’s a genuinely useful signal for comparing models. But it’s not what production looks like.

A March 2026 study by METR found that roughly half of test-passing SWE-bench PRs would not be merged by actual repo maintainers. The automated grader scores are, on average, 24.2 percentage points higher than what maintainers actually accept.

Both automated grader scores (orange) and maintainer merge rates (blue) improve as models improve — but the gap between them stays wide. The average difference across all models is 24.2 percentage points. | Source: METR, March 2026.

That gap is the benchmark-to-production problem made concrete.

Single-turn evals grade a response. Agent evals have to verify an outcome. The grading logic is fundamentally different. | Source: Demystifying evals for AI agents, Anthropic Engineering, January 2026.

SWE-bench tasks come with a complete repository context, a clear problem statement, and a test suite that validates the fix. Production tasks arrive with ambiguous requirements, partially documented dependencies, and internal libraries with no public docs.

Scale AI’s SWE-bench Pro shows how sharp this issue is. Top frontier models that score 80%+ on Verified fall below 25% on Pro tasks. Those tasks require multi-file reasoning across unfamiliar repositories. That’s closer to what production actually demands.

SWE-bench Pro uses contamination-resilient curation from commercial repos. Resolve rates drop significantly on commercial codebases compared to public ones — GPT-5 falls from 23.3% to 14.9%, Opus 4.1 from 22.7% to 17.8%. | Source: Scale AI SWE-bench Pro

There’s a second structural problem. Benchmark evaluators measure outputs, not processes.

A coding agent that reaches the right answer by making up intermediate steps isn’t a reliable tool. It’s a fragile one. The benchmark score doesn’t capture how it got there. It doesn’t capture what it ignored, or whether the same reasoning chain holds on a problem that’s 10% different.

This effect is made worse by test-time scaling in frontier models. Longer reasoning chains improve accuracy on isolated tasks. But they don’t fix what actually matters in production: the agent still has no memory of your codebase, no awareness of your team’s conventions, and no model of which parts of your system are load-bearing.

Benchmarks aren’t useless. They help you eliminate obviously weak models. But once you’ve made an initial selection, the evaluation that actually matters happens in your codebase, on your tasks, with your review process in the loop.

The Four Metrics That Actually Matter

Production eval for coding agents requires tracking four numbers. Two measure output quality, one measures process efficiency, and one measures downside risk.

Task completion rate is the percentage of tasks the agent completes correctly. The definition matters: a completion means a diff that passes your test suite, builds cleanly, and requires no correction before merge. A partially working diff that a human has to edit is not a completion. Teams that use a loose definition tend to overestimate their agent’s reliability by 20–30 percentage points.
Regression introduction rate is the percentage of completed tasks where the agent modifies code outside the specified scope and introduces a bug. This is the number most teams miss in their initial evals. An agent that completes 80% of tasks but introduces regressions in 15% of those completions is a net negative. The debugging time erases the output gain.
Review loop count is the average number of human correction cycles before a task output is merge-ready. A healthy baseline for a well-scoped task is one cycle. If your agent requires two or more, the issue is almost always prompt quality or context framing. That number tells you exactly where to iterate.

Faros AI’s analysis of 10,000 developers found that high AI adoption teams merged 98% more PRs but saw review time increase by 91%. There was no measurable gain in organizational delivery. The output gain was absorbed entirely by review overhead.
Collecting this metric requires agent observability tooling. Log each review cycle as a discrete event, not just the final accepted output.
Blast radius on failure measures how much of the codebase is touched when an agent task goes wrong. For instance, a contained failure modifies two files. But a poorly scoped task can cascade across eight modules. That happens when the agent infers imports instead of confirming them. Tracking blast radius gives you data to design better scoping policies before you scale, not after the first multi-module incident.

Collecting these metrics requires logging from day one. Every agent task should generate a structured log: task description, files touched, test results before and after, review cycle count, and final merge decision.

The early data sets your baseline. Don’t wait until you’re scaling to add it.
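A sketch of how the structured log and the four metrics could fit together. The `TaskLog` field names are assumptions; in particular, `completed` and `introduced_regression` are flags a team would set itself at review time or after an incident, and blast radius is approximated here as the mean number of files touched by tasks that did not complete.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskLog:
    description: str
    files_touched: list
    tests_pass_before: bool
    tests_pass_after: bool
    review_cycles: int
    merged: bool
    completed: bool              # strict definition: no correction before merge
    introduced_regression: bool  # hypothetical flag, set when a bug is traced back


def summarize(logs: list) -> dict:
    """Compute the four metrics; assumes at least one log entry."""
    completed = [t for t in logs if t.completed]
    merged = [t for t in logs if t.merged]
    failed = [t for t in logs if not t.completed]
    return {
        "task_completion_rate": len(completed) / len(logs),
        "regression_introduction_rate": (
            sum(t.introduced_regression for t in completed) / len(completed)
            if completed else 0.0
        ),
        "review_loop_count": mean(t.review_cycles for t in merged) if merged else 0.0,
        "blast_radius": mean(len(t.files_touched) for t in failed) if failed else 0.0,
    }
```

Applied to the earlier example, an agent with an 80% completion rate and a 15% regression rate among those completions ships a regression in roughly 12 of every 100 tasks (0.80 × 0.15 = 0.12).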

The Five Failure Modes to Design Tests Around

Building an eval dataset without a failure taxonomy is like writing tests without knowing what could break. These five failure modes cover most of what goes wrong with coding agents in real engineering environments.

Context blindness occurs when the agent operates on a wrong or incomplete model of the codebase. It writes code referencing APIs or variable names that don’t exist in the current project version. This happens because the context window holds only the files you provided. The dependency it needs is two or three levels away.
Context rot makes this significantly worse. As context grows, the agent follows instructions less reliably. Multi-step tasks are especially vulnerable.
Instruction drift is the multi-step version of context blindness. The agent begins executing a clear task but gradually shifts its reading of the goal. By step seven of a twelve-step refactor, it’s optimizing for a slightly different target than the one stated at step one.
A January 2026 paper formalizes this as “semantic drift.” The paper documents that unchecked drift reduces task completion accuracy and increases human intervention rates in production systems.
Semantic drift reaches nearly 50% incidence at 600 tokens of context — far earlier than most teams expect. Coordination and behavioral drift follow the same curve. | Source: arXiv:2601.04170
Silent regression is the costliest failure mode. It doesn’t surface at review time. The agent completes the requested task correctly but makes an incidental change to a shared utility or config file. That change introduces a bug. The bug won’t appear until another part of the system is affected in production.
Columbia’s DAPLab studied five coding agents across 15+ applications and found a consistent pattern. Agents “prioritize runnable code over correctness,” suppressing errors to make output appear functional rather than flagging the failure.
Scope creep occurs when the agent infers that the task requires more changes than were requested. It makes those changes without flagging them. Unlike silent regression, these extra changes are deliberate. The agent decided they were needed. The inference is often wrong. The review process focuses on the requested change but misses the additions that weren’t requested.
The hallucinated API surface is the easiest failure mode to detect. The agent calls methods, imports packages, or references config keys that don’t exist. This usually surfaces in CI right away. But it generates an outsized debugging cost. That cost grows when the hallucination is a near-miss: a method name off by one character from a real one.

Designing tests around these failure modes means constructing tasks that stress each one specifically.

Test context blindness with tasks that require files not in the default context. Test instruction drift with multi-step refactors. Test silent regression by running your full test suite after every agent task, not just the tests adjacent to the change.

How to Design Your Review Loop

The review loop is where evaluation becomes operational. Every coding agent deployment needs a structured process with explicit stages and decision criteria. “Someone should look at this” is not a process.

A three-stage loop works for most engineering teams.

Stage one is automated.
CI runs immediately on every agent-produced diff. It covers the build, unit tests, and integration tests. No human reviews a diff that fails CI.

This isn’t novel. Google’s engineering practices documentation has established automated gates as a baseline for any serious code review process. But teams skip this stage when moving fast. Faros AI’s 2026 data across 22,000 developers found that 31% of PRs are already merging with no review at all. That’s where silent regressions accumulate at scale.

Stage two is scoped human review.
A reviewer checks three things.

First: whether the agent’s changes are contained to the intended scope. Second: whether any out-of-scope files were changed correctly. Third: whether the approach the agent took is the one the team would have taken.

The third question is the one most reviewers skip. They check for correctness rather than coherence. But approach divergence is how teams build up technical debt. Agent-generated code that works today creates refactoring work six months from now.
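The first of those three checks, scope containment, can be partly automated before a reviewer opens the diff. The sketch below assumes the task declares its allowed paths as glob patterns and that the list of changed files comes from something like `git diff --name-only`; the function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list, allowed_patterns: list) -> list:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        f for f in changed_files
        if not any(fnmatchcase(f, p) for p in allowed_patterns)
    ]


flagged = out_of_scope(
    ["src/billing/invoice.py", "src/utils/dates.py", "config/app.yaml"],
    ["src/billing/*", "tests/billing/*"],
)
```

Here `flagged` contains `src/utils/dates.py` and `config/app.yaml`, the two files the reviewer should examine for the second check. Judging whether the approach is the one the team would have taken still needs a human.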

Stage three is feedback capture. Every correction should be logged and tagged by failure mode. That means reverts, edits, and notes added to the task description.

This turns the review loop into a compounding asset. The corrections become the signal for prompt improvement, context window design, and task scoping. Teams that do this find their review loop count drops within four to eight weeks.

For teams where production reliability is a first-class concern, this loop plugs into your existing code review setup. You’re not building a parallel process. You’re adding structure to one that already exists.

How to Build a Lightweight Eval Dataset from Production

An eval dataset built from synthetic tasks measures what you designed it to measure. That’s often not what actually fails in your codebase. The more reliable path is to mine your real task history.

Collect the last 30–50 coding agent tasks your team has run. Include the final accepted diff and every correction made during review. Include any CI failures that occurred before acceptance. If you don’t have this logged yet, start logging now and run this exercise in four weeks. Don’t wait for synthetic examples. Start with whatever real tasks you have, even if it’s only ten.
Tag each task by the failure mode it encountered. Some tasks will be clean completions. Many will have at least one failure. Tasks that hit multiple failure modes in a single run are your most valuable eval cases. They show how failure modes compound in ways that isolated testing won’t surface.
Split the tagged dataset into two sets. The first is a dev set for iterating on prompts and context design. The second is a held-out set you run only when making a significant change: a new model, a new system prompt, or a major context window restructure. Running your full eval on every small change produces overfitting. Your prompts start passing tests without improving on genuinely new tasks.
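One way to keep the split stable as the dataset grows is to assign each task by hashing its ID, so a task never moves between sets from one run to the next. This is a sketch under assumptions: the 30% held-out share and the `assign_split` name are illustrative and not prescribed anywhere above.

```python
import hashlib


def assign_split(task_id: str, held_out_percent: int = 30) -> str:
    """Deterministically place a task in 'dev' or 'held_out' by its ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    return "held_out" if bucket < held_out_percent else "dev"


tasks = [f"task-{i}" for i in range(50)]
dev = [t for t in tasks if assign_split(t) == "dev"]
held_out = [t for t in tasks if assign_split(t) == "held_out"]
```

A random shuffle would work for a one-time split, but hashing keeps earlier assignments fixed when new tasks are appended, which protects the held-out set from leaking into prompt iteration.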

This is the foundation of evaluating AI agents in a way that transfers to production. A dataset built from real failures, tagged by failure mode, and split correctly gives you the signal to improve with real confidence.

Final Thoughts

Evaluation is often treated as a one-time setup. Something you do before you deploy and revisit only when something breaks. That framing is exactly backward.

The eval dataset you build from your first thirty tasks becomes more valuable over time. The fiftieth and hundredth tasks reveal patterns that the early data didn’t surface. The review loop generates feedback that compounds into better prompt design. The failure mode taxonomy sharpens as your team develops intuition about which failure modes your codebase makes most likely.

The teams that build this early don’t just run their current model better. They can swap models, change prompts, and scale with genuine confidence. They have the logging to know, with evidence, whether things got better or worse.

That confidence is the actual product of evaluation. The metrics and the tests are how you earn it.

The Missing Product Layer for Multi-Agent Systems

Nilesh Barla — Sat, 11 Apr 2026 00:01:16 GMT

TLDR: Only 1 in 10 agentic AI use cases reached production last year, and the cause is not model capability. A better model will not close the gap. What is missing is the governance layer above the models: who can do what, when to delegate, what humans can see, and how to recover. This article introduces the Four Control-Plane Primitives (permissions, handoffs, visibility, and recovery) and walks through what each one means for AI PMs and engineers before a multi-agent workflow ships. If your PRD does not define delegation boundaries and escalation conditions, it is not ready for a multi-agent workflow.

When one agent becomes five, the problem changes. You are no longer just designing outputs. You are designing permissions, handoffs, visibility, and trust. And most teams discover this only after they've shipped.

Multi-agent systems are AI architectures in which multiple specialized agents collaborate toward a shared goal. Each agent handles a distinct subtask, calls its own tools, and operates within its own context window, while a coordinating layer routes work between them.

Gartner named multi-agent systems a top 10 strategic technology trend for 2026. They predicted that 40% of enterprise applications will include task-specific agents by year’s end, up from less than 5% in 2025. Yet only one in ten agentic AI use cases reached production in the past year. The gap between prototype and production is not a model-capability issue but a governability issue.

The models are not the hard part. The hard part is building what sits above them, the layer that governs:

Who can do what, and when an agent can delegate.
How work transfers between agents.
What humans can see.
How the system recovers when something goes wrong.

This article calls that layer the product control plane. It proposes a practical framework built around four primitives every multi-agent product must get right, and walks through what that means for AI PMs writing requirements and engineers deciding what to instrument.

Why Single-Agent Product Thinking Breaks In Multi-Agent Systems

A single AI agent operates with a knowable mental model. It has one context window, one permission surface, one responsibility boundary, and one output for the user to evaluate.

When that agent behaves unexpectedly, the failure is usually traceable:

You can examine the prompt,
Inspect the tool calls, and
Identify where the reasoning went wrong.

The product surface area is bounded.

Multi-agent systems architecture is categorically different.

A January 2026 survey on orchestration and enterprise adoption described the orchestration layer as “the control plane of a multi-agent system, transforming autonomous components into a coherent, goal-directed collective.”

It warned that without it, “even highly capable agents risk duplication of effort, logical inconsistency, or unbounded autonomy that diverges from the system’s objectives”.

The unbounded autonomy problem is not theoretical. Anthropic’s analysis of agent behavior on their public API, published in early 2026, found that the 99.9th percentile session length grew from 10 to 40 minutes between October 2025 and January 2026. In the same period, the average number of human interventions per session dropped from 5.4 to 3.3. Both trends point in the same direction: agents are operating more autonomously for longer periods with less human contact. That is valuable. It is also the precise condition under which single-agent mental models break down entirely.

Agents are running significantly longer sessions with each model generation — a sign of growing autonomy, and a direct argument for stronger governance design. Source: Anthropic.

When a product team thinks of their system as “an assistant that uses tools,” they are designing for a world where one entity has full context and one person is watching. When that same system starts delegating to subagents, the complexity multiplies.

Consider this: each subagent has partial context, different tool access, and its own failure modes.

Every assumption embedded in the original design becomes a liability. Users cannot see the delegation chain. The PMs have no requirement for what happens when a subagent fails. The engineers have no instrumentation for handoff-level errors.

The product seems to work until it stops working for no apparent reason.

Delegation Changes The Product Surface Area More Than Most Teams Expect

Delegation sounds like a routing problem.

It is not.

Delegation is a transfer of authority, context, and responsibility across a trust boundary. And every one of those transfers expands the product surface area in ways that have to be explicitly designed for.

A February 2026 research paper on AI delegation mechanics put this clearly: once a multi-agent AI system delegates work to a subagent, the system must account for “the delegator’s degree of belief in the delegatee’s” reliability. That trust cannot simply be assumed. In practice, it has to be constructed through three decisions that teams routinely skip:

Task packaging. When a lead agent hands work to a subagent, it must decide what context to transfer. A subagent that receives too little context will misinterpret its scope. One that receives the wrong context will act on incorrect assumptions. Neither failure surfaces as an obvious error; both surface as outputs that are subtly but consequentially wrong.
Authority boundaries. The subagent needs to know what it is allowed to do independently and when it must escalate. Without explicit boundaries, subagents either become overly cautious, interrupting frequently and defeating the purpose of delegation, or overreach, taking actions the user never authorized.
Coordination overhead. Anthropic’s engineering team, in describing their multi-agent research system, noted that early versions made errors like “spawning 50 subagents for simple queries” and “scouring the web endlessly”. The orchestrator had no clear rules about when delegation was appropriate and when it was wasteful. The system behaved rationally within its local context and irrationally at the product level.

These three problems are not solvable with better prompts. They are solvable with better product design. That means specifying them before the first subagent is built.

The Four Control-Plane Primitives: Permissions, Handoffs, Visibility, Recovery

A production-ready multi-agent product needs four things to work together. Each is both a product decision and an engineering problem.

Permissions

Permissions define what each agent is allowed to do:

Which tools can it call?
Which data can it read or write?
Which actions can it initiate without asking for approval?

The failure mode when permissions are weak is not dramatic. It is quiet. An agent with excessive permissions takes actions that fall within its technical authority but outside the user’s intent.

An agent with insufficient permissions interrupts constantly and erodes the value of autonomy. And when permissions are not designed per-agent, the risk compounds.

When all agents in a chain inherit the same flat permission set, a single compromised or misconfigured subagent can propagate unauthorized actions through the entire chain.

The research on this is direct. A February 2026 paper on delegation mechanics argued that permission design must extend beyond binary access to semantic constraints, meaning “access defined not just by the tool or dataset, but by the specific allowable operations. For example, read-only access to specific rows, or execute-only access to a specific function”.

The same paper noted that permissions must be dynamic rather than static: “access rights are not static endowments but dynamic states that persist only as long as the agent maintains the requisite trust metrics.”

For PMs: permissions are a product and compliance decision, not a backend default. The permission surface of a multi-agent system determines what the product can do to a user’s data, systems, and environment without the user's consent. That is a business risk decision.

For engineers: implement least-privilege defaults at the subagent level. Each agent should receive only the tools and data access it needs for its specific task, not the full tool set of its orchestrator.
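
One way to picture least-privilege delegation is a deny-by-default grant set that can only narrow as work moves down the chain. The sketch below is illustrative: `Grant`, `AgentPermissions`, and the resource names are hypothetical, not part of any framework mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One allowed operation on one resource, e.g. ("read", "crm.contacts")."""
    operation: str
    resource: str


class AgentPermissions:
    """Deny-by-default permission set for a single agent (hypothetical)."""

    def __init__(self, agent_id: str, grants: frozenset[Grant]):
        self.agent_id = agent_id
        self.grants = grants

    def allows(self, operation: str, resource: str) -> bool:
        return Grant(operation, resource) in self.grants

    def delegate(self, child_id: str, requested: set[Grant]) -> "AgentPermissions":
        # A subagent can only receive grants its parent already holds,
        # so authority narrows (never widens) along the delegation chain.
        return AgentPermissions(child_id, frozenset(requested) & self.grants)


orchestrator = AgentPermissions(
    "orchestrator",
    frozenset({
        Grant("read", "crm.contacts"),
        Grant("write", "crm.contacts"),
        Grant("execute", "email.send"),
    }),
)

# The research subagent asks for read access plus a grant the parent lacks.
researcher = orchestrator.delegate(
    "researcher",
    {Grant("read", "crm.contacts"), Grant("write", "billing.invoices")},
)
```

Here `researcher` ends up with read access to `crm.contacts` and nothing else: the write grant was never requested, and the billing grant was never the orchestrator's to give.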

Handoffs

A handoff is the transfer of execution from one agent to another: from the orchestrator to a subagent, from one specialist to another, or from an agent back to a human.

Handoffs are the highest-risk moments in any multi-agent workflow because they combine three failure conditions at once:

Context may be incomplete,
Authority may be ambiguous, and
Neither agent may recognize that the transfer has gone wrong.

A March 2026 trace-based assurance framework for agentic AI orchestration identified five failure classes in multi-agent systems. Three of them manifest specifically at handoff boundaries: coordination failures such as loops and deadlocks, role drift in long-horizon workflows, and error propagation across agents.

The paper described handoffs as moments where “planner, verifier, and action roles may drift, loop, or deadlock across turn boundaries.”

The quality of context transferred at a handoff is ultimately a context engineering problem: what information the receiving agent needs, in what format, and at what level of compression. Get it wrong, and the subagent acts on incorrect premises with full confidence.

Anthropic’s auto mode for Claude Code addresses handoff risk directly, running safety classifiers at both ends of every subagent handoff: when work is delegated out and when results come back. The outbound check catches compromised or unauthorized delegation. The return check catches subagents that were benign at delegation but compromised mid-run by the content they retrieved. When the classifier flags repeatedly, the system escalates to human review.

Higher task autonomy demands higher security investment. Auto mode achieves strong autonomy with low ongoing maintenance friction, but sandboxing remains the highest-safety option for sensitive environments. Source: Anthropic.

For PMs: handoffs are product moments, not just engineering events. They involve responsibility transfer, potential user confusion, and invisible decisions. Specify what the system must communicate to the user when a handoff occurs, and under what conditions a handoff should require explicit approval.

For engineers: log every handoff with source agent, destination agent, task specification passed, and context transferred. Treat a handoff with incomplete context transfer as a failure event, not a warning.
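
A minimal sketch of that logging rule, assuming a workflow where every handoff must carry a fixed set of context keys. The key names, the `Handoff` record, and the in-memory log are all hypothetical stand-ins for whatever schema and log sink a team actually uses.

```python
import json
import time
from dataclasses import dataclass, field, asdict

# Hypothetical: the context keys this workflow requires at every handoff.
REQUIRED_CONTEXT_KEYS = {"user_goal", "constraints", "prior_results"}


class IncompleteHandoffError(Exception):
    """Raised when a handoff is missing required context."""


@dataclass
class Handoff:
    source_agent: str
    destination_agent: str
    task_spec: str
    context: dict
    timestamp: float = field(default_factory=time.time)

    def missing_context(self) -> set:
        return REQUIRED_CONTEXT_KEYS - set(self.context)


def record_handoff(handoff: Handoff, log: list) -> None:
    """Append a structured log entry; fail the handoff if context is incomplete."""
    missing = sorted(handoff.missing_context())
    entry = asdict(handoff)
    entry["status"] = "failed" if missing else "ok"
    entry["missing_context"] = missing
    log.append(json.dumps(entry))  # stand-in for a real log sink
    if missing:
        # Incomplete context is a failure event, not a warning.
        raise IncompleteHandoffError(f"missing context: {missing}")
```

The entry is written before the exception is raised, so the failed handoff still shows up in the log with the exact keys that were missing.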

Visibility

Visibility is the ability for users, PMs, engineers, and operators to understand what the system is doing and why. In a single-agent product, visibility is a nice-to-have. In a multi-agent system, it is the mechanism by which humans maintain meaningful oversight.

Anthropic’s framework for trustworthy agents identifies transparency as a structural requirement: “Humans need visibility into agents’ problem-solving processes. Without transparency, a human asking an agent to ‘reduce customer churn’ might be baffled when the agent starts contacting the facilities team”. That example is not abstract. Without step-level visibility, users cannot assess whether the agent is pursuing the right strategy, and they cannot intervene before an undesirable action completes.

AWS describes the production consequence in their analysis of agent evaluation at Amazon: “Quality issues in production often surface in ways that traditional monitoring misses”. Status codes, response times, and token counts can all show green while the product fails at the reasoning and coordination level.

Visibility requires traces that capture individual agent steps, tool calls, and handoff events, not just the final output. It also requires activity summaries that translate those traces into language that users can understand, and state awareness that tells users where they are in a multi-step workflow.

For PMs: define what the user sees at each stage of a multi-agent task. A task that runs for ten minutes across four subagents with no user-facing updates is not invisible infrastructure. It is a broken product experience.

For engineers: instrument at the agent step level, not just the request level. Agent observability should capture what each agent received, what it called, and what it returned, with enough granularity to reconstruct the full execution trace after the fact.
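
To make "reconstruct the full execution trace" concrete, here is a small sketch that rebuilds an execution tree from flat step events linked by a parent ID. The `StepEvent` fields are an assumed schema for illustration, not a standard format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepEvent:
    step_id: str
    parent_id: Optional[str]  # None for the root orchestrator step
    agent: str
    action: str               # e.g. "tool_call", "handoff", "respond"
    received: str
    returned: str


def render_trace(events: list[StepEvent]) -> list[str]:
    """Rebuild the execution tree from flat events and render it depth-first."""
    children: dict[Optional[str], list[StepEvent]] = {}
    for event in events:
        children.setdefault(event.parent_id, []).append(event)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for event in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{event.agent}:{event.action} -> {event.returned}")
            walk(event.step_id, depth + 1)

    walk(None, 0)
    return lines


events = [
    StepEvent("s1", None, "orchestrator", "handoff", "reduce churn", "delegated"),
    StepEvent("s2", "s1", "researcher", "tool_call", "query usage data", "42 rows"),
    StepEvent("s3", "s1", "writer", "respond", "summarize findings", "draft v1"),
]
trace_lines = render_trace(events)
```

If an event arrives without a valid `parent_id`, it simply never appears in the rendered tree, which is itself a useful signal that instrumentation is incomplete.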

Recovery

Recovery is what the system does when something goes wrong:

When a subagent fails,
When a handoff delivers bad context,
When an action hits a permission boundary, or
When the workflow reaches a state it was not designed to handle.

Most teams design recovery as a single fallback: “show an error message.” That is not recovery. It is abandonment.

A production-grade multi-agent system needs at least three explicit recovery paths: retry with modified parameters, fallback to a simpler workflow, and escalation to human review.
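
Those three paths can be sketched as a small ladder. Everything here is an assumption for illustration: the function names, the retry count, and the broad `except Exception`, which a real system would narrow to the specific errors it knows how to recover from.

```python
from typing import Callable


def run_with_recovery(
    primary: Callable[[dict], str],
    simpler: Callable[[dict], str],
    params: dict,
    modify: Callable[[dict], dict],
    max_retries: int = 2,
) -> tuple[str, str]:
    """Return (path_taken, result). Paths: primary, retry, fallback, escalate."""
    attempt_params = params
    for attempt in range(max_retries + 1):
        try:
            result = primary(attempt_params)
            return ("primary" if attempt == 0 else "retry", result)
        except Exception:
            # Path 1: retry with modified parameters, not the identical call.
            attempt_params = modify(attempt_params)
    try:
        # Path 2: fall back to a simpler workflow with the original parameters.
        return ("fallback", simpler(params))
    except Exception:
        # Path 3: hand the case to a human instead of showing a bare error.
        return ("escalate", "queued for human review")
```

Returning the path taken alongside the result is deliberate: it gives the caller something to log, so retries, fallbacks, and escalations can be counted rather than inferred.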

The escalation condition matters as much as the escalation mechanism. Anthropic’s data on agent autonomy found that experienced users shift over time “from approving individual actions to monitoring what the agent does and intervening when needed”. That is a healthy trust pattern. But it only works if the system surfaces enough signal for humans to know when intervention is warranted.

For PMs: define the escalation trigger conditions before launch. What agent state, output score, or action type should route to human review? What does the product communicate to the user when escalation happens?

For engineers: implement circuit breakers for runaway delegation chains. Log every permission denial and fallback logic event as first-class telemetry, not as debug noise. Recovery paths that are not monitored cannot be improved.
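
A circuit breaker for delegation can be as simple as a budget on depth and on total subagents per session. The sketch below is one possible shape, with hypothetical names and limits; it records each denial so the event can be trended like any other metric.

```python
class DelegationBudgetExceeded(Exception):
    """Raised when a delegation would exceed the configured limits."""


class DelegationBreaker:
    """Caps delegation depth and total subagents spawned in one session."""

    def __init__(self, max_depth: int = 3, max_subagents: int = 10):
        self.max_depth = max_depth
        self.max_subagents = max_subagents
        self.spawned = 0
        self.denials: list[dict] = []  # first-class telemetry, not debug noise

    def request_spawn(self, parent_depth: int, reason: str) -> int:
        """Return the child's depth, or record a denial and raise."""
        child_depth = parent_depth + 1
        if child_depth > self.max_depth:
            self._deny("max_depth", child_depth, reason)
        if self.spawned >= self.max_subagents:
            self._deny("max_subagents", child_depth, reason)
        self.spawned += 1
        return child_depth

    def _deny(self, rule: str, depth: int, reason: str) -> None:
        self.denials.append({"rule": rule, "depth": depth, "reason": reason})
        raise DelegationBudgetExceeded(rule)
```

An orchestrator would call `request_spawn` before every delegation and route a `DelegationBudgetExceeded` into its fallback or escalation path rather than retrying the spawn.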

What AI PMs Should Put In The PRD For A Multi-Agent Workflow

Most PRD templates were built for single-feature, single-agent products. They do not account for the coordination, authority, and visibility questions that multi-agent systems introduce. Before a multi-agent workflow goes to engineering, the PRD should answer each of the following:

Agent role definitions: What is each agent responsible for, what tools does it have access to, and what is it explicitly prohibited from doing?
Permission boundaries: Which actions can proceed on implicit approval, which require explicit user confirmation, and which are always blocked regardless of context?
Delegation conditions: Under what circumstances does the orchestrator delegate to a subagent versus handling the task directly, and what criteria govern that decision?
Handoff specifications: What context must be packaged when work transfers between agents, what does the receiving agent need to know to act correctly, and who is responsible for the outcome once a handoff occurs?
User-visible states: What does the user see at each stage of the workflow, which intermediate states are communicated, and what happens to the UI during a multi-minute agent run?
Fallback and escalation flows: At what point does the system route to human review, who owns the escalation, and what does the product communicate when a fallback triggers?
Success definition: What does “done” mean in a multi-step, multi-agent task? What is the acceptance criterion, and at what point is the task complete enough to return control to the user?

That is the product specification layer. The engineering layer that makes it observable and recoverable before launch is equally specific, and equally often skipped.

What AI Engineers Should Instrument, Evaluate, And Audit Before Launch

Instrumentation decisions for multi-agent systems differ from single-agent products in scope and consequence. Before a multi-agent workflow goes to production, the following should be in place:

Agent-step tracing: Capture every subagent action as a trace event with parent agent ID, timestamp, and input/output payloads. Traces should reconstruct into a full execution graph.
Handoff logging: Log every handoff with source agent, destination agent, task specification, and context payload. Flag incomplete context transfers as failure events, not warnings.
Permission denial telemetry: Capture every blocked action with agent identity, attempted action, and the policy rule that blocked it. Permission denials are diagnostic signals about where the system design is breaking down, not noise.
Trajectory-level evaluation: Output scoring at the final response level misses failures that happen inside the workflow. Evaluation of AI agents should run across the full sequence of agent decisions, not just at the endpoint. Amazon’s agent evaluation framework covers both individual agent performance and collective system dynamics.
Fallback event monitoring: Log and trend every retry, workflow fallback, and escalation. A spike in fallback events is often the first signal of a model update, a prompt regression, or a new user behavior pattern that the system was not designed for.
Auditability before GA: Any engineer should be able to reconstruct what happened in any session from traces alone, without asking the user. If that reconstruction is not possible, the instrumentation is not sufficient for production.
Launch gate: Define minimum passing thresholds on trajectory evaluation scores, fallback rate, and permission denial rate. Treat them as a hard gate. A multi-agent system that passes output-level quality checks but fails at the trajectory or handoff level is not production-ready.
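
The launch gate in the last item can be expressed as a plain threshold check. The metric names and the threshold values below are placeholders chosen for illustration; each team would set its own.

```python
from dataclasses import dataclass


@dataclass
class LaunchMetrics:
    trajectory_score: float        # mean trajectory-level eval score, 0 to 1
    fallback_rate: float           # fraction of sessions that hit any fallback
    permission_denial_rate: float  # fraction of actions blocked by policy


# Hypothetical thresholds: "min" means at least, "max" means at most.
THRESHOLDS = {
    "trajectory_score": ("min", 0.85),
    "fallback_rate": ("max", 0.05),
    "permission_denial_rate": ("max", 0.02),
}


def launch_gate(metrics: LaunchMetrics) -> list[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = getattr(metrics, name)
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{name}={value} violates {kind} {limit}")
    return failures
```

Returning the failed checks, rather than a single boolean, keeps the gate hard while still telling the team which dimension blocked the launch.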

Final Thought

The industry has spent the past two years optimizing models. The next constraint is not model capability.

Research from Amazon’s internal deployments shows that organizations that invest in governance and evaluation are an order of magnitude more successful in reaching production than those that do not. The Linux Foundation’s Agent-to-Agent Protocol has already crossed 150 supporting organizations in its first year, a signal that the industry has recognized coordination governance as an infrastructure problem, not a product differentiator.

The teams that ship reliable multi-agent products will not be the ones with the most capable agents. They will be the ones who designed for governable autonomy:

Specifying permissions before deploying,
Instrumenting handoffs before trusting them,
Defining recovery before needing it, and
Giving users enough visibility to trust what the system was doing on their behalf.

That is the product layer most teams skip. It is also the one that determines whether a multi-agent system becomes a product or remains a prototype.

Why AI Took Coding Before Everything Else

Nilesh Barla — Sat, 04 Apr 2026 00:01:10 GMT

TLDR: AI automated coding before law, design, or strategy because code has a built-in feedback loop. Meaning, you can run tests and know immediately whether it worked. That property, which barely exists anywhere else in knowledge work, is why autonomous AI iteration was possible in software first. Understanding that logic tells you what to automate next and which parts of the PM role hold out longest. What has changed is already reshaping how engineers work, what cognitive debt accumulates inside fast-moving teams, and what product leadership actually means when execution is no longer the constraint.

The most useful way to think about a large language model is this. It has read every textbook ever published. It executes tasks instantly. And it forgets everything that happened before the current conversation. It gives confident answers to questions it genuinely cannot answer. The confidence is the problem.

Product leaders have spent careers managing exactly this kind of person. In this case, it is the junior hire who executes fast but needs context, direction, and verification. The thing that just changed is that this person now writes all the code.

This article explains why that happened — why coding automated first, before law, before strategy, before many other domains. It traces what that sequence reveals about where product leaders’ attention needs to go next.

Why AI Came for Coders First

The explanation is not that code is simpler than other knowledge work. The explanation is that code has a built-in verification loop that almost no other professional domain has. That loop made autonomous AI iteration possible in software before anywhere else.

When a model generates code, a test suite runs. The code either works or it doesn’t. That binary result tells the model exactly where it stands, without a human in the loop. The model generates, encounters a failure, reads the error message, revises, and runs again. This inner cycle closes on its own.
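
The cycle described above fits in a few lines. In this sketch the "model" is a stub that returns canned drafts and the test suite is a single check, both hypothetical; the point is only that the loop closes without a human deciding whether the output is correct.

```python
from typing import Callable, Optional


def iterate_until_green(
    generate: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> tuple[Optional[str], int]:
    """Generate code, run tests, feed the error back; stop when tests pass.

    `run_tests` returns None on success or an error message on failure.
    Returns (passing_code_or_None, rounds_used).
    """
    error: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(error)   # the generator sees the last error, if any
        error = run_tests(candidate)  # the verifier needs no human
        if error is None:
            return candidate, round_number
    return None, max_rounds


# Stub "model": a buggy first draft, then a corrected one.
drafts = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])


def fake_model(last_error: Optional[str]) -> str:
    return next(drafts)


def check_add(source: str) -> Optional[str]:
    namespace: dict = {}
    exec(source, namespace)  # acceptable for a toy; sandbox this in real use
    return None if namespace["add"](2, 3) == 5 else "add(2, 3) != 5"


code, rounds = iterate_until_green(fake_model, check_add)
```

There is no equivalent of `check_add` for a legal brief, which is the asymmetry the rest of this section builds on.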

The same property does not exist in law.

As Simon Willison put it: “If you’re a lawyer, you’re screwed, right?”

A brief written by a model may be fluent, well-structured, and completely wrong about precedent, and no automated test can catch it. There is no failing test suite for a hallucinated citation. The error surfaces in court, months later, where the damage is real.

The same applies to medical reasoning, strategic advice, and most of what knowledge workers produce. Whether the output is correct requires a human who already understands the domain.

This distinction, verifiable output versus output that needs expert judgment to check, is the most important frame for thinking about the automation timeline:

The fastest-automated domains are those where correctness can be tested automatically.
Domains that hold out longest are those where correctness is ambiguous or can only be judged by someone who already knows the problem deeply.

For product leaders, this maps directly onto your own work. Features with measurable success signals will automate faster:

Conversion rates, error rates, and latency: trackable, testable, automatable.

Work requiring judgment about ambiguous value holds out longest:

Deciding which roadmap item matters.
Aligning stakeholders around competing priorities.
Judging which user signal is real versus noise.

Verifiability is a strategic concept, and knowing which of your responsibilities falls into which bucket is now a planning skill.

The November 2025 Inflection

What changed, and why does that inflection matter?

November 2025 was not a moment of gradual improvement. It was a threshold crossing.

Models that had only handled simple, contained tasks suddenly became capable of working through complex, multi-file, deeply connected problems. Single files and narrow scope were no longer the ceiling. The models had crossed an invisible capability line where a whole new class of problems became solvable.

The clearest evidence came from inside the teams building these tools.

Boris Cherny, who created Claude Code at Anthropic, has not written a line of code by hand since November 2025. Every line in every pull request is written by the model. He ships ten to thirty pull requests a day. His contribution is not producing code; it is directing the agent and verifying its output.

For product leaders, the significance is not the output volume; it is what that volume implies about how engineers now experience their own job.

The mental model changed from “I write code, the model helps” to “I direct the agent, I verify the output.”

Engineers now spend most of their time on:

Reviewing model output for correctness and coherence.
Writing specifications precise enough for agents to act on.
Catching failures before they reach production.

They need more from product leadership as a result. This includes more precise direction, faster feedback cycles, and clearer success criteria. That need arrived ahead of most product roadmaps.

Most organizations are still structured for a world where the bottleneck was how fast engineers could write code. That bottleneck no longer exists. The constraint that replaced it is less visible, and it is already accumulating inside the teams that have moved fastest.

Cognitive Debt: The Hidden Cost Nobody’s Managing

There is a cost accumulating in engineering organizations right now that is not showing up on any dashboard: cognitive debt.

It is distinct from technical debt, and the distinction matters specifically for product leaders.

Technical debt is a code quality problem — poor architecture, shortcuts taken under pressure, messy implementations that need cleaning up later. Teams have managed this for decades.

Cognitive debt is different. Cognitive debt is a comprehension problem. It means the team has shipped something they cannot reason about.

For instance, a developer vibe-codes a feature in an afternoon. The feature works, passes tests, and ships on schedule. By every visible metric, the sprint was successful. But nobody on the team can predict what breaks when the next feature touches the same codebase.

Nobody can explain why the implementation made the choices it made. The shared mental model of the system — how it works and why — has degraded faster than the code itself.

Research into AI-assisted development teams documented exactly this pattern: teams hit a wall mid-project, unable to make simple changes without breaking something unexpected. The real problem was not code quality; it was that no one could explain why key design decisions had been made. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

Product managers feel cognitive debt first. It shows up as:

Estimates that consistently miss.
Regressions with no clear cause.
Features that cannot be extended without a full rebuild.

This is why observability stops being an engineering cost and becomes a product input. Trace data, eval systems, and production logs are how a product leader keeps enough understanding of a fast-moving, AI-written system to make planning honest.

Screenshot of causal chain analysis in the Adaline dashboard.

The PM who reads what the product is actually doing in production is managing cognitive debt. The PM who only reviews finished features is not.

What Design’s Collapse Reveals About the Whole Stack

The compression happening in engineering is not isolated. It is happening across every function simultaneously, and design is the clearest case study.

Jenny Wen, who leads design for Claude at Anthropic and was previously Director of Design at Figma, documented this compression directly.

A few years ago, 60-70 percent of her team’s time went into mocking and prototyping. That number is now 30-40 percent. That recovered time went into working directly alongside engineers, i.e., polishing implementations as they were built, doing the last-mile work the old handoff model assumed someone else would handle.

In other words, execution compressed, and the role compressed with it.

Her Hatch Conference keynote conveys a deeper point: in a world where anyone can build anything quickly, the scarce skill is no longer execution — it is curation.

And it is turning out to be true.

Choosing what to build matters more than being able to build it. And because building in the wrong direction now costs days instead of months, the PM’s old job of gating engineering with a complete spec matters less. The scarce judgment is upstream: which directions are worth exploring at all.

Two insights from this shift reach beyond design.

First, non-deterministic products break the specification model.

You cannot write a complete spec for an AI feature because the product’s behavior is not fixed; it is a range. What users experience depends on the model, the prompt, and the context, none of which you can fully anticipate in advance.

A PM writes acceptance criteria for a summarization feature: three sentences, neutral tone, key date included.

The model produces a four-sentence summary in active voice that users find more useful than the spec required. The PRD was right about the goal and wrong about every constraint.

That is what structural mismatch looks like in practice.

Specification used to come before execution. Now they run in parallel, and the PM’s job is direction, not permission.

Second, the vision horizon has collapsed.

The two-to-five-year product roadmap is obsolete for teams running at AI execution speed. What replaces it is a three- to six-month directional prototype. It has to be concrete enough to keep teams pointed at the same thing and short-term enough to be revised when model capabilities shift.

Product planning built on annual cycles is misaligned with teams that ship daily. The planning unit needs to compress to match the execution unit, or the roadmap becomes fiction nobody trusts. That directional prototype is now the PM’s primary planning artifact. It is not a detailed spec and not an annual roadmap. But it is a direction concrete enough to keep fast-moving teams aligned and short enough to stay honest.

Where the PM’s Job Shifts First

These are behavioral changes, grounded in what the evidence above actually shows.

Build for the model’s timeline, not yours.

The principle is simple: design for where the model will be in six months, not where it is today. The capability ceiling rises every quarter. Features that feel out of reach for AI execution right now will be routine within two planning cycles. Roadmaps that treat current AI capabilities as fixed points will be wrong by the time they ship.

Shift your verification energy up the stack.

Engineers now spend more time reviewing model output than writing code. Your attention should move too — from reviewing shipped features to understanding what your team actually comprehends about what was built. The cognitive debt frame makes this concrete.

Your job is not just to catch bad output; it is to maintain enough shared understanding of the system so that planning stays honest. The PM who can explain how the system works, not just what it does, is the PM whose estimates hold up.

Treat latent demand as a real-time signal.

With AI products, the signal of what users actually want appears in production before it appears in research. Users encounter non-deterministic behavior and improvise workarounds in real time, and those workarounds are data.

With language model products, you discover use cases by watching people use them, not by specifying them in advance. The PM who builds this habit — reading trace data, support patterns, and user workarounds regularly — will identify the next right feature before a formal research cycle has time to name it.

Closing

The weird, overconfident intern who has read every textbook can now write all the code. That changes execution permanently.

But what does not change is the judgment layer. That layer is now visible in a way it has never been before, precisely because execution has automated around it.

The intern cannot:

Decide what is worth building.
Know when a system that no one on the team fully understands is about to fail in production.
Read the signal in a user’s workaround that the product should have been built differently.
Hold a vision long enough to keep a fast-moving team pointed at the same thing across a quarter.

Those are product skills. The execution layer has been automated. Judgment is the job.

How To Design AI Features For Nondeterminism

Nilesh Barla — Sat, 28 Mar 2026 00:01:41 GMT

TLDR: Nondeterminism is not an edge case in LLM-powered products: it is the default. This post defines three types of production failure: output variance, behavioral drift, and reasoning-level failure. It also diagnoses the three design failures that cause the damage and walks through how to write a spec for a probabilistic feature, which means shifting from expected output to acceptance criteria, from test cases to test distributions, and from “works” to “fails by design.” If your AI PRD lacks an acceptance threshold section, it is not yet an AI PRD. Reliable AI features in 2026 are not built by teams with the best models. They are built by teams who designed for the day the model behaved unexpectedly.

The feature shipped cleanly. It passed QA, cleared stakeholder review, and ran without incident in staging. But three days after launch, a user forwarded a screenshot with a support ticket.

The AI had returned something the team could not explain. It was simply different from anything it had produced before. When the engineer pulled the logs, everything looked normal: status 200, latency normal, token count within range, no exception anywhere in the stack.

The model had simply behaved differently. That is not a bug. It is a design problem or a consequence of the probabilistic nature of AI. And until you or the team accepts that framing, every audit will lead to the wrong conclusion.

What Nondeterminism Actually Means for Product Teams

Here are three things that you, as a product leader, should be familiar with.

Output Variance: The most familiar of the three. The same input, run twice against the same model, produces two different outputs. In summarisation tasks, copy generation, and classification, this is not an edge case. It is the default behavior of every probabilistic system. Many of us know it exists, but almost none of us design for it deliberately.
Behavioral Drift: The one that blindsides teams after launch. A feature works correctly at release, and a few weeks later, something is off with no code changes anywhere. A model update, a shift in user input patterns, or a prompt encountering inputs it was never tested against can all trigger it. The team learns from user complaints, not from its own monitoring.
Reasoning-Level Failure: The hardest to catch because it produces no visible error. Our blog on Observability vs. Monitoring for Agentic AI describes this precisely: “retrieval works, tool calls complete, the model responds, but the combination of those steps produces a result that is wrong for the actual task. Monitoring shows all green. [But] the product fails.”

Nondeterminism is not a bug to fix. It is a constraint to design around, just as great product teams design around latency, mobile screen size, or network reliability.

Why Agents and Modern Models Make This Harder

A single nondeterministic call is manageable. An agent making sequential tool calls compounds the problem at every step. One failed retrieval can cascade into four downstream failures: wrong tool selection, then incomplete data, then confabulated gap-filling, then a correction loop.

You cannot write alerts for failure states you have never seen before. The blast radius of nondeterminism is proportional to agent autonomy.

Architecture comparison of open source LLMs. | Source: The Big LLM Architecture Comparison

Modern model architecture adds a layer that most product leaders do not account for. Mixture-of-Experts models like Qwen3, GLM-4.5, and DeepSeek V3 do not activate all of their parameters for every inference step. A routing mechanism selects a small subset of active experts per token. Sebastian Raschka’s Big LLM Architecture Comparison shows that DeepSeek V3 activates roughly 37 billion of its 671 billion parameters per step, because just 9 of its 256 experts activate at a time.

That means two nearly identical prompts can route to different expert combinations and produce meaningfully different outputs. This is architecture-level variance. It is not configurable.
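
A toy version of top-k expert routing shows how a small change in router scores can swap which experts run. This is generic top-k gating with made-up scores and only 8 experts; it is not the router of any specific model named above, and real routers differ in how they score and normalize.

```python
import math


def route_top_k(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    peak = max(router_scores[i] for i in top)  # subtract max for stability
    exps = {i: math.exp(router_scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}


# Toy example: 8 experts, 2 active per token.
scores_a = [0.1, 2.0, 0.3, 1.9, 0.0, 1.8, 0.2, 0.1]
scores_b = [0.1, 2.0, 0.3, 1.7, 0.0, 1.8, 0.2, 0.1]  # one score shifted by 0.2

experts_a = sorted(route_top_k(scores_a, k=2))  # experts 1 and 3
experts_b = sorted(route_top_k(scores_b, k=2))  # experts 1 and 5

# Share of parameters active per step, from the figures quoted above.
active_fraction = 37 / 671  # roughly 5.5 percent
```

A 0.2 shift in one router score is enough to replace expert 3 with expert 5, so the token is processed by a different set of weights even though the inputs looked almost the same.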

Reasoning models add a third dimension.

These models generate an internal chain-of-thought before responding, and that chain is itself variable. The GLM-5 technical report makes this explicit. The model shipped a Preserved Thinking mode specifically to retain reasoning context across conversation turns and prevent cross-turn drift.

How Preserved Thinking works in GLM-5: without it (center), the model drops all reasoning context between turns and must start from scratch. With it (right), reasoning chains persist across turns, which is what makes consistent multi-turn agent behavior achievable. | Source: GLM-5 Technical Report, arXiv 2602.15763

When model builders start engineering against a failure mode at the architecture level, that failure mode is real.

The question is not whether your AI feature will behave differently over time. The question is whether you designed for it.

The Three Design Failures Teams Make

Failure 1: Hiding Variance Instead of Surfacing It

Teams build UX that treats the AI as deterministic: no regenerate button, no confidence framing, no acknowledgment that the same question might produce a different answer tomorrow.

When variance surfaces, users experience it as a bug and report it as one. Support tickets pile up for behavior that is technically correct. Here, we explained why the same input does not guarantee the same output, and temperature introduces randomness by design.

The product response is not to hide this. It is to design around it. “Here is one way to think about this” frames output differently than “Here is your answer.” A regenerate button signals that trying again is normal, not a sign that something broke. The goal is calibrated trust: not blind trust, not distrust, but calibrated.

Failure 2: Writing Binary Acceptance Criteria

Here is how it usually goes. The PRD says "the AI returns a correct answer." QA runs three test cases, marks them green, and the feature ships. Nobody questions what "correct" actually means, because it felt obvious in the room.

Three weeks later, production surfaces a failure pattern nobody can reproduce, because the test cases were not a “distribution.” They were essentially a demo.

A demo compresses all the variability of production into a single scenario, hiding messy inputs, long-tail formats, and drift. A prompt can look stable on five hand-picked examples, then break on some random day when a new user arrives with a different intent.

The fix is defining success as a rate, not a binary. Instead of “the AI returns a correct answer,” write: “the AI passes this rubric on at least 90 percent of real production inputs.”
Nine out of ten is a target you can measure. It is also a target that can degrade over time, which means you will know when it does.

LLM-as-a-judge, where a model scores outputs against defined criteria for accuracy, relevance, and instruction adherence, is the only evaluation mechanism that scales when there is no single correct output.
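The rate-based criterion above can be sketched in a few lines. This is a minimal illustration, assuming each output has already received a pass/fail verdict from a rubric (human or LLM-as-a-judge); the function names and the sample data are hypothetical.

```python
def pass_rate(verdicts):
    """Fraction of outputs that passed the rubric. verdicts is a list of booleans."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def meets_threshold(verdicts, threshold=0.90):
    """True when the observed pass rate is at or above the spec's threshold."""
    return pass_rate(verdicts) >= threshold


# Hypothetical eval runs over 20 production inputs.
this_week = [True] * 18 + [False] * 2   # 18 of 20 pass
next_week = [True] * 17 + [False] * 3   # 17 of 20 pass

print(meets_threshold(this_week), meets_threshold(next_week))
```

Because the criterion is a number, the drop from 18 to 17 passes shows up as a failed check instead of an unreproducible bug report.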

Failure 3: Treating Fallback as an Afterthought

The spec says, “display error message if the AI fails,” on a single line, and then moves on.

But failure in a nondeterministic system is rarely binary.

The AI responds, but sometimes it responds badly. Silent failures do not crash anything. They erode trust, safety, and budget a little at a time, until users stop believing the feature works at all.

The fix is designing three explicit fallback tiers before the first sprint begins.

Soft fallback delivers a simpler and narrower output at low confidence.
Human handoff routes high-stakes or ambiguous cases to a person. Essentially, think of it as human-in-the-loop.
Silent skip does nothing rather than do something wrong.

The choice between these three is a product decision. It belongs in the PRD.
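One way to make the three tiers explicit in a spec is to write them down as a small decision function. The sketch below is an assumption-heavy illustration: the confidence score, the `high_stakes` flag, and both thresholds are hypothetical placeholders that a real PRD would have to define and calibrate.

```python
from enum import Enum


class Fallback(Enum):
    NONE = "full_output"
    SOFT = "soft_fallback"      # simpler, narrower output
    HANDOFF = "human_handoff"   # route to a person
    SKIP = "silent_skip"        # do nothing rather than do wrong


def choose_fallback(confidence, high_stakes, soft_floor=0.5, full_floor=0.8):
    """Pick a fallback tier. Thresholds are illustrative, not recommendations."""
    if high_stakes and confidence < full_floor:
        return Fallback.HANDOFF
    if confidence >= full_floor:
        return Fallback.NONE
    if confidence >= soft_floor:
        return Fallback.SOFT
    return Fallback.SKIP


print(choose_fallback(0.65, high_stakes=False))
print(choose_fallback(0.65, high_stakes=True))
print(choose_fallback(0.30, high_stakes=False))
```

Writing the tiers this way forces the product questions into the open: what counts as high stakes, and where each threshold sits.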

How to Write a Spec for a Probabilistic Feature

There are three concrete shifts that separate a spec for a deterministic feature from a spec for a probabilistic one. Each shift changes what you ship.

From expected output to acceptance criteria.
The wrong spec line reads: “The AI returns a correct summary.” The right version reads: “The AI produces a summary that passes the following rubric on 90 percent of a representative input set.”

The difference forces the team to agree on what “good” means before building, not after shipping. Our blog on Prompt Management for Product Leaders makes the point directly: evaluation is the key to iteration, and you cannot iterate toward a target you have not defined.

I would recommend another work of ours, “AI Observability and Evaluations,” which covers how to build a system that makes those improvements trackable.

From test cases to test distributions.
A single test case is a demo.

A distribution is a product.

Effective evaluation starts with roughly 20 representative cases that reflect actual production input. These are not the clean happy path, but messy inputs, edge formats, and ambiguous queries that real users send.

This starting set expands over time using production traces, not gut instinct. The spec should state where the initial eval set comes from before development begins.

From “works” to “fails by design.”
Every AI feature spec should include a Failure Modes section that answers three questions:

What does the feature do when the output confidence is low?
What happens when a tool times out?
What does the user see when the AI produces output outside the acceptable range?

These are product decisions. They belong in the spec, not in a Slack thread three weeks after launch.

If your AI PRD does not have an acceptance threshold section, it is not yet an AI PRD. For a complete structural template, the AI PRD guide walks through exactly what that section should contain.

Observability Is the Runtime Layer

Good threshold design requires knowing what the production distribution actually looks like. Traditional monitoring cannot tell you.

Observability vs. Monitoring for Agentic AI documents the issue precisely: status codes, response times, and token counts can all show green while the product is failing. The agent may be retrieving irrelevant content, calling the wrong tool seventeen times, or filling its context window with garbage. None of that surfaces in an infrastructure dashboard.

The design decisions from the previous sections only hold up if the team can see what is happening at the level of reasoning.

Screenshot of causal chain analysis in the Adaline dashboard.

Fallback triggers cannot be calibrated without traces that show where and why failures happen. The real value of a proper observability layer is the ability to ask new questions about old data, tracing a bad decision back through every tool call, every retrieval step, and every token that shaped the final output.

The three fallback tiers described above need threshold data to stay correctly calibrated as the feature evolves in production.

That data comes from traces, not from the test suite.

The spec defines what acceptable behavior looks like. Observability tells you whether you are getting it. For the full operational picture on how to instrument this at the agent level, the Observability vs. Monitoring for Agentic AI post is the companion operational read for everything covered in this blog.

A Checklist for Product Leaders

Before you spec:

Have you defined what “acceptable output” looks like as measurable criteria, not as a description?
Have you named the three failure types for this specific feature: output variance, behavioral drift, and reasoning-level failure?
Have you designed all three fallback states: soft fallback, human handoff, and silent skip?
Have you decided which failure modes are acceptable and which are not before the first sprint begins?

Before you ship:

Does your eval set reflect real production inputs, not just the clean demo cases?
Have you run evaluations at the failure boundary, testing what happens when confidence drops or a tool times out?
Is observability instrumented to capture why a decision happened, not just that it happened?
Does QA know that “cannot reproduce” is not a reason to close an AI ticket?

After you ship:

Are behavioral threshold alerts set, not just infrastructure metric alerts?
Is there a post-incident process for AI failures that traces back to the original spec?
Is the eval set growing from production evidence on a defined cadence?

Closing

The teams shipping reliable AI features in 2026 are not the ones with access to better models. Open-source models like Qwen3, GLM-4.5, DeepSeek V3, and Kimi K2.5 have made agents faster and more capable, and so have closed-source models like GPT 5.4, Claude 4.5, and Gemini 3.

All of them are suited to longer-horizon tasks than anything available a year ago. Sebastian Raschka’s Big LLM Architecture Comparison documents labs claiming reasoning systems that can sustain autonomous task execution for thirty hours straight.

That is a genuine capability expansion. It does not solve the product design problem. Capability and reliability are different problems, and the industry conflates them constantly. What separates good AI product teams from great ones is not the model they chose. It is whether they wrote a spec for the day the model behaved unexpectedly.

Your AI PRD Is Missing Its Hardest Sections

Nilesh Barla — Sat, 21 Mar 2026 00:01:10 GMT

TLDR: This post is for product managers, builders, and teams shipping AI features. The central argument is that a PRD for an AI feature is not a specification of behavior; it is a behavioral contract. It is what defines success thresholds, failure modes, fallback logic, and what the system is never allowed to do. This blog breaks down five classic PRD sections that need to be rewritten for AI. It introduces a sixth section that no standard template includes, and walks through a concrete before-and-after example using a meeting summary feature. By the end, you will have a framework you can apply to the next AI feature PRD you write.

Consider a PM who hands an engineer a PRD for an AI writing assistant. The acceptance criteria read: the summary should be accurate and concise. Three weeks later, the feature ships. After reviewing it, the PM says it is broken. The engineer says it passes the spec.

Here is the problem: they are both right.

Let me explain.

Product circles have been debating whether the PRD is dead, and the AI PRD in particular has become a flashpoint. Aakash Gupta put it clearly.

The spec did not die; it moved. The old flow was a permission document written before anyone had seen the system behave. And it took eight to twelve weeks. The new flow is a decision record written after the prototype has shown you what you are working with, which now takes one to two weeks.

At Anthropic, Boris Cherny’s team does not write specs at all; they run prototypes in parallel and ship dozens of pull requests every day.

OpenAI takes the opposite position. With 800 million monthly active users, a feature without a written behavior contract creates alignment problems that no amount of working code can solve.

Sean Grove made this point in his “The New Code” talk: when hundreds of engineers are building on the same system, a written spec does something working software cannot. It keeps shared intent visible and consistent across the entire team.

That framing is correct. But it sidesteps the harder question. Once the spec moves to step six, what does a PRD for an AI feature actually contain? Especially when behavior is probabilistic, failure modes are invisible, and "accurate" is not a success criterion but an aspiration. Here is what most teams are still missing.

What Can a Prototype Not Tell You?

The prototype-first movement is correct about sequencing. You discover things by building that no planning document would find. But a working prototype answers the wrong questions for a PRD. It essentially shows you what the system does. It cannot tell you:

Why is the change worth making?
How does the feature connect to the broader product strategy?
Who sees it first and under what release conditions?
What does “good enough to graduate” mean as an actual number?
Which tradeoffs and side effects have you decided to consciously accept?

Aakash Gupta identified those five gaps as the core value of a well-written spec in his August 2025 deep-dive on AI PRDs in Product Growth.

The prototype is a discovery tool. The PRD is an alignment artifact.

And the PRD becomes richer and more honest once you have seen how the system behaves.

For AI features specifically, there are three additional gaps that standard PRD thinking has not yet addressed.

Eval thresholds: You need a specific, numeric definition of what good looks like before you ship, not a general sense that the outputs “seem okay.”
Fallback behavior: When the model gets it wrong, and it will, what does the system do? Does it return a failure response, surface uncertainty to the user, or escalate to a human? This is product logic, and it belongs in the spec.
Behavioral constraints: A definition of what the system must never do, regardless of what the user asks. This is the boundary layer that protects users when the model is technically responsive but wrong in ways that cause harm or erode users’ trust.

The prototype shows you the feature. The PRD defines the contract.

The Sections You Need to Rewrite for a PRD for an AI Feature

The classic PRD format has five sections that appear in almost every template: problem statement, acceptance criteria, success metrics, failure modes, and definition of done. For an AI feature, each requires a different kind of thinking than most teams currently apply.

Problem statement: Largely unchanged, with one addition: state the cost of a wrong answer explicitly. A standard problem statement frames the user’s need. An AI problem statement also frames the consequences of failure.

For a customer service bot, a hallucinated policy destroys trust in a way that a slow page load never does. In a clinical setting, a triage tool's wrong answer could cause direct harm. Naming that cost upfront shapes every decision that follows, from how strict the quality bar needs to be to whether the feature should exist at all.

Acceptance criteria: This is where most AI PRDs collapse. Hamel Husain and Shreya Shankar have trained over 2,000 engineers and PMs on evaluation systems at companies including OpenAI and Anthropic. Their September 2025 guide on Lenny's Newsletter makes a point I keep coming back to: the first instinct is to reach for off-the-shelf metrics such as hallucination rate and toxicity scores, numbers that look rigorous before you understand how your specific feature actually fails.

Those numbers are not wrong. They are meaningless until you have grounded them in your product’s real failure patterns. What matters is how your feature fails, not how AI systems fail in general.

Source: Lenny's Newsletter, “Building eval systems that improve your AI product,” by Hamel Husain and Shreya Shankar.

Writing “should not hallucinate” in an AI feature acceptance criteria section is the same mistake as writing “the app should be fast.” It sounds right, but it measures nothing actionable.

This is the problem that eval-driven development is designed to solve: you build the measurement system alongside the feature, not after it ships broken.

The fix is binary pass/fail criteria tied to specific failure modes. Hamel and Shreya are direct on the scoring format in their September 2025 guide: Likert scales are a trap. The distinction between a 3 and a 4 is subjective and inconsistent.

Binary pass/fail forces clarity.

Adaline’s eval interface in practice: every output gets a clear pass/fail verdict, plus a written reason. The reviewer never has to decide whether an output is a 3 or a 4.

The nuance belongs in a written critique explaining why the judgment was made, detailed enough for a brand-new employee to understand it. An LLM-as-judge can automate this scoring at scale, but the human benchmark must come first.

The criteria also need to specify what percentage of cases must pass and who holds the final judgment. A concrete version: a senior PM reviews 20 random outputs per sprint, and if more than two fail the quality bar, the feature goes back to prompt iteration. That sentence is a testable contract. “Should be accurate and concise” is not.
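The sprint review rule above is simple enough to express as code, which is a good test of whether it is really a contract. This is a minimal sketch: `sprint_review` and `verdict_fn` are hypothetical names, and in practice `verdict_fn` stands in for the senior PM's recorded pass/fail judgment on each output.

```python
import random


def sprint_review(outputs, verdict_fn, sample_size=20, max_failures=2, seed=None):
    """Sample outputs at random, count failures, and apply the spec's gate."""
    rng = random.Random(seed)
    sample = rng.sample(outputs, min(sample_size, len(outputs)))
    failures = sum(1 for output in sample if not verdict_fn(output))
    return {
        "reviewed": len(sample),
        "failures": failures,
        "back_to_prompt_iteration": failures > max_failures,
    }


# Hypothetical sprint: 20 outputs, 3 of which the reviewer marked as failing.
outputs = [{"id": i, "passed": i not in (4, 9, 13)} for i in range(20)]
result = sprint_review(outputs, verdict_fn=lambda o: o["passed"], seed=7)

print(result)
```

With exactly 20 outputs the sample covers all of them, so three failures exceed the limit of two and the feature goes back to prompt iteration.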

The AI development lifecycle is a continuous cycle: iterate, evaluate, deploy, monitor, and back again. The behavioral contract you write in the PRD is what makes each stage accountable to the last.

Success metrics: You need two explicit layers, not one.

The first layer covers model quality metrics: output correctness, hallucination rate, LLM-as-judge pass rate, and completeness. These live upstream of the user experience and reveal whether the foundation is sound.

The second layer covers product metrics: task completion rate, session depth, and user override rate, which is the percentage of AI outputs the user manually edits or ignores. User override rate is one of the most honest signals in an AI product. When it climbs, users have stopped trusting the feature, even if they are not explicitly saying so.

Almost every PRD I have seen contains only the second layer. Both are required.
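User override rate, as defined above, is straightforward to compute once each AI output is logged with what the user did next. The sketch below assumes a hypothetical event log with three outcome labels: `accepted`, `edited`, and `ignored`.

```python
def user_override_rate(outcomes):
    """Share of AI outputs the user manually edited or ignored."""
    if not outcomes:
        return 0.0
    overridden = sum(1 for outcome in outcomes if outcome in ("edited", "ignored"))
    return overridden / len(outcomes)


# Hypothetical log: 8 outputs accepted as-is, 1 edited, 1 ignored.
outcomes = ["accepted"] * 8 + ["edited", "ignored"]

print(user_override_rate(outcomes))
```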

Failure modes: The best failure modes do not come from imagination. They come from reviewing real outputs. Hamel and Shreya recommend starting with a single human expert, often the PM, who sits with roughly 100 real prototype interactions and writes open notes on anything that looks or feels off.

The reason this works is captured by research on criteria drift cited in their guide. People are poor at articulating their full quality requirements in the abstract. Seeing the output is what surfaces the requirement.

Essentially, real criteria emerge from the act of reviewing and annotating, not from imagining edge cases before anything has shipped.

Consider an AI that summarizes incoming support tickets for customer success agents. In early prototype runs, it marked several tickets as resolved when the customer had simply stopped responding, not because the issue was actually closed. That specific constraint, “must not infer resolution from user silence,” would never have appeared in a PRD written before the prototype ran.

The failure makes the rule visible.

Write your failure modes after reviewing 20 to 50 real prototype outputs and grouping what you observed into concrete categories. That is the section that earns its place in the document.

Definition of done: In a standard PRD, done means QA sign-off. For an AI feature, done requires two additional conditions:

The specified eval suite must pass at the defined threshold.
The quality arbiter, in most cases the PM, must have reviewed a representative batch of outputs and signed off explicitly.

Engineering done and product done are not the same for a probabilistic system. And treating them as equivalent is how low-quality AI features get shipped without anyone being clearly responsible.

When a team ships an AI feature that only QA signed off on, and outputs start degrading in production two weeks later, the definition of done determines who owns the decision to pull it.

If that question is unanswered in the PRD, it will be unanswered at the worst possible moment.

The Section That Does Not Exist in Standard PRDs

There is one section that no PRD template includes and that every AI PRD requires: behavioral constraints.

Behavioral constraints define what the system must never do, independent of what the user asks. They are not failure modes; failure modes describe things that go wrong unintentionally.

Behavioral constraints describe boundaries that the system must hold, even when the model is technically capable of crossing them. They are the equivalent of the system prompt in implementation: the boundary layer that the PM defines, and the engineer enforces.

Examples:

Must not fabricate citations or statistics.
Must not provide specific legal or medical advice.
Must not imply that a feature exists that is not currently offered.
Must decline politely with a specific message when the input is out of scope.

Vague behavioral constraints are functionally useless. Colin Matthews, writing about AI prototyping for Lenny’s Newsletter in January 2025, observed that the same discipline that makes AI coding tools reliable, being hyperspecific about what should change, is what makes behavioral constraints work. A vague instruction to an engineer produces the same result as a vague prompt to a model: confident-sounding noise.

Source: Lenny’s Newsletter, “A guide to AI prototyping for product managers,” by Colin Matthews.

Here is what the difference looks like in practice. “Should not hallucinate” is not a constraint; the useful version is: must not cite a source that was not present in the retrieved context. “Should be helpful” measures nothing; the useful version is: must attempt a response for any in-scope query, and must decline with a specific message for any out-of-scope query. “Should be concise” has no edge; the useful version is: summary output must be under 150 words unless the input exceeds 2,000 words.

Each of those rewrites does the same thing: it gives an engineer, an automated judge, or a new hire enough precision to make a consistent call on whether the output passes or fails.
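Two of the rewritten constraints are precise enough to check mechanically. The sketch below is one possible reading, not a reference implementation: word counts use whitespace splitting, and the `[source:ID]` citation marker is a hypothetical format chosen for the example.

```python
import re


def check_length(summary, source_text, max_words=150, long_input_words=2000):
    """Pass if the summary is under max_words, unless the input exceeds long_input_words."""
    if len(source_text.split()) > long_input_words:
        return True
    return len(summary.split()) < max_words


def check_citations(summary, retrieved_ids):
    """Pass only if every [source:ID] marker refers to a retrieved document."""
    cited = set(re.findall(r"\[source:([\w-]+)\]", summary))
    return cited <= set(retrieved_ids)


retrieved = ["doc-1", "doc-2"]
grounded = "Refunds take five days [source:doc-1]."
ungrounded = "Refunds take five days [source:doc-9]."

print(check_citations(grounded, retrieved), check_citations(ungrounded, retrieved))
print(check_length("word " * 149, "short input"), check_length("word " * 150, "short input"))
```

Each check returns a plain pass or fail, which is the same format a human reviewer or an automated judge would use.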

The PM owns this section. Engineers should not be inventing behavioral boundaries while writing code. By the time the code is being written, the constraints should already be settled.

A Worked Example: Meeting Summary for B2B SaaS

Take a concrete feature: an AI-powered meeting summary for a B2B SaaS product. Users paste in a transcript, and the feature returns a structured summary with action items. Here are two versions of the PRD for this feature, shown sequentially.

Version A: What most teams write.

The PRD describes a feature that reads transcripts and generates concise summaries with action items. The acceptance criteria read: the summary should be accurate and capture key points. The success metric is a user's thumbs-up or thumbs-down. Failure modes are not listed. The definition of done is a QA sign-off. It sounds reasonable. It produces a broken feature with no clear owner and no shared definition of good.

Version B: The behavioral contract.

This version was written after the PM reviewed 30 prototype outputs before writing a single criterion. That is the sequence: see the system fail, then write the contract.

Acceptance criteria: An LLM-as-judge scores outputs at 4 out of 5 or higher on coherence and completeness for 90 percent of test cases. The PM reviews 15 random outputs per sprint, with fewer than 2 failures per cycle. Pass or fail is defined as: Does the summary correctly capture every action item assigned to a named person? That threshold came directly from watching prototype outputs miss action items. The PM saw the failure before writing the criterion.
Success metrics, model layer: Hallucination rate, defined as any claim not supported by the transcript, must remain under 3 percent. Completeness score from LLM-as-judge must be above 85 percent. For a deeper breakdown of what to measure at this layer, the PM guide to evaluating LLM outputs covers the methodology in full.
Success metrics, product layer: Feature activation rate and user override rate, which is the percentage of summaries the user manually edits heavily, with a target of under 20 percent.
Failure modes, drawn from reviewing 30 prototype outputs: The model fabricated deadlines not stated in the transcript. It dropped action items from speakers whose accents the transcription engine handled poorly. It occasionally produced summaries longer than the original transcript. None of these were written from imagination. They were found.
Behavioral constraints: Must not infer deadlines that were not explicitly stated. Must label uncertainty when speaker intent is ambiguous. Must decline if the transcript is under 100 words.
Definition of done: The eval suite passes at the specified thresholds. The PM has reviewed one full sprint’s worth of outputs and signed off.

The difference between the two versions is not formatting. It is the work that happened before writing. The PM reviewed real outputs, found real failures, and turned those observations into a testable behavioral contract. That is what a PRD for an AI feature is supposed to do.
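Version B is specific enough that its thresholds can be written down as a single check. The sketch below takes the numbers directly from the contract above; the `SprintResults` structure, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SprintResults:
    judge_scores: list         # LLM-as-judge scores from 1 to 5, one per test case
    hallucination_rate: float  # share of claims not supported by the transcript
    completeness: float        # LLM-as-judge completeness score
    override_rate: float       # share of summaries the user edits heavily
    pm_failures: int           # failures among the PM's 15 reviewed outputs


def contract_violations(r):
    """Return the list of Version B thresholds that this sprint missed."""
    violations = []
    share_high = sum(1 for s in r.judge_scores if s >= 4) / len(r.judge_scores)
    if share_high < 0.90:
        violations.append("judge score of 4+ on fewer than 90% of cases")
    if r.hallucination_rate >= 0.03:
        violations.append("hallucination rate not under 3%")
    if r.completeness <= 0.85:
        violations.append("completeness not above 85%")
    if r.override_rate >= 0.20:
        violations.append("override rate not under 20%")
    if r.pm_failures >= 2:
        violations.append("PM review found 2 or more failures")
    return violations


good = SprintResults([5] * 9 + [3], 0.02, 0.90, 0.15, 1)
bad = SprintResults([5] * 8 + [3, 2], 0.05, 0.90, 0.15, 2)

print(contract_violations(good))
print(contract_violations(bad))
```

An empty list means the definition of done is met on the eval side; the PM's explicit sign-off is still a separate step.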

Conclusion

Pull out the last AI feature PRD your team wrote. Find the acceptance criteria section. Ask one question: could a new hire with no context on this feature use these criteria to decide whether a given output passes or fails?

If the answer is no, you do not yet have acceptance criteria. You have aspirations.

The PRD is not dead. It is harder. Writing a behavioral contract for an AI feature requires you to have seen the system fail, name the failure modes, make a judgment call about what good means, and document that judgment in a form that survives a sprint review.

That work is harder than writing a feature description. It is also the work that separates a PM from a vibe coder.

There is a secondary thesis running through this post worth stating plainly: the PM owns the quality bar for an AI feature, not the engineer. Not because engineers cannot reason about quality, but because what “good” looks like is a product decision, not an engineering one.

That product decision depends on the cost of a wrong answer, the user’s tolerance for failure, and the competitive stakes of the feature. Those judgments belong in the PRD, where the PM makes them visible and accountable.

The PM’s job in AI products is to make good legible, to the team, to the evaluators who will test it, and to yourself. That work starts in the PRD, long before anything ships.

Embeddings for AI Agents: What Product Leaders Must Know

Adaline — Sat, 14 Mar 2026 00:01:23 GMT

TLDR: This blog makes one argument: embeddings are not just a retrieval mechanism, they are the full context system of every agentic product. You will learn the four jobs that embeddings do in every agent and why each one is a product decision, not an engineering detail. You will also see how multi-agent systems use shared embeddings for sub-agent coordination. This blog is written for product managers, engineers, and builders who are actively building agentic products. If embedding quality is something you have fully delegated to engineers, this blog is where to start.

Philipp Schmid of Google DeepMind put it directly in his June 2025 piece. In “The New Skill in AI is Not Prompting, It’s Context Engineering”, he wrote: “Most agent failures are not model failures anymore, they are context failures.”

The model is capable, but what it receives is where production systems break down. Embeddings for AI agents are the mechanism that determines what an agent receives at every step. They control what gets retrieved, what gets remembered, and what gets passed forward.

For product leaders, embeddings are not an infrastructure decision to delegate. They are product decisions that shape quality and user experience at every layer. This blog is not a vector math tutorial. It is a product strategy argument — why the embedding layer matters, and why getting it wrong explains more failures than a weak model ever could.

What Are Embeddings for AI Agents?

When a language model processes text, it works with numbers, not words. Embeddings are the translation layer that enables this. An embedding model converts text, images, or code into a vector of numbers. Those numbers capture meaning — the relationships between concepts and the intent behind a phrase.

An animated workflow of how the Gemini-2 embedding model works by Google DeepMind.

Tomas Mikolov and colleagues at Google formalized this in their 2013 Word2Vec paper. The paper showed that vectors encode semantic relationships with surprising precision. The most-cited example: the vector for “king” minus “man” plus “woman” yields a vector close to “queen.”
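The arithmetic is easier to see with a toy example. The two-dimensional vectors below are hand-made for illustration, with one axis standing for "royalty" and one for "gender"; they are not real Word2Vec embeddings, which have hundreds of dimensions and land near the target, not exactly on it.

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


# Toy vectors: (royalty, gender). Values are invented for this example.
king = (1.0, 1.0)
man = (0.0, 1.0)
woman = (0.0, -1.0)
queen = (1.0, -1.0)

result = add(sub(king, man), woman)

print(result, result == queen)
```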

Two sentences that mean the same thing land close together in vector space:

“Cancel my subscription.”
“I want to stop paying for this.”

Two phrases that share a word but mean different things land far apart:

“Bank account.”
“River bank.”

Embeddings encode meaning, not form. That is what makes them the right foundation for any system that needs to understand intent.

The vector produced lives in a vector database alongside millions of others. When the system needs relevant information, it converts the query into a vector and searches for the closest matches. This is called semantic search or vector similarity search.
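A minimal sketch of that search step, using cosine similarity over a tiny in-memory store. The three-dimensional vectors are invented for the example; a real system would get them from an embedding model and keep them in a vector database with an approximate nearest-neighbor index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, store, top_k=2):
    """Return the top_k stored texts ranked by similarity to the query."""
    ranked = sorted(store, key=lambda text: cosine(query_vec, store[text]), reverse=True)
    return ranked[:top_k]


# Hypothetical embeddings for the example phrases above.
store = {
    "Cancel my subscription.": [0.9, 0.1, 0.0],
    "I want to stop paying for this.": [0.8, 0.2, 0.1],
    "River bank.": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # stands in for an embedded cancellation request

matches = search(query, store)
print(matches)
```

The two cancellation phrases come back as the closest matches even though they share no keywords, while the unrelated phrase is left out.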

What product teams build on top of that foundation determines whether agents hold up in production or quietly erode user trust.

How AI Agents Use Embeddings: Retrieval, Memory, Routing, and Personalization

A chat interface processes a message and returns a response.

An agent does much more. It decides what to do, executes steps, uses tools, and builds toward a goal across multiple turns. The difference is not just architectural. It is temporal. That temporal dimension is exactly why agents depend on embeddings in ways a chat interface never needed to.

Retrieval and grounding.

When an agent needs to complete a task, it needs relevant context. The agent converts the current query into a vector and searches the database for the closest chunks. It then pulls those chunks into its context window.

Research at ICLR 2025 found that irrelevant retrieved passages, i.e., “hard negatives,” degrade output quality even when recall is high.

A 2025 paper classifying errors across RAG systems confirmed the same: retrieval failures and generation failures compound each other. When the context layer fails, the model cannot compensate.

More retrieved passages do not mean better context. RAG accuracy peaks at ~10 passages and declines as precision drops and misleading passages enter the context window. | Source: Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Memory.

Agents need to remember things across sessions, not just within one. Consider these examples:

A support agent should remember that a user prefers email over phone calls.
A research agent should remember open questions from the last session.
A sales agent should remember the deal context from six weeks ago.

Embeddings make this possible by encoding past interactions as vectors. The system retrieves them semantically when they are needed. Google’s Agent Development Kit (ADK), released in 2025, treats this as a first-class architectural requirement. It separates short-term session memory from long-term persistent memory. It then uses vector similarity search to retrieve only what is relevant, rather than injecting an entire history into the context window.

Routing.

In multi-step workflows, agents decide what happens next. The choice might be:

Which tool to call?
Which knowledge base to query?
Which sub-agent to hand the task off to?

Semantic routing uses embeddings to match an intent to the right next step. Instead of brittle “if X then Y” rules, the routing layer uses embedding similarity to match queries to capabilities. This makes the system far more flexible as user language varies across thousands of real interactions.
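Semantic routing can be sketched as a nearest-capability lookup with a floor below which nothing is chosen. Everything named here is hypothetical: the capability names, their vectors, and the `min_score` of 0.7 are placeholders, and real capability vectors would come from embedding each tool or sub-agent description.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def route(query_vec, capabilities, min_score=0.7):
    """Return the best-matching capability, or 'fallback' if nothing is close enough."""
    best_name, best_score = None, -1.0
    for name, vec in capabilities.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else "fallback"


# Hypothetical capability embeddings.
capabilities = {
    "billing_tool": [1.0, 0.0, 0.0],
    "docs_search": [0.0, 1.0, 0.0],
    "escalation_agent": [0.0, 0.0, 1.0],
}

clear_intent = route([0.9, 0.2, 0.1], capabilities)
vague_intent = route([0.1, 0.1, 0.1], capabilities)

print(clear_intent, vague_intent)
```

The floor matters as much as the match: a query that is equally far from every capability should trigger a fallback, not a confident wrong route.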

Personalization.

Embeddings encode user behavior, preferences, and history in a form that is queryable. A recommendation agent that understands a user’s history as a vector finds semantically similar content without an explicit search term. The personalization is grounded in the meaning of past behavior, not keywords. That is what makes it feel relevant rather than mechanical.

How Multi-Agent Systems Use Shared Embeddings for Coordination

Multi-agent architectures are becoming the standard production pattern for complex agentic products. A customer success platform might coordinate across:

A billing agent.
A technical support agent.
A knowledge retrieval agent.
An escalation agent.

Each sub-agent is specialized. The coordination challenge sits between them. When the coordinator passes context to a sub-agent, it needs to be semantically accurate. The sub-agent needs the relevant pieces of conversation history, user state, and task context to do its job. A raw transcript dump does not cut it.

Research on the DyTopo routing system (February 2026) found a clear result. Reconstructing agent communication paths using embedding-based semantic matching at each reasoning step produced a 6.2% average improvement over fixed routing rules. That is a meaningful margin in workflows where failures accumulate across steps.

(A) Single-agent. (B) Fixed topology: same agent graph every round. (C) DyTopo: embeddings rebuild the graph each round based on task goal — the architecture behind the 6.2% improvement. | Source: DyTopo,

A shared-memory architecture relies on all agents accessing the same vector database. When one agent learns something important, like a user preference, a resolved constraint, or a task dependency, it writes that to shared memory as an embedding. When another agent needs it later, it retrieves it semantically.
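
The write-then-retrieve flow can be sketched with an in-memory stand-in for the vector database. The `SharedMemory` class, the agent names, and the two-dimensional vectors are assumptions for illustration only; a production system would use a real vector store and real embeddings.

```python
import math
from dataclasses import dataclass, field

def _cosine(a, b):
    # Assumes non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class MemoryItem:
    author: str   # which agent wrote the memory
    text: str     # the fact itself
    vector: list  # its embedding (toy values here)

@dataclass
class SharedMemory:
    items: list = field(default_factory=list)

    def write(self, author, text, vector):
        self.items.append(MemoryItem(author, text, vector))

    def search(self, query_vector, top_k=1):
        # Rank every stored item by similarity to the query and keep the best ones.
        ranked = sorted(self.items, key=lambda it: _cosine(query_vector, it.vector), reverse=True)
        return ranked[:top_k]

memory = SharedMemory()
memory.write("billing_agent", "User prefers annual invoicing", [0.9, 0.1])
memory.write("support_agent", "User runs version 2.3 on Linux", [0.1, 0.9])

# Later, a different agent asks a billing-flavored question.
hit = memory.search([0.8, 0.3], top_k=1)[0]
print(hit.author, "->", hit.text)
```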

The Federation of Agents framework demonstrated this at scale. Using Versioned Capability Vectors — agent profiles indexed and retrieved through semantic search — it achieved a 13× improvement over single-model baselines on complex multi-step reasoning tasks.

The orchestrator embeds each sub-task and scores it against agent capability profiles using cosine similarity. The highest score determines routing — Sub-task 3 routes to Agent A (0.70), Sub-task 1 to Agent D (0.73). | Source: Federation of Agents

The pattern is consistent: sub-agent systems with a well-maintained shared vector store outperform systems built on static context injection or keyword routing — not because the models are stronger, but because the context system is better designed.

Why Embedding Quality Is a Product Decision, Not an Engineering One

Embedding quality is a product decision. The choices involved directly determine user experience:

Which embedding model do you use?
How do you chunk documents before embedding them?
How often do you refresh the vector store?
Which retrieval strategy do you apply?

A support agent that retrieves stale documentation frustrates users.

A research agent that misses the most relevant source because it was chunked poorly loses user trust.

A sales agent that forgets a deal detail because it was never stored loses the deal.

Product leaders who understand embeddings make better calls here.

They push for retrieval quality metrics to be tracked in production, not just during demos.
They ask whether the embedding model was fine-tuned on domain-specific content.
They question whether the chunking strategy preserves meaning at document boundaries.
They insist that memory architecture is designed before launch, not patched after users complain.
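
To make "retrieval quality metrics" concrete, here is a small sketch of two common ones, recall@k and mean reciprocal rank, computed over logged queries. The document IDs and relevance labels are invented for the example; in practice they would come from sampled production traffic with human or rubric-based labels.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical logged queries: what the retriever returned vs. what was actually relevant.
logged = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": {"d1"}},
    {"retrieved": ["d4", "d9", "d2"], "relevant": {"d2", "d5"}},
]

mean_recall_at_2 = sum(recall_at_k(q["retrieved"], q["relevant"], 2) for q in logged) / len(logged)
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in logged) / len(logged)
print(f"recall@2={mean_recall_at_2:.2f} mrr={mrr:.2f}")
```

Tracked over time, a drop in either number is an early signal that the vector store has gone stale or that a chunking change hurt retrieval.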

The most common mistake is treating embeddings as only “the RAG layer.” Retrieval-augmented generation is one use case. Embeddings also power:

Memory across sessions.
Semantic routing between agents.
Personalization based on behavioral history.
Anomaly detection when agent outputs diverge from expected patterns.

A team that scopes embeddings as only a retrieval pipeline leaves memory, routing, and personalization undesigned. Teams that treat embeddings as the full memory and coordination layer build systems that scale with workflow complexity. The others spend months patching failures that could have been designed away from the start.

The Strategic Edge in the Agentic Era

Model quality is converging faster than most teams expected. As of early 2026, LMSYS Chatbot Arena — which aggregates nearly five million human preference votes across 296 models — shows frontier models clustered within a few Elo points of each other.

Zylos Research’s January 2026 benchmark analysis found leading models scoring above 88% on MMLU, a level that would have marked a meaningful performance gap just twelve months earlier.

The differentiation will not come from which foundation model you pick. It will come from how well your system retrieves, remembers, and routes across the full lifecycle of a user interaction.

Embeddings are what make that possible. They connect memory to retrieval, retrieval to routing, routing to coordination, and coordination to user experience. They are not a backend detail. They are a design decision that compounds across every feature you ship.

Product leaders who understand this layer will catch failures before users do. The ones who delegate it entirely will keep shipping agents that perform in demos and fall apart in production. The model is not the bottleneck. The context system is. Build accordingly.

Frequently Asked Questions

What are embeddings in AI agents?
Embeddings are numerical vector representations of text, code, or data that encode semantic meaning. In AI agents, they power four core functions: retrieval from knowledge bases, memory across sessions, semantic routing between tools and sub-agents, and personalization from user history. Every time an agent finds relevant context or remembers past information, it relies on embeddings.

Are embeddings only used for RAG in AI agents?
No. Retrieval-augmented generation is one use case among many. Embeddings also power memory across sessions, semantic routing between agents and tools, personalization based on user behavioral history, and anomaly detection. Every time an agentic system finds something relevant, recognizes a similar pattern, or organizes data by meaning, it is using the same embedding infrastructure.

How do embeddings improve AI agent memory?
Embeddings encode past interactions as vectors stored in a vector database. When the agent needs relevant context from a prior session, it converts the current query into a vector and retrieves the closest semantic matches. Google’s Agent Development Kit (ADK) treats this as a first-class architectural requirement, separating short-term session memory from long-term persistent memory retrieved via vector similarity search.

What is semantic routing in multi-agent systems?
Semantic routing uses embedding similarity to match an incoming query or task to the most appropriate agent, tool, or knowledge base. Unlike rule-based routing, it generalizes across varied user language. Research on the DyTopo system found embedding-based semantic routing produced a 6.2% improvement over fixed routing rules across code generation and reasoning tasks.

Why should product leaders care about embeddings for AI agents?
Embedding quality is a product decision, not just an engineering one. The choice of embedding model, chunking strategy, vector store refresh schedule, and retrieval approach all directly determine user experience. Product leaders who understand these choices identify context failures before users encounter them — and ship agents that hold up beyond the demo.

From Zero To 100,000: The Questions We Set Out To Answer

Arsh Shah Dilbagi — Wed, 11 Mar 2026 12:00:48 GMT

TLDR: How do LLMs actually work? How do you build reliably with them? How do you know if they’re working in production? These were the questions nobody was answering clearly in 2025. So we built Adaline Labs for the people asking them: the AI PM, the early-stage founder, and the engineer who became their team’s de facto AI lead. One year. 100,000 readers. Here’s the story.

When we published the first post on Adaline Labs, we had a simple and maybe naive belief that the people building AI products were being underserved by the content around them.

There was plenty of research. Plenty of hype. Plenty of “AI will change everything” takes. What was harder to find was something practical, honest, and aimed at the person actually responsible for shipping an AI feature or building an AI product: the product manager or product leader, the early-stage founder, and the engineer who just became their team’s de facto AI lead.

That was the gap we wanted to close. And one year later, with over 100,000 of you reading, we think we were onto something.

Here is what we set out to answer and what we learned along the way.

The First Question: “What Even Is This Thing?”

In early 2025, most product leaders we spoke to were in a strange position. They were being asked to build with LLMs without really understanding how they worked. Not at a research level, that was never the point, but at a product level. Enough to make good decisions.

So we started from the ground up.

What are embeddings, and why do they matter for search? How does attention work, and what does that mean for context limits? What is test-time scaling, and why is reasoning so expensive? What even is an agentic LLM?

These were not academic questions. They were the questions a PM would ask before a planning meeting, and couldn’t find a clean answer to. We wrote them for that person.

The audience was not looking for a shortcut. They wanted to actually understand; they just needed someone to explain it without the jargon.

Posts like "What PMs Need to Know About Transformers" and "Understanding Attention Mechanisms in LLMs" became some of our most widely shared pieces. What surprised us was the enormous appetite for this content.

The Second Question: “Okay, But How Do I Build With It?”

Once we established the fundamentals, the natural next question arrived: how do you actually go from model to product?

This is where things got interesting and where the content got more opinionated.

We wrote extensively:

About prompt engineering, not as a parlour trick, but as a genuine product discipline.
About tool calling, and how to write effective functions that your LLM can actually use.
About RAG systems, agentic workflows, and the moment when your product stops being “an app with AI” and starts being “an AI-native product.”

We also started writing about the mistakes, such as context rot, token burnout, and how an LLM product can quietly degrade in production without anyone noticing until users start churning.

Product leaders were not intimidated by the technical depth. They were hungry for it. The more specific and precise we got, including actual code, actual prompt structures, and actual failure modes, the more the audience grew.

The Third Question: “How Do I Know If It's Working?”

This one took us longer to articulate, but it became the thread that tied everything together.

You can build a beautiful agentic product. You can have great prompts, well-designed tool calls, and a thoughtful RAG setup. And then it goes to production, and you have no idea what’s actually happening.

Is the LLM hallucinating?
Is a tool call failing silently?
Is your prompt behaving differently at 10 pm than it does at 10 am?
Is latency spiking for a specific type of user query?

This is the evaluation and observability problem. And it turns out to be the most pressing problem in AI product development right now.

A complete observability trace in Adaline.

We published pieces on LLM observability, eval frameworks, LLM-as-a-judge, and continuous evaluation in production.

And then, in 2026, it became the central thesis: observability is the operating system for reliable LLMs.

Interestingly, this resonated not just with engineers, but with product leaders who finally had a language for why their AI products felt unpredictable. They were not imagining things. The systems were genuinely hard to see inside, and that was fixable.

Our Readers Shaped This Newsletter

Everything we know about our audience comes from listening closely and constantly. These were the consistent signals our readers kept sending us:

How do LLMs actually work?
How do I build reliably with them?
With new models dropping every month, how do I integrate them into existing workflows?
Which model suits which part of the workflow?
Which tool (Cursor, Claude Code, Codex, etc.) can product leaders and builders use to enhance their productivity?
How do I know if it is working in production?

We did not pick our topics. Our readers did. We researched, studied, executed, and wrote about them. Over time, those signals pointed to a clear set of content pillars and a clear center.

The five content pillars of Adaline Labs and where they intersect.

The diagram above captures something we did not plan but discovered over the year. Evals and Observability are not standalone topics. They live at the intersections. They are the connective tissue between understanding AI, building with it, and shipping it with confidence.

What We Believe Now That We Didn’t When We Started

A year in, here are the things we believe more firmly than when we started:

The PM is the most important person in an AI product team. Not because they write code, but because:

They hold the product vision.
They understand the user and serve as the connective tissue between what the model can do and what it should do.

Equipping that person matters more than we initially realized.

Fundamentals compound. The readers who understood embeddings and attention early are now the ones thinking clearly about context engineering and agentic architecture. There are no shortcuts in this field. But there are faster paths, and that’s what we tried to build.

The hardest problems are not technical. They are judgment problems. For instance:

When do you use a smaller, faster model vs. a frontier one?
When is a RAG system the right call vs. fine-tuning?
When do you add an eval layer vs. ship-and-learn?

These are the decisions our readers face every week, and they need frameworks, not just tutorials.

An audience of 100,000+ people is both humbling and clarifying. Humbling because this community chose to spend its attention here, every week, amid everything competing for it. Clarifying because the scale of the response tells us something: there is a massive, underserved audience of people building at the frontier of AI who want to think rigorously, not just move fast.

What Comes Next

The questions are getting harder. And we believe this is what unfolds in 2026:

AI agents become real production infrastructure.
Evals and observability move from nice-to-have to non-negotiable.
AI coding agents change how teams ship.
Product work gets redefined when everyone can build.

We are going to keep following the questions. The ones our readers are wrestling with. The ones that do not yet have clean answers but deserve clear thinking.

Thank you for being here for year one.

The questions get harder. Our answers get clearer.

Sub-Agents For Product Managers: Stop Directing A Tool. Start Running A Team.

Nilesh Barla — Sat, 07 Mar 2026 01:00:49 GMT

TLDR: PMs are running workflows through chat windows. That’s the wrong architecture. This blog breaks down why the chatbot model has a structural ceiling, not a prompting problem, and what actually changes when you replace it with orchestrated, parallel, and workspace-native agents. It covers the three constraints killing your current setup, how sub-agents actually work, when to use them and when not to, and what the PM role becomes once the architecture shifts.

It is 2026, and I still find many product managers using AI the same way they use Google: type a question, get a response, and act on it.

The interface is a text box.
The output is text that you copy elsewhere.
The workflow is: prompt, read, paste, and repeat.

This works, but it has a ceiling. And it is not a ceiling of model intelligence.

Claude 4.6, GPT-5.3, and Gemini 3.1 are all capable of more than what a single chat thread lets you access. The ceiling isn’t the model. It’s the architecture you’re running it through. A chatbot is one assistant, one context window, one sequential thread. Every interaction starts with what you type. Every output ends up in your clipboard.

Sub-agents for product managers aren’t a new feature inside that model. They’re a replacement for the model itself.

The change is from directing a single assistant to orchestrating a team.
And the product teams that have made this shift aren’t just working faster, they’re also working differently.

Research, spec drafting, and backlog triage used to happen one at a time. Now they happen in parallel, each handled by a specialized agent, each returning a structured result to an orchestrator, the PM, who synthesizes and decides.

This article is about the mental model behind that shift.

Not a tutorial.

Not a setup guide.

It is a framework for understanding what sub-agents are, why the interface you run them from matters, and what the PM role actually looks like once the architecture changes.

The Single-Assistant Ceiling

The chatbot model has three structural constraints that no amount of improved prompting can solve.

The first is statelessness.

Every session starts from zero. The model has no memory of your product, your codebase, or what you decided last Tuesday unless you paste it back in.

ChatGPT and Claude (Web) do have memory functionality, but every chat reads from the same shared memory space. As a result, information leaks into projects that don’t need it. To put it another way, personal, private, and professional context gets mixed up.

In this case, PMs become context managers. They have to:

Maintain long system prompts.
Copy documentation into chat windows.
Manually filter content and bridge the gap between what the AI needs to know and what it actually knows.

The intelligence is there, but the continuity isn’t.

The second constraint is single-threading.

Only one task happens at a time. If you’re using an agentic AI product manager setup, you’ve probably felt this. You ask the model to research a competitive feature, then draft a spec, then break it into tickets. Each task waits for the previous one.

The model is capable of doing all three — just not at once, not in separate contexts, not in parallel.

Complex PM work rarely has that kind of serial structure. Real product work leverages parallelization because it is faster and more efficient.

The third constraint is isolation from the environment.

A chatbot suggestion lives in a chat window. The action it recommends lives elsewhere — in Jira, in Notion, in a Figma file, or in a codebase. It takes manual effort to bring together “AI output” and “real artifact.”

As a PM, you are the integration layer. You copy the draft. You paste the ticket description. You take the suggestion and do something with it. The AI never touches the actual environment where work happens.

These aren’t complaints about specific products. They are structural properties of the chatbot interface. And together, they explain why product teams save roughly two hours a day through AI automation but watch those gains concentrate in routine, documentation-heavy tasks, not the complex, interconnected work that makes the biggest difference. The interface caps the upside.

The question isn’t how to prompt better inside the single-assistant model. It’s what happens when you replace the model altogether.

What Sub-Agents Actually Are

Sub-agents are not “more prompts.” They are a different architectural pattern. And understanding the pattern is the prerequisite to using it well.

In a sub-agent system, a parent agent — the orchestrator — decomposes a complex task and delegates pieces of it to specialized child agents. Each child agent, the sub-agent, operates in its own isolated context window.

It receives a prompt with exactly the context it needs.
Works autonomously using its assigned tools.
Returns a structured result to the parent.

The parent synthesizes those results and decides what happens next.
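
The pattern above can be sketched in a few lines of asyncio. This is an illustration under stated assumptions, not any vendor's implementation: `run_sub_agent` is a stub standing in for a real model call, and the three roles and their briefs are hypothetical.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class SubAgentResult:
    role: str
    summary: str  # condensed finding returned to the parent, not the raw working notes

async def run_sub_agent(role, brief):
    # Stand-in for a real model call. Each sub-agent sees only its own brief,
    # which is what "isolated context" means in this sketch.
    await asyncio.sleep(0.01)
    return SubAgentResult(role=role, summary=f"{role} finished: {brief}")

async def orchestrate(goal):
    # 1. Decompose the goal into briefs, one per specialized sub-agent.
    briefs = {
        "research": f"Find competitor approaches to: {goal}",
        "spec": f"Draft a one-page spec for: {goal}",
        "triage": f"List backlog tickets related to: {goal}",
    }
    # 2. Delegate: all sub-agents run concurrently.
    results = await asyncio.gather(*(run_sub_agent(r, b) for r, b in briefs.items()))
    # 3. Synthesize: the parent only ever sees the structured results.
    return {r.role: r.summary for r in results}

report = asyncio.run(orchestrate("saved searches"))
for role, summary in report.items():
    print(role, "->", summary)
```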

Three things make this fundamentally different from a single-assistant setup.

Context isolation.
Each sub-agent starts with a clean context. A research sub-agent exploring competitive positioning doesn’t share its context window with a spec-drafting sub-agent working on a feature brief. Neither pollutes the other’s focus.

And the orchestrator never sees the intermediate noise. It sees final results. This is how Anthropic’s multi-agent research system works:

A lead agent spawns sub-agents to explore different aspects of a question simultaneously, each returning condensed findings rather than raw search logs.

Anthropic’s engineering team — Jeremy Hadfield, Barry Zhang, and colleagues — documented a 90.2% improvement over single-agent performance on complex research tasks. Not because the model got smarter, but because the architecture distributes the cognitive load.

The orchestrator-worker pattern in practice. | Source: Anthropic Engineering, June 2025

Parallel execution.
Multiple sub-agents run simultaneously. This is what the Cursor community noticed when sub-agents shipped — that single-threaded prompting suddenly felt archaic.

Agents with real roles, customized skill sets, clean handoffs, deliberate execution.

That was the reaction, because that’s what becomes visible when you move from sequential to parallel.

From a PM standpoint, a research agent, a spec agent, and a triage agent can all be working simultaneously. Each in its own context and each returning to a shared orchestration layer when complete.

Specialization and model selection.
Each sub-agent is configured for its role. That includes its instructions, its tool access, and most importantly, its model.

A sub-agent doing deep reasoning on a product brief might run on Claude Opus.
A sub-agent performing rapid parallel searches might run on Claude Sonnet 4.6, GPT-5.3 Instant, or even Gemini 3.1 Flash, where speed matters more than depth.
A sub-agent working with long documents, such as research papers, transcript archives, and support logs, might run on Gemini, which is optimized for long-context retrieval.

The model choice stops being a single global setting and becomes a deliberate configuration decision per task type.
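
One way to picture that per-task configuration is a small table of sub-agent definitions. Everything here is illustrative: the role names, tool names, and model labels are placeholders, not real API model identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubAgentConfig:
    role: str
    model: str          # placeholder label, not a real API model ID
    tools: tuple        # hypothetical tool names this sub-agent may call
    instructions: str

TEAM = (
    SubAgentConfig("brief_reasoning", "claude-opus", ("read_docs",),
                   "Reason carefully about the product brief."),
    SubAgentConfig("fast_search", "claude-sonnet", ("web_search",),
                   "Run many quick searches and return short findings."),
    SubAgentConfig("long_documents", "gemini", ("read_docs",),
                   "Summarize long transcripts and support logs."),
)

def model_for(role, team=TEAM):
    """Look up which model a given role is configured to use."""
    for cfg in team:
        if cfg.role == role:
            return cfg.model
    raise KeyError(f"no sub-agent configured for role {role!r}")

print(model_for("fast_search"))
```

Keeping this in one reviewable structure means a model swap for one role is a one-line change instead of a global setting.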

This is what multi-agent product management actually means in practice: the PM defines the goal and the team's shape. The team executes in parallel. The results come back structured.

The community reaction to seeing this run — “makes single-threaded prompting feel archaic” — is the right reaction.

It’s not hyperbole.

It’s a recognition that the previous model had a ceiling you didn’t know you were hitting until you saw above it.

Why the Interface Matters: Chatbot vs Workspace-Native

Knowing what sub-agents are is half the model. The other half is understanding where they can run. Because the interface is not neutral. It shapes what’s possible.

A chatbot interface is isolated by design.

It processes text and returns text. It has no access to your files unless you paste or attach them. It has no connection to your tools unless you’ve explicitly described them in the prompt. It has no memory of your product unless you rebuild that context every session.

This is fine for answering questions. It is a structural constraint for orchestrating a team of agents that need to read your codebase, push to Jira, pull from Notion, and execute changes in real files.

Workspace-native tools solve this at the architecture level.

The clearest articulation of the distinction is this: ChatGPT works from pasted context. Cursor works from your actual project. That difference sounds obvious. Its implications run deep.

Dennis Yang, a PM at Chime, put it plainly after switching: “Cursor is a much better product manager than I ever was.”

He’s not talking about the model. He’s talking about the environment.

When a PRD is drafted inside the workspace, it references real API endpoints. The spec reflects what the team has actually built. Tickets are grounded in the codebase, not a description of it. The artifacts are real because the tool is connected to the environment where real work happens.

This matters specifically for sub-agents because sub-agents need plumbing.

A research sub-agent needs web search and internal documentation.
A spec-drafting sub-agent needs the product’s existing architecture.
A triage sub-agent needs to read from Jira or Linear and write back to it.

None of this is possible inside a stateless chat window.

The Model Context Protocol (MCP) is what makes it possible in workspace-native tools: a standardized layer that connects agents to external tools and files as first-class capabilities, not workarounds.

YC’s Spring 2026 Request for Startups named “Cursor for Product Management” as an official startup category.

Naval Ravikant told his 3M+ followers that vibe coding is the new product management. Both point to the same underlying shift: the environment where PMs work is moving from specification documents to executable workspaces.

Source: Naval on X.

The AI agent workflow that matters isn’t the one in the chat window. It’s the one running inside the environment where decisions become artifacts.

The PM as Orchestrator: What the Role Actually Becomes

When the interface changes, the role changes. Not in the direction most PMs expect.

The shift from chatbot to sub-agent orchestration is not primarily a technical shift. PMs who make this transition don’t need to become engineers.

What they need to become is more precise about goals, constraints, and boundaries. Because in an orchestrated system, the PM is not directing each step. The PM is defining the brief. The agents figure out the steps.

This is actually a familiar mental model.

A PM working with a research team, a designer, an engineer, and a data analyst doesn’t tell each person exactly what to type. They define the objective, constraints, output format, and handoff structure.

The team figures out the execution.

Sub-agent orchestration is the same mental model applied to AI agents. The PM provides the brief, not the method.

What changes is the cost of imprecision. A vague goal given to a human engineer prompts a conversation, a clarifying question, and a back-and-forth. A vague goal given to a sub-agent produces an output — confident, well-formatted, and possibly wrong in ways that are hard to catch.

The orchestrator’s core competency becomes writing goals precise enough that agents don’t hallucinate arbitrary decisions to fill in the gaps. This is what product teams are starting to call “executable specs.” Essentially, they are requirements so specific that they function almost as instructions. It is the PM skill that matters most in a sub-agent world.

What the PM stops doing is acting as the integration layer.

In the chatbot model, the PM is the one who carries information between tools — from AI to Jira, from research to spec, from spec to engineer. In a well-designed orchestration system, agents handle those handoffs. The PM’s time shifts toward judgment calls: which goals to prioritize, which agent outputs to synthesize, which results to challenge.

Jim Allen Wallace of Redis cited a projection that 40% of agentic projects will be canceled by the end of 2027. And it isn’t primarily an engineering failure. It’s a coordination failure. Teams underestimate the design work required to define:

Clean handoffs between agents.
Precise enough goals to prevent hallucination drift.
Clear enough scope boundaries to keep agents from doing work that conflicts.

Getting orchestration right is a product design problem. Which means it’s a PM problem.

When Sub-Agents Are the Right Call

Sub-agents are not the answer to every PM problem. The overhead is real and should be taken seriously.

Each sub-agent runs in its own context window, which means each one consumes tokens independently. Anthropic’s engineering team found that multi-agent architectures use roughly fifteen times more tokens than standard chat interactions. That’s an economic reality, not a footnote.
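
A quick back-of-the-envelope calculation shows what that multiplier means for a budget. The token count per chat and the price per million tokens below are made-up numbers, and the sketch treats "fifteen times" as a flat 15x multiplier.

```python
def estimate_cost(chat_tokens, multiplier, price_per_million_tokens):
    """Return (total tokens, cost in USD) for a workload scaled by `multiplier`."""
    tokens = chat_tokens * multiplier
    return tokens, tokens / 1_000_000 * price_per_million_tokens

CHAT_TOKENS = 20_000      # hypothetical tokens for one standard chat interaction
PRICE_PER_MILLION = 5.00  # hypothetical blended price in USD per million tokens

single_tokens, single_cost = estimate_cost(CHAT_TOKENS, 1, PRICE_PER_MILLION)
multi_tokens, multi_cost = estimate_cost(CHAT_TOKENS, 15, PRICE_PER_MILLION)

print(f"single chat: {single_tokens} tokens, ${single_cost:.2f}")
print(f"multi-agent: {multi_tokens} tokens, ${multi_cost:.2f}")
```

Under these assumptions a task that costs about ten cents in chat costs about a dollar and a half as a multi-agent run, which is easy to justify for a competitive analysis and hard to justify for a one-line rewrite.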

Sub-agents are worth it when the task’s value justifies the cost and when the task’s structure actually suits parallel execution.

Use sub-agents when:

The task is genuinely too large for a single context window.
Distinct parallel workstreams exist that don’t depend on each other’s output.
Different parts of the task benefit from different model strengths: deep reasoning, fast retrieval, and long-context analysis.

Don’t use sub-agents when:

The task is simple, sequential, and fits comfortably in a single context.
All agents need to share the same context to make decisions (this breaks context isolation, eliminating the primary benefit).
The coordination overhead of designing handoffs and synthesizing outputs exceeds the time the parallelism saves.
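
The two checklists can be folded into a rough decision helper. This is only one possible encoding of the criteria above; the field names and the way the conditions are combined are assumptions, and real decisions will need judgment the function cannot capture.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    fits_single_context: bool
    independent_workstreams: int
    needs_shared_context: bool
    benefits_from_mixed_models: bool
    coordination_hours: float          # time spent designing handoffs and synthesis
    hours_saved_by_parallelism: float

def should_use_sub_agents(task):
    # Hard stops: shared context defeats isolation, and overhead can eat the gains.
    if task.needs_shared_context:
        return False
    if task.coordination_hours >= task.hours_saved_by_parallelism:
        return False
    # Otherwise, any one of the "use when" conditions is enough.
    return (
        not task.fits_single_context
        or task.independent_workstreams >= 2
        or task.benefits_from_mixed_models
    )

launch_review = TaskProfile(False, 3, False, True, 1.0, 4.0)
quick_rewrite = TaskProfile(True, 1, False, False, 0.5, 0.0)
print(should_use_sub_agents(launch_review), should_use_sub_agents(quick_rewrite))
```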

Single-agent approaches often outperform multi-agent in production for tightly sequential tasks.

Complexity is not a virtue.

The orchestrator’s job is to match the architecture to the task. And sometimes the right call is one agent, one context, one clean result.

Conclusion

The chatbot is not going away. But it’s already not the ceiling; it’s the floor.

The PMs who are pulling ahead aren’t using better prompts inside the single-assistant model. They’re designing systems: specialized agents with defined roles, parallel execution, clean handoffs, and workspace-native environments where AI output lands as real artifacts, not clipboard text.

The mental model shift is from user to orchestrator. From “how do I ask this better?” to “how do I design a team that handles this without me acting as the integration layer?”

That transformation requires precision: in goal-setting, in constraint definition, and in understanding which tasks justify the architecture and which don’t.

It requires tools that are connected to the actual environment where work happens, not isolated chat windows. And it requires a different relationship to AI: not a tool you direct, but a team you run.

The question to sit with: what is the most complex workflow you currently manage by copying responses from a chatbot into five other tools?

That’s the first candidate.

Not because sub-agents make it trivially easy; they actually don’t. But because that workflow has already exposed the ceiling of the model you’re in.

The architecture exists to go above it.

AI Observability And Evaluations: The Operating System For Reliable LLM Products

Arsh Shah Dilbagi — Wed, 04 Mar 2026 13:02:50 GMT

TLDR: Most LLM products don’t crash. They quietly leak trust, safety, and budget. Silent failure is the default failure mode, and most teams never see it coming. This is a practical guide for engineers and PMs shipping LLM features in production. You will leave with a concrete framework for instrumenting runs, versioning prompts, designing rubrics, catching silent failures, and switching models without fear. The moat is measured improvement, not prompt cleverness.

Introduction

Why LLM Products Break Quietly Without Observability

When I build LLM features, I do not worry about clever prompts first. What I worry about is that the team can’t see what the system is doing when it fails.

In this blog, I am making the case that reliability starts with visibility, not vibes.

The motivating question is simple. What is the equivalent of GitHub plus unit tests for an LLM application where the behavior is shaped by prompts and shifting context? Without that substrate, teams ship changes they cannot review, cannot regress, and cannot explain.

Silent failure becomes the default failure mode. The output looks coherent, the user seems satisfied, and the product metrics stay flat.

Underneath, the system may be wrong, unsafe, or quietly violating policy. That is why I treat observability and evaluations as the reliability layer. They turn unknown behavior into inspectable behavior, then measurable behavior.

Tool use raises the stakes. Once a model can act, a conversation becomes an execution surface. For instance, if the app can issue refunds, the “executable code” can be embedded in the chat thread itself.

The incident pattern is quite familiar.

A support bot approves a refund it should not, the customer is happy, and the mistake only shows up later as leaked margin and policy debt.

Key points I’m making:

LLM apps need a review and regression discipline comparable to code.
Silent failure is more common than loud failure.
Tool calls convert text into real operational risk.
Observability plus evals create accountability for behavior.

How I’d implement this:

Instrument every run with prompt version, context, tool calls, cost, and latency.
Sample real cases and curate a small starting dataset.
Run a small eval set on every change.
Monitor for drift and escalate failures into the dataset.
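
As a rough sketch of the first step, the wrapper below records one structured entry per run with the fields listed above. The `RunRecord` shape, the `traced_run` helper, and the fake model are all hypothetical; a real setup would send these records to an observability backend rather than a Python list.

```python
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class RunRecord:
    run_id: str
    prompt_version: str
    context_ids: list                       # which retrieved documents were in the prompt
    tool_calls: list = field(default_factory=list)
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    output: str = ""

def traced_run(prompt_version, context_ids, user_input, call_model, sink):
    """Call the model and append one structured record to `sink`."""
    started = time.perf_counter()
    output, tool_calls, cost_usd = call_model(user_input)
    record = RunRecord(
        run_id=str(uuid.uuid4()),
        prompt_version=prompt_version,
        context_ids=list(context_ids),
        tool_calls=tool_calls,
        cost_usd=cost_usd,
        latency_ms=(time.perf_counter() - started) * 1000,
        output=output,
    )
    sink.append(asdict(record))
    return record

def fake_model(user_input):
    # Stand-in for a real model call: returns (output, tool calls, cost in USD).
    return "Refund denied: outside 30-day window", [{"tool": "lookup_order", "ok": True}], 0.0004

log = []
rec = traced_run("refund-policy@v3", ["kb-112"], "Can I get a refund?", fake_model, log)
print(log[0]["prompt_version"], len(log[0]["tool_calls"]))
```

Once every run leaves a record like this, sampling real cases into a dataset and escalating failures into it become queries over the log instead of guesswork.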

Next, I will reframe prompts as business logic you have to govern.

Prompts Are Executable Business Logic In Production

When I say prompts matter, I do not mean prompt wording as a copywriting exercise.

The evolution of prompts from punch cards in the 1950s. | Source: Stanford CS 224G: AI Observability & Evaluations | Guest Lecture by Arsh Shah Dilbagi

I mean prompts as runtime logic that drives what the system does.

In production, a prompt is not configuration text. It becomes executable business logic as soon as the model is embedded inside a product that can read data and take action.

The program is not a single string. The program is the assembled runtime bundle that the model receives and acts on. If you do not model it as a bundle, you cannot reason about behavior. You end up debugging the wrong layer, then shipping fixes that only work on one happy-path input.

Prompts are more than words; they define your business, product, logic, and much more.

The runtime bundle includes:

System and developer instructions.
Dynamic variables and session state.
Retrieved context.
User input, untrusted.
Tool permissions and safety constraints.
Runtime parameters, model version, and temperature.

I plan for instruction conflicts because they occur in real systems. A user message can contain a directive that tries to override the instruction layer.

A retrieved document can contain hidden instructions that pull the model off task.

The model may still produce fluent output even when following the wrong instruction, which is why this failure is hard to notice without measurement. This maps directly to the prompt-injection risk category in standard LLM threat models.

Key points I’m making:

The prompt bundle is the real program, not the UI chat box.
Untrusted inputs create instruction conflicts by default.
Tool permissions turn text into operational decisions.
Reliability requires governance, not prompt folklore.

How I’d implement this:

Version prompts and treat edits like code changes.
Require diffs for every prompt revision.
Maintain rollback points for prompt and model versions.
Assign ownership per prompt surface area and workflow.
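
To make "treat edits like code changes" concrete, here is a minimal sketch using only the Python standard library. The helper names `prompt_version` and `prompt_diff`, and the two example prompts, are my own illustrations, not part of any specific tool. The idea is a content hash as a stable version identifier and a unified diff as the review artifact.

```python
import difflib
import hashlib


def prompt_version(text: str) -> str:
    """Short content hash: identical prompt text always maps to the same id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def prompt_diff(old: str, new: str) -> str:
    """Unified diff between two prompt revisions, reviewable like a code change."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"prompt@{prompt_version(old)}",
        tofile=f"prompt@{prompt_version(new)}",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical prompt revisions, for illustration only.
old = "You are a support agent.\nRefunds require manager approval."
new = "You are a support agent.\nRefunds under $50 are auto-approved."

diff_text = prompt_diff(old, new)
print(diff_text)
```

Storing the hash on every trace is what later lets you say which prompt revision produced a given run, and keeping old revisions addressable by hash gives you the rollback point.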

If this is runtime logic, I need runtime traces.

What Observability Means For LLM Systems

I have a narrow definition of observability for LLM systems. I want to reconstruct a run the same way I would reconstruct a production incident in any other distributed system. If I only log the final output, I am guessing.

In practice, observability means end-to-end traceability across prompt assembly, retrieval, tool calls, and outputs, with enough context to explain why a specific response happened.

A complete observability trace in Adaline.

Readable traces matter because they reduce debugging time, make ownership clear, and let me iterate without shipping blind changes. When the trace is legible, a failure becomes a concrete artifact, not a debate.

Trace checklist:

Prompt template version and assembled prompt, to separate static instructions from dynamic variables.
User input, to capture the untrusted trigger.
Retrieved context payload plus retrieval metadata, to validate what the model actually saw.
Tool calls, arguments, responses, and side effects to audit real actions.
Model identifier, version, and runtime parameters, to attribute behavior to runtime choices.
Token usage and estimated cost, to catch budget regressions.
Latency breakdown, to localize slow spans, including model server time.
Final output and structured output if present, to verify compliance and formatting.
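
One way to make this checklist enforceable is to turn it into a schema. The sketch below is a minimal Python version, not Adaline's actual trace format. The class and field names (`RunTrace`, `ToolCall`, `latency_ms`, and so on) and the example values are assumptions chosen to mirror the checklist.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    response: Any
    side_effects: list[str] = field(default_factory=list)


@dataclass
class RunTrace:
    prompt_template_version: str           # static instructions
    assembled_prompt: str                  # template plus dynamic variables
    user_input: str                        # the untrusted trigger
    retrieved_context: list[dict[str, Any]]
    tool_calls: list[ToolCall]
    model: str
    runtime_params: dict[str, Any]
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    latency_ms: dict[str, float]           # per-span breakdown
    final_output: str
    structured_output: Optional[dict[str, Any]] = None

    def to_json(self) -> str:
        """Serialize the whole run, nested tool calls included."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical run, for illustration only.
trace = RunTrace(
    prompt_template_version="refund-v7",
    assembled_prompt="You are a support agent. Customer tier: gold.",
    user_input="I want a refund for order 1182.",
    retrieved_context=[{"doc_id": "policy-4.2", "score": 0.81}],
    tool_calls=[ToolCall("issue_refund", {"order_id": 1182}, {"ok": True}, ["refund_created"])],
    model="example-model",
    runtime_params={"temperature": 0.2},
    input_tokens=812,
    output_tokens=96,
    estimated_cost_usd=0.0031,
    latency_ms={"retrieval": 120.0, "model": 940.0, "tools": 310.0},
    final_output="Your refund for order 1182 has been issued.",
)
print(trace.to_json())
```

Making every field required (except the optional structured output) is deliberate: a run that cannot populate the schema fails at logging time rather than at incident time.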

When I see a bad answer, the trace tells me where to look.

Here, Adaline’s observability dashboard shows an answer quality of 0.65, which isn’t good. The reason is poor retrieval quality.

If the retrieval returned irrelevant context, I fix the retrieval. If tool calls are wrong, I fix tool selection and constraints. If the same input flips behavior after a prompt edit, I fix the prompt structure, not the dataset.

Key points I’m making:

Observability is traceability across the full run, not output logging.
Readable traces create accountability and speed up iteration.
Cost and latency are first-class failure signals.
Tool call visibility is non-negotiable once actions are in place.

Prompt versioning and deployment in Adaline.

How I’d implement this:

Standardize a trace schema and enforce it for every run.
Store prompt versions and attach them to every trace.
Log retrieval inputs and outputs with stable identifiers.
Capture tool calls as structured events with side effects.
Add a weekly review of failed traces and recurring patterns.

Once you can see runs, you can classify failures.

The Silent Failure Taxonomy I Built Around

Silent failures do not crash the system. They leak trust, safety, and budget a little at a time. In the lecture, I push on this because you can ship something that looks fine, then wake up to a week of damage that never showed up as an error page.

To tackle this, I built categories around these failures, because monitoring and evaluation need targets. A taxonomy keeps the team from treating every issue as a prompt problem.

It also keeps alerts honest. I believe you can only alert on what you can name and measure.

Being hyperspecific about details is the key here.

Taxonomy I use in practice:

Policy failures that look like success: Signals to monitor include tool call policy violations and missing approvals.
Security failures, prompt injection, and instruction conflicts: Signals to monitor include override patterns and tool intent that contradict constraints.
Cost and latency failures, token blowups, loops, OCR weirdness: Signals to monitor include token spikes, repetition, and timeouts.
Correctness failures masked by fluency: Signals to monitor include missing citations, schema drift, and low agreement with the provided sources.

The incident I plan for is boring, which is the point.

We switched to an OCR workflow, everything looked normal, then costs spiked. The model started appending long runs of spaces, producing around 100,000 characters when 5,000 would have been enough.

Now, customers did not notice at first. But the trace made it obvious, so we tightened the prompt and added a cost guardrail.
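
A guardrail for this class of incident can be very small. The sketch below is one possible shape, not the check we actually shipped. The thresholds `MAX_OUTPUT_CHARS` and `MAX_WHITESPACE_RUN` are assumed values that you would tune per workflow.

```python
import re

# Assumed budgets, for illustration only. Tune them per workflow.
MAX_OUTPUT_CHARS = 20_000
MAX_WHITESPACE_RUN = 200

# Matches more than MAX_WHITESPACE_RUN consecutive whitespace characters.
_WS_RUN = re.compile(r"\s{%d,}" % (MAX_WHITESPACE_RUN + 1))


def check_output(text: str) -> list[str]:
    """Return the names of violated guardrails; an empty list means the output is fine."""
    violations = []
    if len(text) > MAX_OUTPUT_CHARS:
        violations.append("output_too_long")
    if _WS_RUN.search(text):
        violations.append("whitespace_run")
    return violations


clean = check_output("Invoice total: 42")
padded = check_output("Invoice total: 42" + " " * 95_000)
print(clean, padded)
```

The check runs on the raw output before any judge or downstream parser sees it, so a padded response is flagged as a cost failure even though its visible content looks correct.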

Key points I’m making:

Failures show up as drift, not downtime.
A taxonomy makes monitoring targets and alerts concrete.
Security and cost issues can hide behind good-looking text.

How I’d implement this:

Map each category to a small set of measurable signals.
Alert on deltas, not absolutes, for cost and latency.
Triage from traces, then promote repeats into eval datasets.
Add a post-incident rule that prevents the same class from returning.
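
"Alert on deltas, not absolutes" can be sketched as a comparison against a rolling baseline. This is a minimal illustration under my own assumptions: the function name `delta_alert`, the choice of the median as the baseline, and the 2x ratio are not prescribed anywhere above.

```python
from statistics import median


def delta_alert(history: list[float], current: float, max_ratio: float = 2.0) -> bool:
    """True when `current` exceeds the median of recent values by more than `max_ratio`."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return current > 0
    return current / baseline > max_ratio


# Hypothetical per-run costs in USD.
recent_costs = [0.010, 0.012, 0.011, 0.009]
print(delta_alert(recent_costs, 0.011))  # close to the baseline
print(delta_alert(recent_costs, 0.200))  # far above the baseline
```

A median baseline is less sensitive to a single earlier spike than a mean, which keeps one bad run from raising the threshold for the next one.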

To evaluate any of this, I need representative cases.

Evaluations Start With Sampling The Real Distribution

When I watch teams build LLM features, the demo is rarely the hard part. The demo is one clean input, one clean output, one clean conclusion.

Production is a distribution, and the distribution is where behavior fractures.

A demo lies because it compresses variability into a single scenario. It hides messy inputs, conflicting instructions, and long tail formats. It also hides drift.

A prompt can look stable on five hand-picked examples, then break on day three because a new user arrives with a new intent. This is a very common issue.

So, how to tackle it?

I start evaluations by sampling the real distribution.

My baseline is simple. I take about 20 representative cases that look like what I expect to see in production, I run them, and I ship.

Then I expand the set using the evidence provided by production.

Observability supplies the raw material.

Traces become cases, cases become datasets, datasets become evaluations.

OpenAI’s evaluation guidance makes the same point. Mix production data with expert-curated cases, keep adding edge cases, and keep the set growing as you learn.

Key points I’m making:

One clean example hides the distribution.
A small representative set beats intuition.
Traces are the source of evaluation data.
Datasets must evolve with customers and inputs.

How I’d implement this:

Seed the first dataset from traces whenever possible.
Include messy and adversarial inputs in the first 20.
Add failures and near failures every week.
Refresh the dataset when the customer types or document formats change.
Tag cases by intent and input modality for coverage checks.
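
The coverage check in the last step can be as simple as counting cases per tag combination. A minimal sketch, assuming each case is a dict carrying hypothetical `intent` and `modality` tags:

```python
from collections import Counter
from itertools import product


def coverage_gaps(cases, intents, modalities, min_cases=1):
    """List (intent, modality) pairs that have fewer than `min_cases` examples."""
    counts = Counter((c["intent"], c["modality"]) for c in cases)
    return [pair for pair in product(intents, modalities) if counts[pair] < min_cases]


# Hypothetical dataset tags, for illustration only.
cases = [
    {"intent": "refund", "modality": "text"},
    {"intent": "refund", "modality": "pdf"},
    {"intent": "billing", "modality": "text"},
]
gaps = coverage_gaps(cases, ["refund", "billing"], ["text", "pdf"])
print(gaps)
```

Any pair that comes back is a slice of the distribution the dataset does not yet represent, which is where the next sampled traces should come from.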

I have seen a new customer type break assumptions overnight. The trace showed the same prompt behaving differently because the inputs shifted, not because the model changed. The dataset made that visible fast, then the fix became measurable.

Now I can talk about evals as a feedback loop.

Evaluation Is A Feedback Loop, Not A Unit Test Suite

I have a strong view on evals because I have watched good systems fail for boring reasons. A prompt change sounds better to a human, but it makes production worse.

So, I am making the explicit claim that evals are feedback loops, not deterministic unit tests.

Essentially, their job is to keep me shipping while protecting the downside. I run them to catch regressions when I edit prompts, to switch models without fear, and to detect drift once the system is live.

Perfect coverage is impossible because users will always do something you did not anticipate.

That is fine.

The goal is not perfection.

The goal is fast learning with controlled risk.

Starter eval set I begin with:

Schema and format adherence, so outputs stay parseable.
Tool and policy compliance to keep actions permitted.
Citation or reference presence where required, so answers stay auditable.
Refusal correctness for disallowed requests, so boundaries hold.
Groundedness to provided context, so answers do not drift from inputs.
Cost gate or latency gate, so the product stays within constraints.
Retrieval sanity check, so the model is not reasoning on garbage context.
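
Several of these checks are deterministic and fit in a few lines. The sketch below covers three of them (schema adherence, citation presence, and a cost gate) and reports a pass rate per check. The case shape, the required fields `answer` and `citations`, and the $0.05 ceiling are assumptions for illustration.

```python
import json
from typing import Callable

Case = dict  # hypothetical shape: {"output": str, "cost_usd": float}


def schema_ok(case: Case) -> bool:
    """Output parses as a JSON object containing the required fields."""
    try:
        data = json.loads(case["output"])
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"answer", "citations"} <= set(data)


def has_citation(case: Case) -> bool:
    """At least one citation is present; fails automatically if the schema check fails."""
    return schema_ok(case) and len(json.loads(case["output"])["citations"]) > 0


def within_cost(case: Case, max_usd: float = 0.05) -> bool:
    return case["cost_usd"] <= max_usd


CHECKS: dict[str, Callable[[Case], bool]] = {
    "schema": schema_ok,
    "citation": has_citation,
    "cost": within_cost,
}


def run_evals(cases: list[Case]) -> dict[str, float]:
    """Pass rate per check across the dataset."""
    return {name: sum(check(c) for c in cases) / len(cases) for name, check in CHECKS.items()}


# Hypothetical cases, for illustration only.
cases = [
    {"output": '{"answer": "Refund denied", "citations": ["policy-4.2"]}', "cost_usd": 0.01},
    {"output": '{"answer": "Refund approved", "citations": []}', "cost_usd": 0.02},
    {"output": "Sure! Here is your refund.", "cost_usd": 0.09},
]
results = run_evals(cases)
print(results)
```

Reporting a rate per check, rather than one blended score, keeps the result tied to a layer you can fix: a falling schema rate points at the prompt, a falling cost rate points at the output length or the model.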

Here is a mini example from real work.

I have seen a small prompt change that helped one slice of cases and failed another, like drug A versus drug B.

The new prompt read cleaner, then broke the distribution. A basic eval suite made the regression visible before it became a support incident. This matches the eval-driven workflow OpenAI recommends, especially the practice of collecting production-like data and evaluating continuously.

Key points I’m making:

Evals exist to learn quickly, not to certify perfection.
They protect model switches, prompt edits, and production drift.
Coverage grows from failures, not imagination.

How I’d implement this:

Run the eval suite on every prompt or model change.
Label failures as prompt regression, retrieval regression, rubric mismatch, or distribution shift.
Fix the correct layer, then add the failing case to the dataset.
Track cost and latency gates as hard constraints, not nice metrics.

Evals only work if I define good as outcomes.

How I Design Rubrics From Product Outcomes

I design rubrics the same way I design product requirements. I start from what the user must be able to do next. If the rubric cannot predict the next action, it is taste, not engineering.

Evaluating prompts using LLM-as-a-judge metric with custom rubrics in Adaline.

Outcome-first grading means I translate the user goal into observable checks. A good rubric is specific about required fields, hard constraints, grounding to provided inputs, and safe tool behavior.

In high-stakes workflows, I do not pretend engineers can invent correctness. In my experience, the people who own prompts and the people who write rubrics are often domain experts, such as clinicians and finance specialists, because they know what the output must contain and what it must never do.

Here is what this looks like in practice. Micro rubric for a support response.

It acknowledges the user request in one sentence without adding new claims.
It applies the correct policy constraint for eligibility and required approvals.
It uses the provided account context and does not invent missing details.
It selects the correct tool action only when permitted and necessary.
It ends with the next step the user should take, if any.

Rubrics drift because products drift. You add customers, new input formats arrive, and the distribution changes.

When a system works for months and rubrics suddenly fail, I treat that as a signal that the rubric may need to change, not just the prompt.

Clear, detailed rubrics also make automated grading more reliable. This is why I write them like executable criteria rather than guidelines.

Key points I’m making:

I define good as a usable next step for the user.
Rubrics encode constraints, not stylistic preferences.
Domain experts define correctness in high-stakes domains.
Rubrics evolve with the input distribution.

How I’d implement this:

Assign rubric authorship to the domain owner for the workflow.
Review rubrics weekly using fresh failure cases from traces.
Update the rubric first when the distribution changes, then update the prompt.
Keep a change log so rubric edits are auditable.

Next, I will show how I scale these checks with model-based graders.

LLM As Judge, But Only Under Constraints

I use model-based judges or LLM-as-a-judge, because some checks do not reduce cleanly to code. Tone, completeness, and policy reasoning often need language understanding. A judge can also scale review across thousands of traces without turning the team into a labeling factory.

A working illustration of LLM-as-a-judge. | Source: A Survey on LLM-as-a-Judge

My rule is strict. I prefer pass/fail or a small set of named categories. I avoid numeric scoring. In the lecture I gave, I called numeric scoring out as the easiest way to cripple the entire system because confidence intervals and arbitrary scales do not stay consistent across runs.

When I need nuance, I use semantic labels that carry meaning, not numbers that float.

I ask for reasoning when the verdict depends on a rubric with multiple clauses. I want a short justification tied to rubric items, then the verdict. For everything that should be deterministic, I do not use a judge at all.

I validate schemas with code.

I gate tool calls with policy checks.

I block banned actions and formatting violations before any judge runs.

OpenAI also recommends structuring evaluations around criteria and using pass/fail or comparisons to improve reliability in judge workflows.

Key points I’m making:

Judges help with nuance, not with mechanics.
Binary beats numeric for stability.
Reasoning improves alignment with the rubric.
Deterministic constraints should stay deterministic.

How I’d implement this:

Provide a rubric with clear pass/fail examples.
Provide the full context, including retrieved snippets and the tool plan.
Require a short, grounded reason.
Output a verdict as pass or fail, or a named category.
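
Put together, the constraints look roughly like this. The sketch is an assumption-heavy illustration: `judge` is any callable that returns a JSON string, `fake_judge` stands in for a real model call, and the banned-phrase list is hypothetical. What it shows is the ordering (deterministic checks first, judge second) and the strict parsing of the verdict.

```python
import json
from typing import Callable

ALLOWED_VERDICTS = {"pass", "fail"}


def parse_verdict(raw: str) -> dict:
    """Accept only a JSON object with a short reason and a pass/fail verdict."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    verdict = data.get("verdict")
    reason = data.get("reason", "")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("judge must give a short reason")
    return {"verdict": verdict, "reason": reason.strip()}


def grade(output: str, banned_phrases: list[str], judge: Callable[[str], str]) -> dict:
    """Run deterministic checks first; call the judge only if they pass."""
    for phrase in banned_phrases:
        if phrase.lower() in output.lower():
            return {"verdict": "fail", "reason": f"deterministic: banned phrase {phrase!r}"}
    return parse_verdict(judge(output))


def fake_judge(output: str) -> str:
    """Stand-in for a model call; a real judge would receive the rubric and full context."""
    return json.dumps({"reason": "Applies the refund policy and cites the order.", "verdict": "pass"})


ok = grade("Your refund is approved per policy 4.2.", ["guaranteed"], fake_judge)
blocked = grade("Refund guaranteed!", ["guaranteed"], fake_judge)
print(ok["verdict"], blocked["verdict"])
```

Rejecting anything outside the named verdicts means a judge that drifts into "7/10" style answers fails loudly instead of silently polluting the pass rate.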

Once judging is stable, I run it continuously in production.

Continuous Evaluation In Production Is Where Reliability Compounds

Continuous evaluation is where reliability compounds. Monitoring is the keystone because it captures the real distribution, including the unknown unknowns, and turns them into something the team can act on.

I define continuous evaluation as lightweight checks applied to production traces. I do not wait for support tickets to tell me something drifted. I want the system to tell me first. That is the difference between a small regression and a week of silent damage.

Adaline allows you to continuously run evals in production. This acts like a feedback mechanism rather than a static unit test.

I describe running simple checks on every log and getting notified when a silent failure occurs before customers start getting upset. OpenAI makes the same recommendation with continuous evaluation tied to logs and ongoing case collection.

Alerts I treat as first class:

The pass rate dropped on a key rubric.
Token or cost spikes.
Tool call anomalies or policy violations.
Retrieval is empty or of low quality repeatedly.
Latency regressions by model or route.

Key points I’m making:

Monitoring shows the true distribution, not the demo distribution.
Continuous eval catches drift before users notice it.
Reliability improves when failures are made reusable as test cases.
Cost and latency are behavior signals, not only infra metrics.

How I’d implement this:

Monitor traces and sample failures daily.
Convert failures into dataset entries with labels and notes.
Update rubrics when the distribution changes.
Re-run evals on every prompt or model change.

This is what finally makes model switching safe.

The Payoff: Model Switching Confidence And A Minimal System To Start This Week

I keep seeing the same pattern, and it frustrates me. Teams keep paying for better models, but they stay on an old one.

They are not blocked by procurement. They are blocked by fear.

The fear is rational.

If I change the model, something might break, and I will not know until production tells me.

I call out teams still running older models because they have no way to predict breakage or to validate upgrades with confidence.

That is a reliability problem, not a model selection problem.

The fix is not a perfect test suite.

The fix is a minimal system that combines evaluations and monitoring.

Evaluations give me a regression signal on known cases.

Monitoring captures the true distribution and feeds new cases back into the eval set, so the system gets safer over time.

OpenAI frames the same workflow as eval-driven development with continuous evaluation and logging so you can grow your eval set from real traffic.

Key points I’m making:

Model upgrades feel risky when behavior is not measurable.
Monitoring plus evals turns upgrades into controlled changes.
Silent failures show up as drift in cost, policy, and quality.
A small, disciplined loop beats a large, vague framework.

How I’d implement this:

Fixed regression dataset for the core workflows that must never regress.
Rolling dataset from recent traces that reflects current traffic.
Side-by-side comparisons for model and prompt changes before rollout.
Instrument traces.
Curate 20 cases.
Implement 4 to 7 evals.
Add 2 to 3 alerts.
Weekly review and dataset refresh.
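
The side-by-side comparison reduces to a diff over per-case results. A minimal sketch, assuming each run is summarized as a hypothetical mapping from case id to pass/fail on the fixed regression dataset:

```python
def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Diff per-case pass/fail results between the current setup and a candidate."""
    regressions = sorted(
        cid for cid, passed in baseline.items() if passed and not candidate.get(cid, False)
    )
    fixes = sorted(
        cid for cid, passed in baseline.items() if not passed and candidate.get(cid, False)
    )
    return {"regressions": regressions, "fixes": fixes, "safe_to_switch": not regressions}


# Hypothetical results for the current model and a candidate model.
current_model = {"c1": True, "c2": True, "c3": False}
new_model = {"c1": True, "c2": False, "c3": True}
report = compare(current_model, new_model)
print(report)
```

Listing regressions by case id, rather than comparing two aggregate pass rates, matters because a candidate can fix one case and break another while the headline number stays flat. A case missing from the candidate run counts as a failure here, which is the conservative choice.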

If I had to boil this down: the moat is measured improvement through observability and evaluation, not prompt cleverness.

In the Age of Agentic Engineering, Context Is Your Real Product

Nilesh Barla — Sat, 28 Feb 2026 01:00:53 GMT

TLDR: AI products break in production not because the model fails, but because the context around it was never designed. This blog is for product leaders and engineers building AI features who keep shipping demos that fall apart under real users. What you’ll take away is practical: a shared vocabulary for context failures, three mental models for designing around them, and pre-launch stress test advice. The model is not your product. The context you give it is.

The Demo Always Works

A product team spends three weeks building an AI customer support agent. Internal testing goes well. The model handles edge cases, stays on topic, and generates responses that feel genuinely helpful.

Then the team ships it.

Two weeks later, the support queue fills with complaints. The agent is confidently answering questions users never fully asked. It assigns ownership to problems nobody claimed. Users stop trusting the product entirely.

What happened?
Nobody changed the model. But what broke was never examined in the first place.

Marily Nika, a former AI Product Lead at Google and Meta, watched the same sequence repeat across teams: an AI feature that worked beautifully in controlled conditions broke in production.

Why?
Because no one looked for the failure modes, which would have been visible before launch if anyone had known where to look.

Simon Willison describes the same gap from the engineering side: the bottleneck in AI development is no longer writing code. It is giving the agent the right environment to produce output that actually works.

That environment is called context. Everything that follows explains why it is your real product.

What Agentic Engineering Actually Is

Agentic engineering is the practice of building software using coding agents — tools like Claude Code, Cursor, and OpenAI Codex — where the agent generates code, executes it, runs tests, and iterates independently between turns. The human sets objectives and maintains oversight. The agent acts.

Simon Willison distinguishes this sharply from vibe coding, where you prompt, accept, and hope.

Addy Osmani puts the operational difference plainly: the single biggest differentiator is testing. A solid test suite lets an agent iterate until it passes. Without one, it declares broken code done.

That distinction reveals something structural.

The test is not just a quality check. It is a context mechanism — a precise description of what success looks like before the agent begins. Willison’s Red/Green TDD pattern makes this explicit:

Write the test first and confirm it fails.
Let the agent implement until the test passes.
The test defines the context. The agent operates within it.

Practitioners who work this way consistently arrive at the same conclusion: the model is rarely the bottleneck. The bottleneck is what the model is given to work with, i.e., the context.

The Context Problem: What Breaks AI Products

A model does not experience ambiguity the way a human does.

For instance, a human encountering a half-formed request pauses or asks for clarification.

An LLM, on the other hand, fills the gap.

It takes whatever is in its context window, finds the most plausible completion, and returns output that looks finished. The problem is not that the model is wrong. The problem is that it does not know it is wrong.

Marily Nika calls this the failure signature. Essentially, it is the pattern of breakdowns a feature reliably falls into when real users arrive.

Every AI feature has one. The teams that find it before launch deliberately push the model into its failure modes during development. The teams that do not find it discover it through support tickets.

Either way, the failure signature takes three distinct shapes:

Context overload occurs when the model receives more information than it can usefully process. Noise crowds out the signal, and the model treats everything with equal weight. A meeting notes tool fed an entire unstructured transcript will summarize the loudest voices, not the most important decisions.
Context gaps occur when the model lacks the information it needs and fills the absence with inference, usually the most probable completion. The customer support agent that confidently answers “Is this good?” without asking what “this” refers to is not malfunctioning. It is doing exactly what a model does when the context does not tell it what it does not know.
Context misalignment occurs when the model has information, but the wrong framing for the task. Marily’s Slack thread demonstration is precise here. Essentially, the model was not missing content; it was missing the framing that distinguished decisions from noise. It imposed its own structure and returned a fabricated roadmap that looked authoritative.

These are not model failures. They are design failures. Tal Raviv and Aman Khan say support tickets show a pattern of AI “forgetting” facts during sessions. This issue is called context rot.

It refers to the steady loss of reliable behavior as the context window fills up. As this happens, the model struggles to remember earlier instructions. That is not a bug to file. It is a product experience to design around.

Source: Tal Raviv and Aman Khan, “How to build AI product sense,” Lenny's Newsletter

Context Engineering Is Product Design

Context engineering is about carefully shaping what an agent observes at every step, i.e., its information environment. That way, the agent gets what it needs to think, act, and recover, and it avoids producing confident nonsense when things get tough. It is not prompt writing. Prompt writing is a sentence. Context engineering is an architecture.

That architecture works in three layers. Product leaders are making choices about these layers, even if they don’t view them as context decisions.

System instructions are the rules, constraints, and behavioral boundaries. These tell the model how to operate before any user input arrives. Marily Nika describes adding a single instruction to a Slack summarization tool: only assign an owner if someone explicitly volunteers. This immediately eliminated the product’s biggest trust issue. The fix was not a different model. It was a missing context decision.
Retrieved knowledge covers what relevant information is pulled into the model’s context at query time, how much, and how it is structured before the model sees it. Tal Raviv and Aman Khan observe that output quality improves not because the model improves but because the context improves. The model is constant. What changes is what it sees.
Memory and history determine what the agent retains across turns and between sessions. When an agent loses track of an earlier instruction mid-session, the user experiences it as the product breaking. It is a context design failure, not a model limitation.

These three layers map directly onto decisions made during every AI feature build: data access scope, system prompt structure, and when to ask a clarifying question rather than let the model infer.

Addy Osmani captures the underlying principle: agentic engineering rewards people who know what good output looks like, because they can design the environment that produces it.

Agentic engineers call this context engineering. Product leaders have always called pieces of it feature scoping, guardrail definition, and UX constraints. The vocabulary has been different. The decisions have been the same.

Three Mental Models for Product Leaders

Understanding context as the primary determinant of AI product quality changes the questions you ask at every stage of development. These three mental models make that change practical.

Ask what the model sees before asking what it can do.

The right first question is not which model handles this task best. It is what the model will actually see when a real user triggers this feature in production. Typically, that is:

A real query.
Partial context.
Unstated assumptions.
An intent the model will have to infer.

Tal Raviv and Aman Khan describe this as the core of AI product sense: anticipating what will be impactful and feasible requires understanding what the model sees at the moment it acts, not what it can do in a controlled demo.

Define Minimum Viable Quality before you define your feature.

Marily Nika establishes three thresholds every product leader should set before development begins:

Acceptable bar: Where the feature performs well enough for real users under typical conditions.
Delight bar: Where correction rates drop and the feature earns trust through consistency.
Do-not-ship bar: The failure rate at which the feature actively damages user trust.

MVQ also requires an honest cost envelope. For instance, a feature at $0.30 per user per month that drives retention is a straightforward decision. The same feature at $5 per user per month with unclear impact is a business problem that no engineering will solve.
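
The cost envelope is simple arithmetic once you have usage estimates. The sketch below is a back-of-the-envelope calculation. Every number in it (request volume, token counts, and per-million-token prices) is a made-up assumption, not a quote for any real model.

```python
def monthly_cost_per_user(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Estimated model spend per user per month, in USD."""
    per_request = (
        input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output
    ) / 1_000_000
    return requests_per_month * per_request


# Hypothetical usage and pricing, for illustration only.
cost = monthly_cost_per_user(
    requests_per_month=60,
    input_tokens=2_000,
    output_tokens=500,
    usd_per_1m_input=1.0,
    usd_per_1m_output=4.0,
)
print(f"${cost:.2f} per user per month")
```

With these assumed inputs the estimate lands at roughly $0.24 per user per month. Rerunning it with a longer context or a pricier model is a quick way to see how fast a feature moves from the first scenario above toward the second.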

Build the adversarial ritual into your launch process.

Before any AI feature ships, push it into the conditions that will break it. Marily runs three stress tests in under fifteen minutes:

Feed it chaotic input.
Give it an ambiguous request.
Assign it something deceptively hard.

What comes back is not a pass or fail. It is a product requirement — a missing constraint, an underspecified instruction, a clarifying question the UX should ask instead of letting the model infer.

Closing

Return to the team whose AI broke in production. They were not asking the wrong questions about their model. They were asking the wrong question entirely.

The question was never “what can our model do?” It was always “what does our model see?”

That change, from capability to context, is what agentic engineering worked out through practice rather than theory. Practitioners hit the walls, inspected the tool calls, watched the context window fill, and arrived at the same conclusion repeatedly: the model was not the problem.

The environment the model was operating in was.

Simon Willison, Marily Nika, Tal Raviv and Aman Khan each arrived here from different directions. The conclusion is the same.

The model is not your product. The context you give it is.

The AI Skills No One Is Teaching Product Managers (But Should Be)

Nilesh Barla — Sat, 21 Feb 2026 01:01:07 GMT

TLDR: Most PMs use AI daily but lack the judgment to use it well. This leads to decisions built on fabricated evidence. This article breaks down 8 practical skills (such as context loading, verification, and sycophancy-aware prompting) that distinguish reliable AI analysis from confident-sounding noise. Essential reading for product managers who want their AI-assisted recommendations to actually hold up under scrutiny.

Everyone Has the Tool. Almost Nobody Has the Skill

98% of product managers use AI daily, but only 39% received job-specific training on how to use it well. Or maybe that 39% tried various methods, read papers, and watched podcasts to learn the best practices.

There are many podcasts and resources that can help you hone AI for a specific workflow.

And that gap does not show up in adoption numbers. It shows up three months later, when a decision built on fabricated evidence collapses in a stakeholder review or audit.

Claude, ChatGPT, GPT-5.2, Gemini 3.1, Claude Code. The interfaces are everywhere. Every PM at a mid-size company has at least one open on their machine right now. Access was never the bottleneck. Judgment is.

Caitlin Sullivan ran the same customer transcripts through two models and received two completely different narratives.

Both were confident. Both cited participants. One cherry-picked three quotes and leapt to a recommendation. The other challenged the framing, segmented users by actual need, and flagged pricing risk with verifiable timestamps.

Same data. Same tools. Different operators.

Claude Code can run analytical scripts without manual input. GPT-5 drafts strategy memos faster than most human first drafts. Gemini 3.1 synthesizes research across dozens of sources in under a minute. These are real capabilities.

But the output quality is decided before the model runs. It is decided by how well the PM shaped the input, loaded the context, and built the habit of verifying what came back.

That is the skill layer. And almost no one is teaching it.

Why AI Analysis Fails PMs in Silence

The thing about AI is that it fails by giving the wrong output.

In other words, AI does not fail loudly.

There is no error message.

No red flag.

The output arrives clean, structured, and confident, which is exactly what makes it dangerous.

Caitlin Sullivan describes it precisely in Lenny’s Newsletter.

“These mistakes are invisible until a stakeholder asks a question you can’t answer, or a decision falls apart three months later, or you realize the ‘customer evidence’ behind a major investment actually had enormous holes.”

That is not a model failure but more of a skill failure.

Source: Caitlin Sullivan, “How to do AI analysis you can actually trust,” Lenny's Newsletter

Three things make AI analysis silently unreliable for product managers specifically:

The output always looks finished. Claude Sonnet 4.6, ChatGPT, and Gemini 3.1 do not signal uncertainty the way a junior analyst would. They return polished prose with participant citations, timestamps, and confident recommendations, regardless of whether the underlying evidence supports any of it. A well-formatted hallucination and a well-grounded insight look identical on the screen.
Pattern-matching gets mistaken for reasoning. Apple’s GSM-Symbolic research found that changing only variable names in a math problem caused LLM performance to drop by up to 10%. The model was not reasoning through the problem. It was recognizing surface patterns from training data.

Now, consider this: when a PM asks Claude to analyze churn themes, the model does not independently weigh the evidence. It finds what looks statistically probable given everything it has seen before.

Source: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Sycophancy shapes the output before the PM notices. Nielsen Norman Group found that 58% of all chatbot interactions display sycophantic behavior. If a PM mentions “pricing issues” anywhere in their prompt, the model weights toward pricing. If a PM pushes back on a theme, the model often reverses a previously correct answer. The output is already a reflection of the input’s assumptions, not an independent read of the data.

The result, as Sullivan documents, is a choose-your-own-adventure experience. Two models. Same transcripts. Different narratives. Different evidence. Different product recommendations. Each was delivered with equal confidence.

Most PMs only ever see one output. They never see what the same data looks like through a different lens, with a different prompt, on a different model. That single output becomes the evidence base for the next decision.

That is where the eight skills below begin to matter.


The 8 Skills That Actually Matter

The difference between the two outputs Sullivan showed side by side was not the model. It was the decisions made before the model ran. Each skill below addresses one of those decisions.

Prompt for Decisions, Not Just Answers

Most PMs ask AI what the data says. The better question is what to do about a specific problem given specific constraints. Product Faculty puts it plainly. “Bad prompts try to produce good answers. Great prompts try to prevent bad reasoning.”

When the prompt changes from “what are the themes?” to “given that we are deciding whether to build this feature for this user segment, what does the evidence support?”, the model has a decision to serve, not just a pattern to find.

Load Context That Actually Changes the Output

Venn diagram explaining context engineering. | Source: The New Skill in AI is Not Prompting, It’s Context Engineering

Dumping background into a prompt is not context loading. Phil Schmid of Google DeepMind documented this precisely.

“Most agent failures are not model failures anymore. They are context failures.”

Effective context has four components.

Project scope.
The specific business goal.
Product constraints.
A participant overview.

Without those four, Claude and ChatGPT default to generic analysis. With them, they answer your question instead of a version of it.
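As a rough sketch of what loading those four components can look like in practice, the Python snippet below assembles them into a preamble that goes ahead of the analysis request. The `AnalysisContext` class, its field names, and the sample values are hypothetical and only illustrate the structure; they are not taken from Sullivan's or Schmid's material.

```python
from dataclasses import dataclass


@dataclass
class AnalysisContext:
    """Hypothetical container for the four context components."""
    project_scope: str
    business_goal: str
    product_constraints: str
    participant_overview: str

    def missing(self) -> list[str]:
        # Names of any components left blank, so gaps surface before prompting.
        return [name for name, value in vars(self).items() if not value.strip()]

    def to_preamble(self) -> str:
        # Refuse to build a prompt from partial context.
        gaps = self.missing()
        if gaps:
            raise ValueError(f"Missing context: {', '.join(gaps)}")
        return "\n".join(
            f"{name.replace('_', ' ').title()}: {value.strip()}"
            for name, value in vars(self).items()
        )


# Illustrative values only.
context = AnalysisContext(
    project_scope="Twelve churn interviews from Q1",
    business_goal="Decide whether to build team-level billing",
    product_constraints="No pricing changes this quarter",
    participant_overview="Admins at 50-200 seat accounts",
)
preamble = context.to_preamble()
print(preamble)
```

Raising on a blank field is a deliberate choice in this sketch: it makes a missing component visible before the first prompt rather than after a generic answer comes back.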

Verify Before Anything Leaves the Room

Sullivan ran a verification prompt on a set of ChatGPT quotes and found that the majority were paraphrases, not the customer’s actual words.

They had participant IDs. They had timestamps. They looked authoritative. But they were not real.

The fix is a two-step habit.

Define quote rules before analysis begins.
Then run a verification pass before any output reaches a stakeholder.

This takes five minutes and catches the errors that would otherwise sit inside a strategy deck for months.
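Part of that verification pass can be automated. The sketch below is a simple exact-match check, not Sullivan's verification prompt: it flags any quote that does not appear verbatim in the named participant's transcript. The function names, the data shapes, and the sample data are hypothetical.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, unify curly quotes, and collapse whitespace so formatting
    # differences do not cause false alarms.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unverified_quotes(quotes, transcripts):
    """Return quotes that are not verbatim substrings of their transcript.

    quotes: list of (participant_id, quote) pairs taken from the AI output.
    transcripts: dict mapping participant_id to the full transcript text.
    """
    unverified = []
    for participant_id, quote in quotes:
        # An unknown participant ID yields an empty source, so the quote is flagged.
        source = normalize(transcripts.get(participant_id, ""))
        if normalize(quote) not in source:
            unverified.append((participant_id, quote))
    return unverified


# Illustrative data only.
transcripts = {
    "P1": "Honestly, the export takes forever.  I gave up twice last week.",
    "P2": "Pricing is fine. What I need is an audit log.",
}
quotes = [
    ("P1", "the export takes forever"),        # verbatim
    ("P2", "The pricing is too high for us"),  # paraphrase or fabrication
]
flagged = find_unverified_quotes(quotes, transcripts)
print(flagged)
```

An exact-match check only catches fabricated or paraphrased wording. Whether a genuine quote is used in the right context still needs a human read.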

Spot Pattern-Matching Before It Becomes a Recommendation

When AI returns a theme like “users want more reliable data,” that is almost certainly pattern-matching, not signal.

It could describe any product in any category.

Teresa Torres tested Claude against 15 interviews she had previously analyzed manually and found that Claude identified eight opportunities she missed, but also missed seven she found.

The skill here is recognizing when AI is surfacing consensus rather than insight, then pushing past it with a follow-up that asks for what is specific, contradictory, or unexpected in the data.

Use AI Across Multiple Passes, Not One

The teams that get real value from AI treat it as a thinking partner across several iterations, not a machine that produces a final answer on the first try.

LogRocket research across 18 product teams found that the teams producing the most impact were not the ones generating the most output. They were the ones using AI to challenge their own thinking at each step.

Teresa Torres took a single overloaded prompt, split it into four focused passes, and saw quality improve immediately.

That is orchestration, which is a skill, not a setting.

Match the Model to the Task

Claude Sonnet or Opus 4.6, GPT-5.2, and Gemini 3.1 are not interchangeable. Sullivan documented this after running the same analysis across all three more than 100 times.

Claude covers more ground with less pushing and is best suited for deep qualitative analysis.
Gemini delivers fewer themes but grounds them more heavily in evidence, making it reliable for research synthesis.
GPT-5 excels at stakeholder framing and communication, but is the most prone to combining quotes into plausible-sounding fabrications.

Using the wrong model for the task is not a tool problem. It is a judgment problem.

Write Prompts That Do Not Lead the Witness

A 2025 study found that 58% of chatbot interactions display sycophantic behavior, and AI models agree with users 50% more than humans do.

Mentioning “retention problems” in the prompt nudges the model to find them.

The skill is writing neutral, open-ended inputs that let signal emerge rather than confirm what you already believe. In other words, keep your own bias out of the prompt and stay curious and willing to explore.

One practical rule is to express the business goal without naming the expected answer.

Translate Output into a Recommendation, Not a Report

AI returns analysis. It does not return a decision. Shreyas Doshi’s framing applies directly here.

The PM’s role is editor, not author.

The last mile, from themes and evidence to a crisp recommendation with a clear rationale and the right level of confidence, is entirely human. That translation is where product judgment lives, and no interface automates it.

Where to Start (Without Overwhelm)

Eight skills are a lot to absorb at once. The good news is that they do not all carry equal weight at the beginning.

Start with context loading. It is the skill that immediately improves every other output without changing anything else about the workflow.

Before the next analysis session, define the project scope, the specific decision at stake, the product constraints, and who the participants are. Load those four things before the first prompt. The difference in output quality is immediate and visible. Try it.

Add verification next.

Before any AI output reaches a stakeholder, run a verification pass on the quotes and claims it contains.

This single habit protects credibility and catches the errors that confident formatting makes invisible. Sullivan’s verification prompt takes five minutes. The cost of skipping it can take months to recover from.

Once those two habits are stable, shift the prompting approach toward decisions. Replace “what does this data show?” with the specific choice the team needs to make.

That reframe naturally pulls the remaining five skills into place, because decision-focused prompts demand better context, reward iterative passes, and make pattern-matching easier to spot.

These three skills compound.

Better context produces fewer fabrications.
Fewer fabrications make verification faster.
Cleaner verified output makes the final recommendation sharper.

The Judgment Layer Is the Job

The PM who produced the trustworthy output in Sullivan’s experiment was not using a better tool. Claude, ChatGPT, and Gemini were available to both. The difference was the layer of judgment applied before, during, and after the model ran.

That layer does not come from the interface. It does not improve automatically as models get more capable. GPT-5.2 and Claude Sonnet 4.6 are more sophisticated than anything available two years ago. And the failure modes Sullivan documented are still happening daily across product teams everywhere.

Lenny Rachitsky framed the direction clearly. “The PM’s role shifts to becoming very good at knowing what data to feed AI and asking the right questions.”

Lenny's Newsletter

How AI will impact product management


That is not a peripheral skill.

That is the job.

As models get better at producing outputs that look right, the ability to judge whether they are right becomes more valuable, not less.

The eight skills in this article are not a workaround for weak models. They are the foundation for working with strong ones.

Conclusion

98% of PMs have the tool. The 39% who invest in the skill layer are the ones whose recommendations hold up in the room, whose evidence survives scrutiny, and whose decisions age well.

This gap is not closing on its own. Practice, experiment, read, and learn these techniques. Observe the differences. Find what suits your workflow, then iterate and teach others.

Investor And Venture Outlook On AI | Takeaways For Founders And Product Leaders

Arsh Shah Dilbagi — Wed, 18 Feb 2026 13:55:19 GMT

TLDR: This blog shares what investors really think about AI in 2025. The big idea: AI is still in its early days, even if it doesn’t feel that way. Just because everyone in tech is talking about AI doesn’t mean businesses are actually using it yet. Real adoption shows up in budgets, not just experiments. Many industries have barely started. The core message for founders and investors: the AI opportunity is just getting started, not winding down.

Introduction

Founder Intro: Investor and Venture Outlook on AI in 2025

There’s no shortage of opinions about AI’s future. What’s far rarer is clarity about what actually matters right now. For founders, it is about building companies; for investors, about deciding where conviction belongs.

Panel 5 was designed to cut through that noise. Rather than speculate about distant futures or abstract breakthroughs, we wanted to anchor the conversation in the realities shaping AI businesses in 2025: adoption curves, economics, org design, governance, and where durable value is actually accruing.

To do that, we brought together investors who are actively underwriting these questions across different stages, geographies, and market structures:

Lukas Linemayr, Partner at Streamlined Ventures.
Rak Garg, Partner at Bain Capital Ventures.
Tiger Gao, Investor at Apax Digital.
Zao Chen, Investor at Craft Ventures.

What emerged was a surprisingly grounded picture of the AI landscape. Yes, the market is early, but it is not empty. Yes, capital investment is massive, but revenue realization takes time. Yes, platform risk is real, but applications still capture value. And perhaps most importantly: AI has expanded the outcome space for founders rather than narrowing it.

This panel wasn’t about predicting AGI timelines or chasing the next hype cycle. It was about understanding constraints, making realistic bets, and recognizing where opportunity still hides — often in overlooked markets, unglamorous workflows, and human-heavy industries that software never fully reached.

Across the discussion, one theme stood out:

“AI changes what’s possible — not what’s required to build a real business.”

Durable companies are still built on trust, usage, distribution, and judgment. The tools are new. The fundamentals are not. The sections that follow break down how investors are thinking about value capture, revenue quality, founder profiles, governance, and scale — not as theory, but as underwriting criteria today.

If you’re building in AI and trying to decide what kind of company to build, whether venture is the right path, or where the next decade of opportunity actually lies, this panel offers a clear place to start.

1. The Market Is Early — But Not Empty

One of the most consistent refrains across the panel was a corrective to a common misconception:

AI adoption feels saturated inside tech circles — but it isn’t saturated in the real economy.

What looks crowded from within Silicon Valley looks very different when viewed across industries, geographies, and buyer maturity curves.

Inside the Bubble vs Outside the Market

Within technology ecosystems, AI can feel ubiquitous. Models are improving rapidly. New products launch weekly. Capital is flowing aggressively.

But as multiple panelists emphasized, this perspective is deeply skewed.

Outside of tech-forward companies:

Most enterprises are still experimenting.
Deployments are limited to pilots or narrow workflows.
Leadership teams are cautious.
Organizational readiness lags technical capability.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, exposure should not be confused with adoption. Awareness is high. Actual usage at scale is not.

Budgets Tell the Real Story

Several panelists pointed to a simple reality check: budget allocation.

Despite the attention AI receives, AI spend remains a small fraction of overall enterprise budgets. In most organizations, it competes with:

Legacy software commitments.
Infrastructure modernization.
Security and compliance spend.
Headcount and services.

As Rak Garg, Partner at Bain Capital Ventures, emphasized, real adoption shows up in sustained budget line items, not experimentation funds. By that measure, most enterprises are still in early innings.

Consumer Adoption Is Uneven, Not Universal

The panel also pushed back on the idea that consumer AI adoption is “done.”

While some products have achieved massive usage, adoption remains:

Uneven across geographies.
Concentrated among power users.
Fragmented by use case.
Highly sensitive to trust and clarity.

As Tiger Gao, Investor at Apax Digital, pointed out, consumer behavior varies dramatically outside of early-adopter markets. What feels mainstream in one region can be niche in another.

This unevenness suggests opportunity — not saturation.

Entire Industries Are Barely Started

Perhaps the most important insight was how many sectors have barely begun meaningful AI deployment. Industries like healthcare, manufacturing, logistics, financial operations, and regulated services face constraints that slow down adoption:

Compliance requirements.
Legacy systems.
Data fragmentation.
Cultural resistance.

As Zao Chen, Investor at Craft Ventures, noted, these constraints don’t eliminate opportunity; they delay it. And delayed markets often end up being the largest ones.

Capital ≠ Product-Market Fit

A key clarification from the panel was that capital investment should not be mistaken for market maturity.

Yes, enormous amounts of capital have flowed into AI. No, that does not mean product-market fit is solved.

At-scale PMF:

Is still forming.
Looks different by industry.
Requires integration, not just intelligence.
Unfolds over years, not quarters.

Many AI products are still searching for repeatable, durable deployment patterns.

Diffusion Has Just Begun

This led to the panel’s core takeaway:

Today’s traction does not represent peak penetration.
It represents the beginning of diffusion.

We are early in the curve where:

Workflows are being discovered.
Buyers are learning how to buy.
Organizations are learning how to deploy.
Trust is still being earned.

For founders and investors alike, this reframes the opportunity.

The market isn’t empty. But it’s far from full.

The Practical Takeaway

AI may feel late-stage if you only look at demos, headlines, and funding rounds.

But if you look at:

Real usage.
Real budgets.
Real deployment.
Real behavior.

The conclusion is clear: we’re still at the beginning of adoption, not the end.

For companies that can survive the experimentation phase and earn trust at scale, the next wave of growth is still ahead.

2. AGI Debates Matter Less Than Near-Term Constraints

AGI and superintelligence inevitably came up during the panel, but notably, they were treated as context, not catalysts.

The investors were aligned on a simple point:

AGI debates are intellectually interesting, but near-term constraints determine outcomes.

AGI Is a Moving Target

One of the first issues raised was definitional.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, there is no stable, shared definition of AGI. What qualifies as “general” varies by speaker, by benchmark, and by moment in time.

This makes AGI a poor anchor for:

Investment decisions.
Company strategy.
Product roadmaps.

If the goalposts keep moving, progress becomes impossible to evaluate meaningfully.

Reasoning Exists — But Only Inside Boxes

The panel acknowledged real advances in multi-step reasoning.

Models today can:

Chain logic.
Follow structured plans.
Solve complex problems within constrained domains.

But that constraint is doing the real work.

As Rak Garg, Partner at Bain Capital Ventures, emphasized, reasoning degrades rapidly once systems leave controlled environments. Outside of well-scoped tasks, models struggle with ambiguity, long-horizon execution, and accountability.

This gap matters far more than abstract intelligence scores.

Autonomy Is Bottlenecked by the World, Not Models

Another key insight was that autonomy isn’t limited by model capability alone.

It’s bottlenecked by:

Messy real-world environments.
Poor or fragmented data.
Limited feedback loops.
Immature reinforcement learning systems.

As Tiger Gao, Investor at Apax Digital, pointed out, intelligence without grounding doesn’t scale. The world is not a clean API. Until systems can reliably sense, act, and learn in open environments, autonomy will remain constrained regardless of model improvements.

Timelines Are Longer Than the Discourse Suggests

The panel was notably conservative on timelines.

Not pessimistic, but realistic.
Breakthroughs will happen.
Capabilities will improve.
New classes of applications will emerge.

But as Zao Chen, Investor at Craft Ventures, noted, the gap between lab demos and reliable deployment is often measured in years, not months. Underestimating how long that gap takes to close is one of the fastest ways to make bad bets.

Investors Underwrite Constraints, Not Possibility

This led to a shared investment posture.

While AGI-level outcomes may shape long-term narratives, investors operating today underwrite constraints:

Where models fail.
Where workflows break.
Where adoption stalls.
Where economics don’t pencil.

Near-term success depends on navigating these limitations and not assuming they’ll disappear.

Founders who build as if constraints are permanent often outperform those betting on imminent breakthroughs.

The Practical Takeaway

AGI debates will continue — and they matter for long-term vision.

But in 2025:

Constraints drive outcomes.
Environments matter more than intelligence.
Deployment beats demos.
Realism beats speculation.

For builders and investors alike, the message was clear:

The next wave of value won’t come from waiting for AGI. It will come from building durable businesses inside today’s limits while expanding those limits over time.

3. Massive CapEx Does Not Automatically Equal Massive Revenue

One of the most candid discussions on the panel centered around a growing tension in the AI ecosystem:

Infrastructure spending has exploded, but revenue realization is still catching up.

This disconnect is real, and it matters.

Infrastructure Spend Is Front-Loaded by Design

The panel acknowledged the obvious headline: AI has triggered one of the largest infrastructure buildouts in modern tech history.

Compute.
Data centers.
Specialized hardware.
Energy commitments.

As Rak Garg, Partner at Bain Capital Ventures, noted, this level of CapEx is unprecedented outside of telecom or cloud hyperscalers. But unlike traditional software, AI infrastructure must be built ahead of demand.

This makes early financials look distorted — not broken.

Revenue Exists — Just Not in Proportion Yet

A key nuance the panel emphasized was that AI revenue is real and growing quickly.

Some AI applications are:

Growing faster than any prior software category.
Achieving meaningful ARR at early stages.
Demonstrating strong willingness to pay.

As Lukas Linemayr, Partner at Streamlined Ventures, pointed out, aggregate AI ARR across the ecosystem is already substantial.

What it is not yet is proportional to the infrastructure being built to support future demand.

That gap is expected and temporary.

Monetization Lags Capability

Another consistent insight was that monetization always lags technical capability.

Models improve first.
Use cases emerge next.
Business models stabilize last.

As Tiger Gao, Investor at Apax Digital, explained, AI creates value before it captures value. It takes time for:

Buyers to understand ROI.
Pricing models to normalize.
Procurement processes to adapt.
Budgets to shift meaningfully.

This lag is not unique to AI, but the scale makes it more visible.

CapEx Absorption Takes Time

The panel converged on a clear expectation:

CapEx absorption will take years, not quarters.

Infrastructure will be amortized over long time horizons.

Revenue will arrive unevenly.

Some segments will monetize faster than others.

As Zao Chen, Investor at Craft Ventures, emphasized, this doesn’t imply poor returns — it implies patience. Investors expecting immediate proportionality between spend and revenue are misreading the cycle.

Uneven Returns Are a Feature, Not a Bug

Another important point was that returns will not be distributed evenly.

Some layers will:

Capture outsized value early.
Show strong unit economics.
Justify spending quickly.

Others will:

Struggle to monetize.
Remain infrastructure-heavy.
Consolidate over time.

This unevenness is characteristic of platform shifts, not a sign of failure.

The Practical Takeaway

Massive CapEx is not yet proof of massive revenue.

But it is a prerequisite for it.

The panel’s consensus was grounded but optimistic:

Revenue is coming.
Monetization is forming.
Timelines are longer than hype suggests.

For investors and founders alike, the message was clear:

Don’t confuse delayed returns with absent returns.
The AI buildout is early — and uneven by design.

4. Value Accrues to Applications, Not Foundations

One of the strongest points of alignment across the panel was a lesson the industry has learned repeatedly:

Platforms enable value.
Applications capture it.

AI does not break that pattern; it reinforces it.

History Rhymes — Even When Technology Changes

The panel situated AI within a familiar historical arc.

In prior platform shifts:

Operating systems enabled software companies.
Cloud infrastructure enabled SaaS.
Mobile platforms enabled app ecosystems.

In each case, the enabling layer was essential — but the enduring value accrued to the application layer.

As Rak Garg, Partner at Bain Capital Ventures, emphasized, AI follows the same economic logic. Infrastructure makes new behavior possible. Applications turn that possibility into revenue.

Foundations Are Necessary — and Brutal

The panel was clear-eyed about the difficulty of foundation-layer businesses.

Chips, models, and infrastructure are:

Capital-intensive.
Technically complex.
Strategically critical.

But they are also:

Highly competitive.
Subject to commoditization.
Constrained by margin pressure.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, the model layer increasingly resembles cloud infrastructure wars — massive scale advantages, few winners, and brutal economics for everyone else.

These businesses matter — but they are structurally hard to own as long-term value capture plays.

Applications Control the Customer

What applications uniquely possess is the user relationship.

Applications own:

Workflow integration.
Daily usage.
Customer trust.
Switching costs.

As Tiger Gao, Investor at Apax Digital, pointed out, this control translates directly into pricing power. Users pay for outcomes, not for abstractions.

When models improve, applications benefit without having to rebuild trust from scratch.

Differentiation Lives Above the Model

Another key point was that models converge faster than experiences.

Model performance gaps compress.
APIs standardize.
Capabilities diffuse.

Applications differentiate by:

Domain expertise.
Workflow design.
Data context.
User experience.
Operational integration.

As Zao Chen, Investor at Craft Ventures, emphasized, durable defensibility emerges from how AI is applied — not from the intelligence itself.

Margins Expand Up the Stack

The panel also highlighted a familiar economic pattern:

Margins expand as you move closer to the user.
Infrastructure margins are constrained by cost curves.
Model margins are pressured by competition.
Application margins grow through differentiation and pricing power.

This doesn’t diminish the importance of foundational layers — but it clarifies where sustained value capture occurs.

The Practical Takeaway

AI infrastructure enables the future.

Applications monetize it.

For founders, this means:

Obsessing over workflows, not models.
Owning user trust and integration.
Building differentiation above the foundation.

For investors, it reinforces a familiar truth:

The largest, most durable outcomes are still built at the application layer, even in an AI-first world.


5. Platform Risk Is Real — But Not Fatal

The panel didn’t avoid one of the most sensitive topics in AI investing:
platform risk is real.

Model providers are moving downstream.
APIs are evolving.
Feature parity is increasing.

But the consensus view was notably pragmatic — not alarmist.

Tension Is Inevitable in Platform Shifts

As platforms mature, they naturally look for ways to monetize.

That often means:

Expanding feature sets.
Offering more opinionated tools.
Encroaching on application territory.

As Rak Garg, Partner at Bain Capital Ventures, noted, this tension is not unique to AI. It showed up in cloud, mobile, and SaaS before.

Platforms and applications coexist — sometimes uneasily — because they serve different economic roles.

API Risk Is a Known Variable

Several panelists acknowledged legitimate concerns around:

Access changes.
Pricing shifts.
Deprecations.
Policy updates.

As Lukas Linemayr, Partner at Streamlined Ventures, pointed out, APIs are dependencies — not guarantees. Smart teams model this risk explicitly rather than pretending it doesn’t exist.

Platform risk becomes fatal only when it’s ignored.

Differentiation Isn’t in the Model

The panel repeatedly returned to where applications actually win.

Apps differentiate through:

Workflow design.
Domain expertise.
Product taste.
Brand and trust.
Customer relationships.

As Tiger Gao, Investor at Apax Digital, emphasized, platforms optimize for breadth. Applications win through depth.

That depth is hard to replicate — even for the platform itself.

Competition Reshapes Opportunity

One of the more grounded insights was that competition doesn’t eliminate opportunity; it reshapes it.

When platforms move downstream:

They validate demand.
They educate the market.
They raise baseline expectations.

This often creates new whitespace for more specialized, higher-quality applications.

As Zao Chen, Investor at Craft Ventures, noted, many successful SaaS companies were built after platforms entered adjacent spaces — not before.

Risk Is a Pricing Input, Not a Stop Signal

The panel ultimately framed platform risk the same way investors do:
As a factor to price in, not a reason to walk away.

Founders who understand their dependency surface, design for portability, own the customer relationship, and build real differentiation can survive — and even benefit from — platform competition.

The Practical Takeaway

Platform risk in AI is real.
But it’s not new.
It’s not fatal.
And it’s not a reason to avoid building.

The companies that win:

Acknowledge the risk.
Design around it.
Differentiate beyond the platform.
Move faster than incumbents.

In AI, as in every platform shift before it, value accrues to teams that build where platforms can’t — not where they can.

6. “Quality of Revenue” Now Matters at Seed

One of the clearest shifts highlighted by investors was the earlier evaluation of revenue.

In prior cycles, seed revenue was rare and often enough on its own.

In AI, revenue shows up earlier.

That changes the bar.

Revenue Is Easier to Generate — and Easier to Misread

AI has dramatically compressed time-to-revenue.

Teams can:

Ship quickly.
Demo convincingly.
Monetize early interest.
Close initial contracts faster than ever.

But as multiple panelists emphasized, early revenue is no longer synonymous with a real business.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, the question is no longer “Do you have revenue?” — it’s “What kind of revenue is this?”

The New Questions Investors Ask

Across the panel, investors described a sharper line of inquiry at seed and Series A.

They want to understand:

Durability: Does usage persist after novelty fades?
Depth: Are customers relying on the product, or just experimenting?
Repeatability: Does demand recur, or is it opportunistic?
Expansion: Is there a credible path from $10M to $100M to public markets?

As Rak Garg, Partner at Bain Capital Ventures, emphasized, investors are increasingly underwriting trajectory, not just traction.

Novelty Masks Weak Signals

Several panelists warned that AI novelty can distort early metrics.

Short-term spikes may reflect:

Curiosity.
Experimentation budgets.
Executive mandates.
Fear of missing out.

As Tiger Gao, Investor at Apax Digital, pointed out, these signals look strong in dashboards — but decay quickly if the product doesn’t earn its place in a workflow.

Retention, not activation, tells the real story.

Usage Reveals Business Reality

A recurring theme was that usage behavior is more informative than revenue timing.

Investors look closely at:

Frequency of use.
Depth of engagement.
Reliance during critical moments.
Behavior when the product fails.

As Zao Chen, Investor at Craft Ventures, noted, strong businesses show resilience. Customers return even when things break. Weak ones disappear quietly.

Revenue without usage conviction is fragile.

Scale Tests Everything

Another important point was that scaling reveals quality quickly.

Many AI products can reach $1–5M in ARR through:

Founder-led sales.
Bespoke deployments.
Heavy services.
Early adopter enthusiasm.

The real question is whether the business can:

Standardize delivery.
Reduce marginal cost.
Survive broader scrutiny.
Scale distribution without collapsing economics.

As the panel emphasized, the path from $10M to $100M remains the true test—and AI has not shortened it.

Time-to-Business Maturity Hasn’t Changed

This led to one of the panel’s most grounded conclusions:

AI has compressed time-to-revenue.
It has not compressed time-to-business maturity.

Trust still takes time.

Habits still take time.

Markets still take time.

No model shortcut changes that.

The Practical Takeaway

Revenue is necessary — but no longer sufficient.

For founders:

Focus on usage durability, not just monetization.
Optimize for reliance, not novelty.
Build businesses that survive attention decay.

For investors:

Early revenue is a starting point for diligence, not the end.

In an AI-first world, the quality of revenue matters earlier because it’s easier than ever to get the wrong kind.

7. Taste, Brand, and Community Are Emerging Moats

One of the more surprising — and strongly aligned — themes across the panel was how much intangible moats now matter in AI.

In fact, the investors suggested they may matter more than in traditional SaaS.

Feature Parity Is the New Default

As models converge and capabilities diffuse, feature parity arrives faster than teams expect.

What once felt differentiated — reasoning quality, speed, and output polish — now quickly becomes the baseline.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, when technical advantages compress, competition shifts up the stack — toward how products feel, not just what they do.

Taste Creates Coherence

The panel framed taste not as aesthetics, but as coherence.

Taste shows up in:

Which problems are chosen.
Which features are excluded.
How workflows are structured.
How the product behaves under stress.

As Rak Gard, Partner at Bain Capital Ventures, emphasized, taste is what makes a product feel intentional rather than accidental. In AI products, where outputs are probabilistic, that sense of intention is deeply reassuring.

Coherence builds confidence.

Confidence builds habit.

Brand Is a Trust Shortcut

Brand also took on a more functional meaning in the discussion.

In AI, brand is not about awareness — it’s about trust compression.

As Tiger Gao, Investor at Apax Digital, pointed out, when users don’t fully understand how a system works, they rely on signals. Brand becomes a shortcut for:

Reliability.
Alignment.
Safety.
Intent.

In uncertain environments, trusted brands reduce friction in adoption and earn forgiveness in the face of failure.

Community Multiplies Distribution and Retention

Community was discussed not as engagement, but as leverage.

Strong communities:

Normalize uncertainty.
Spread best practices.
Reinforce identity.
Accelerate onboarding.

As Zao Chen, Investor at Craft Ventures, noted, community transforms products from tools into shared experiences. That shift increases retention and turns users into distributors.

Community doesn’t lock users in technically — it locks them in emotionally.

Switching Costs Are Becoming Emotional

Perhaps the most important reframe was around switching costs.

In AI, switching costs are often low technically:

Data can be exported.
Integrations are portable.
Models are interchangeable.

But switching costs are high emotionally.

People stick with products they:

Trust.
Identify with.
Feel understood by.
Have invested in learning.

As the panel emphasized, these costs aren’t enforced — they’re felt.

Moats You Can’t Diagram

The panel acknowledged that taste, brand, and community are harder to quantify than traditional moats.

But that doesn’t make them weaker.

In fact, they’re often:

Slower to build.
Harder to copy.
More durable over time.

As one investor summarized, competitors can clone features in months. They can’t clone trust, coherence, or belonging on the same timeline.

The Practical Takeaway

In an AI world defined by rapid convergence, the strongest moats are increasingly human.

They live in:

Product judgment.
Emotional resonance.
Shared identity.
Trust built over time.

For founders, this means:

Investing in coherence early.
Treating brand as infrastructure.
Designing community intentionally.

For investors, it reframes defensibility.

The most durable moats may no longer be enforced by code; they’re earned through experience.

8. Founder Profiles Are Expanding, Not Narrowing

One of the most encouraging conclusions from the panel was that the founder archetype is broadening in the AI era. Rather than narrowing the set of people who can build venture-scale companies, AI is expanding it.

The Old Pattern Is Breaking

Historically, venture-backed success clustered around a familiar profile:

Elite technical pedigree.
Prior big-tech experience.
Access to capital and networks.
Long lead times to build.

The panel agreed that this pattern is weakening.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, AI dramatically lowers the cost of experimentation. Founders no longer need massive teams or years of infrastructure work to reach meaningful traction.

This opens the door to a much broader set of builders.

Younger Founders Are Succeeding Earlier

Several investors pointed out that founders are reaching real scale earlier in their careers.

AI allows:

Faster iteration.
Quicker feedback from the market.
Earlier revenue.
More compressed learning cycles.

As Rak Gard, Partner at Bain Capital Ventures, emphasized, velocity now matters more than a resume. Teams that learn quickly often outperform those with deeper credentials but slower adaptation.

Domain Expertise Is Rising in Importance

Another major shift discussed was the increasing value of deep domain knowledge.

In many AI categories:

The hard part isn’t building intelligence.
It’s understanding the workflow.
Navigating edge cases.
Earning trust in complex environments.

As Tiger Gao, Investor at Apax Digital, pointed out, founders with lived experience in a problem domain often have sharper product intuition than technically elite generalists.

Knowing what shouldn’t be automated is often more valuable than knowing how to automate everything.

Adaptability Is the New Core Skill

The panel was unified on one point: AI rewards founders who adapt continuously.

Successful founders today must:

Navigate constant model changes.
Reassess architectural decisions regularly.
Update mental models frequently.
Make decisions with incomplete information.

As Zao Chen, Investor at Craft Ventures, noted, the ability to revise beliefs quickly has become a defining trait. Rigid thinkers struggle in environments where assumptions expire every quarter.

Opinionated Thinking Matters More Than Credentials

Another subtle but important theme was the value of opinionated judgment.

With so many tools, models, and paths available, founders who have clear points of view, make decisive tradeoffs, resist chasing every trend, and articulate why they believe something tend to move faster and build more coherent companies.

Pedigree may open doors, but judgment keeps companies alive.

The Founder Archetype Is Broadening

Taken together, the panel painted a clear picture:

There is no single “ideal” AI founder.

Instead, the market rewards:

Speed over seniority.
Learning over lineage.
Judgment over credentials.
Adaptability over perfection.

This is a structural shift — not a temporary one.

The Practical Takeaway

AI is not concentrating opportunity. It’s distributing it.

For founders, this is a call to lean into:

Lived experience.
Clear thinking.
Fast learning.
Strong opinions.

For investors, it means expanding pattern recognition — not narrowing it.

In the AI era, the founders who win won’t all look the same, and that’s a feature, not a bug.

9. Venture-Backed Is a Choice — Not a Default

One of the most refreshingly candid moments in the panel came when the conversation turned to founder paths.

The investors were aligned on a point that’s often left unsaid:

Not every great AI business should be venture-backed.

And that’s not a failure — it’s a feature of the moment we’re in.

AI Has Changed the Economics of Building

AI has dramatically lowered the cost of starting companies.

Founders can now:

Build sophisticated products with small teams.
Reach customers directly.
Generate revenue early.
Operate profitably at smaller scales.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, this fundamentally expands the set of viable outcomes. Venture is no longer the only path to building something meaningful — or enduring.

Niche, Profitable Businesses Are More Viable Than Ever

Several panelists highlighted how AI enables high-quality, niche businesses.

These companies:

Serve specific audiences deeply.
Operate with strong margins.
Grow sustainably.
Don’t require hypergrowth.

As Tiger Gao, Investor at Apax Digital, pointed out, many of these businesses would have struggled to exist a decade ago. Today, they can thrive — and founders can own more of the upside.

Scale isn’t the only measure of success.

Community Enables Profitable Distribution

Another enabling factor discussed was the rise of community-driven distribution.

Strong communities allow companies to:

Reach users directly.
Reduce CAC dramatically.
Build trust faster.
Monetize without heavy spend.

As Zao Chen, Investor at Craft Ventures, noted, community doesn’t just support growth — it supports profitability. For many AI products, that changes the calculus entirely.

Venture Comes With Constraints

The panel was also clear about what venture capital demands.

Venture-backed paths require:

Chasing very large markets.
Tolerating higher risk.
Optimizing for scale over stability.
Committing to outcomes that justify dilution.

As Rak Gard, Partner at Bain Capital Ventures, emphasized, venture is best suited for companies willing to pursue problems that are structurally large — often adjacent to, but not dependent on, AGI-level breakthroughs.

It’s a powerful tool — but it narrows the problem space.

Choosing Venture Means Choosing the Problem

One of the most important reframes was that venture is not just a financing choice — it’s a product choice.

It implicitly commits founders to:

A certain growth rate.
A certain market size.
A certain risk profile.

Founders who don’t want those constraints shouldn’t feel compelled to accept them.

As the panel underscored, opting out of venture isn’t opting out of ambition — it’s opting into a different kind of ambition.

AI Expands the Outcome Space

The broader conclusion was optimistic.

AI doesn’t funnel founders into a single path. It multiplies the paths available.

Some companies should:

Raise aggressively.
Chase massive markets.
Take on existential risk.

Others should:

Stay small and profitable.
Serve communities deeply.
Compound quietly over time.

Both are valid. Both can be impactful.

The Practical Takeaway

AI lowers the cost of building — but it doesn’t dictate how you should build.

Venture-backed is no longer the default. It’s a choice.

The best founders don’t ask:

“Can this raise venture?”

They ask:

“What kind of company do I want to build — and what path best supports that?”

In an AI-first world, freedom of choice is one of the most powerful new advantages founders have.

10. Huge Markets Remain Underserved

Despite how crowded parts of the AI landscape appear, the panel was emphatic on one point: Many of the largest opportunities aren’t crowded at all. They’re simply overlooked.

Silicon Valley Sees a Narrow Slice of the Economy

The panel highlighted a structural blind spot in how markets are perceived.

Inside tech ecosystems, attention clusters around:

Developer tools.
Knowledge work productivity.
Media and content.
Obvious white-collar workflows.

But as Zao Chen, Investor at Craft Ventures, noted, these categories represent a small fraction of global economic activity.

Outside that bubble sit enormous industries that are:

Operationally complex.
Heavily manual.
Under-softwared.
Resistant to prior automation.

These sectors don’t appear on demo days, but they dominate real GDP.

Service Industries Are Still Software-Poor

Several investors emphasized how many service-heavy industries remain untouched by modern software.

Examples discussed included:

Field services.
Logistics coordination.
Healthcare operations.
Compliance-heavy workflows.
Back-office functions in regulated industries.

As Rak Gard, Partner at Bain Capital Ventures, pointed out, many of these markets were poor fits for traditional SaaS. The workflows were too fragmented, too judgment-heavy, or too expensive to automate manually.

AI changes that calculus.

AI Enables Automation Where Software Never Reached

The panel stressed that AI’s most powerful impact may not be where software already exists — but where it never could.

AI can:

Handle ambiguity.
Adapt to messy inputs.
Support human judgment.
Operate across inconsistent processes.

As Tiger Gao, Investor at Apax Digital, explained, this opens entirely new categories. Work that was previously uneconomical to software-enable suddenly becomes tractable.

The opportunity isn’t a marginal improvement. It’s first-time automation.

Visibility, Not Ideation, Is the Bottleneck

Another important reframing was around innovation itself.

The panel rejected the idea that success requires discovering a “new” idea. Instead, it requires:

Seeing existing problems clearly.
Understanding how work actually happens.
Recognizing where human labor is trapped by process.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, many of the biggest AI companies of the next decade won’t feel novel to insiders. They’ll feel obvious — once someone finally builds them.

Underserved Markets Often Look Unattractive Early

One reason these markets remain open is that they rarely look attractive at first glance. They:

Lack clean APIs.
Involve legacy systems.
Require domain expertise.
Don’t fit standard growth narratives.

But as the panel emphasized, these same traits often signal durability. Once solved, these problems create:

High switching costs.
Deep customer reliance.
Long-term contracts.
Real economic impact.

The Practical Takeaway

AI opportunity isn’t concentrated only where attention is loudest. It’s often hiding in:

Invisible workflows.
Neglected industries.
Unglamorous services.
Problems people stopped trying to solve.

The panel’s closing reframe was simple but powerful:

The opportunity is not finding a new idea; it’s seeing an old problem clearly for the first time.

For founders willing to look beyond the obvious, the AI market is still wide open.

11. Hiring and Org Design Are Still Bottlenecks

One of the most pragmatic points the panel made was also one of the least glamorous: AI does not eliminate organizational bottlenecks. It often exposes them.

Despite dramatic gains in technical capability, the fundamentals of building and scaling companies remain stubbornly human.

AI Doesn’t Replace Go-To-Market Reality

The panel was explicit that AI does not remove the need for:

Selling.
Onboarding.
Change management.
Domain translation.
Forward-deployed work.

As Rak Gard, Partner at Bain Capital Ventures, noted, many AI companies underestimate how much of the work happens outside the model. Especially in enterprise and regulated markets, trust must still be earned person by person.

Models don’t close deals. People do.

Non-Technical Roles Matter More Than Expected

A recurring surprise for many founders is how critical non-coding roles remain. They become essential when:

Sales cycles are long.
Buyers are non-technical.
Workflows are entrenched.
Adoption requires behavior change.

As Zao Chen, Investor at Craft Ventures, emphasized, AI products often increase the need for translation — not reduce it. Someone still has to explain what the system does, where it works, where it doesn’t, and how to integrate it safely.

That work doesn’t disappear. It shifts.

Forward-Deployed Humans Are Often the Unlock

Several panelists pointed out that forward-deployed teams are not a sign of weakness — they’re often a sign of realism.

In complex environments, humans:

Adapt to messy workflows.
Handle exceptions.
Earn trust in high-stakes settings.
Surface product gaps quickly.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, many successful AI companies scale through forward-deployed work before they scale away from it. The mistake is treating these roles as temporary hacks instead of strategic leverage.

Org Design Determines Where AI Actually Scales

Another key insight was that organizational design determines where AI leverage shows up.

Teams that struggle often:

Over-index on engineers.
Under-invest in GTM and enablement.
Assume automation replaces coordination.
Delay hiring for customer-facing roles.

As Tiger Gao, Investor at Apax Digital, pointed out, this creates a mismatch: powerful technology paired with insufficient human scaffolding. Adoption stalls not because the product is weak — but because the org can’t support it.

Leverage Comes From Deploying Humans Intentionally

The panel emphasized that winning teams don’t eliminate humans; they deploy them strategically. They:

Put humans where judgment matters most.
Automate where repetition dominates.
Keep humans close to customers early.
Pull them back only once patterns stabilize.

This isn’t inefficient. It’s how learning compounds.

The Practical Takeaway

AI changes what humans do, not whether they’re needed.

The companies that win:

Design orgs around real-world adoption.
Hire for translation, trust, and judgment.
Accept that some work cannot be automated early.
Deploy humans where leverage is highest.

In an AI-first world, technology scales fastest when organizations are designed to support it.

Ignoring hiring and org design doesn’t make them go away. It just turns them into silent bottlenecks.

12. Governance Will Emerge Bottom-Up, Not Top-Down

When the conversation turned to regulation and governance, the panel aligned around a view that was notably pragmatic:

Governance will not arrive first through policy.
It will emerge through products.

This isn’t ideological — it’s observational.

Regulation Will Always Lag Innovation

The panel was clear that regulation inevitably trails technology.

AI is moving too quickly for:

Comprehensive legislation.
Globally consistent standards.
Real-time regulatory oversight.

As Lukas Linemayr, Partner at Streamlined Ventures, noted, this lag is not a failure of regulators — it’s a structural reality. By the time rules are written, the underlying technology has already shifted.

Waiting for regulation to define governance is therefore unrealistic.

Governance Will Be Built, Not Declared

Instead, governance is emerging bottom-up, through tooling and infrastructure.

The panel emphasized that real governance is operational, not philosophical.

It shows up as:

Auditability.
Observability.
Access controls.
Permissions.
Rollback mechanisms.
Monitoring and logging.

As Rak Gard, Partner at Bain Capital Ventures, explained, these capabilities allow organizations to manage risk before regulation requires it. They become de facto standards because they work — not because they’re mandated.

Trust Is Earned Through Control, Not Promises

Another recurring theme was that trust cannot be asserted.

In AI systems, trust is earned when:

Behavior is observable.
Decisions can be inspected.
Failures are traceable.
Systems can be constrained.

As Tiger Gao, Investor at Apax Digital, pointed out, customers don’t want assurances — they want mechanisms. Products that offer real control are adopted faster than those that simply claim safety.

Compliance Will Be Solved Inside Products

The panel also reframed compliance as a product problem.

Rather than external enforcement, compliance will increasingly be achieved through:

Built-in controls.
Clear boundaries.
Configurable policies.
Embedded audit trails.

As Zao Chen, Investor at Craft Ventures, noted, the most successful AI products treat compliance as an enabling feature — not an afterthought. When compliance is integrated, adoption accelerates instead of slowing.

Tooling Creates De Facto Standards

Over time, the panel expects governance norms to crystallize around what works in practice.

Tools that reduce risk, improve transparency, and support accountability will spread organically across companies, industries, and geographies.

These tools become standards not because they’re required, but because they’re indispensable.

The Final Takeaway

AI governance won’t arrive as a single policy moment.

It will emerge gradually, through:

Observability layers.
Control systems.
Audit tooling.
Product-level constraints.

Trust, safety, and compliance will be built into systems, not bolted on by regulators after the fact.

In the AI era, the companies that define governance will be the ones that operationalize it first — long before anyone tells them they have to.

Claude Opus 4.6 vs GPT-5.3 Codex: Which AI Coding Model Should You Use?

Nilesh Barla — Sat, 14 Feb 2026 01:00:45 GMT

TLDR: This blog compares Claude Opus 4.6 and GPT 5.3 Codex in the only way that holds up in production. It treats them as different roles, not rivals. You will learn when to use Opus for architecture, deep context, and repo-wide refactors, and when to use Codex for terminal-driven iteration, bug fixes, and test writing. It explains the context tradeoff between large prompts and retrieval, the cost reality that changes defaults, and a hybrid workflow that plans with Opus, executes with Codex, then audits with Opus. You will leave with routing rules you can apply immediately.

Watching Peter Steinberger talk through Claude Opus 4.6 and GPT 5.3 Codex clarified why this comparison keeps producing disagreement. He describes Codex as the model that reads more by default and stays reliable even when it feels dry, while Opus can run ahead unless you push it into a planning posture.

He also ties modern coding to the command line and explains why terminal fluency matters once agents start running loops for you. That combination pushed me to research roles, not rankings, and to write a guide that routes work by scope and risk.

Claude Opus 4.6 vs GPT-5.3 Codex: Quick Summary

On February 5, 2026, the AI coding landscape changed in a very specific way. Anthropic shipped Claude Opus 4.6, and OpenAI shipped GPT 5.3 Codex on the same day.

The first reaction was confusion. Benchmarks pointed in one direction. Hands-on testing pointed in another. People were looking at the same two models and drawing different conclusions, which is a signal that the comparison is being framed incorrectly.

This article uses a simple hiring lens so you can pick the right tool without arguing about winners. Claude Opus 4.6 behaves like a senior architect. It slows down, asks for more context, and spends tokens thinking before it commits to a plan. That deliberation often produces cleaner designs and fewer rewrites when the problem is structural.

GPT 5.3 Codex behaves like a hyperproductive intern. It moves quickly, makes changes early, runs loops, and stays close to the terminal and the feedback cycle. It will break things, notice the break, and patch them in the next pass.

For a focused comparison of the coding agents specifically, see Claude Code vs OpenAI Codex.

Greg Isenberg captured this as a split between reasoning and momentum. Once you see it that way, the question becomes which role you are hiring for on this task.

What Claude Opus 4.6 Is Best For: Architecture & Reasoning

Claude Opus 4.6 is strongest when the task begins with uncertainty and ends with a coherent design. You see this when the codebase is large, the constraints are fuzzy, and the right answer depends on keeping many moving parts consistent across files.

Anthropic calls this adaptive thinking, a mode in which the model spends time reasoning before it writes.

That deliberation shows up as fewer wrong turns, fewer patch cycles, and fewer hidden contradictions later in the build.

The long context capability matters for the same reason. A large context window is not only about reading more text. It changes how the model constructs its mental representation of the repository.

Opus 4.6 supports 200K tokens, and a 1M token context window is available in beta on the Claude Developer Platform. With enough context, it can track relationships across modules, data flow assumptions, and naming conventions without constantly re-fetching or re-explaining them.

This is why Opus is a good fit for greenfield work that still has real complexity.

Think of an authentication system with roles, session rotation, and audit logging, or a 3D floor plan generator with a geometry pipeline and export formats. The model has to choose an architecture before it chooses syntax.

Alex Carter’s 48-hour deep dive captured the same pattern in a concrete test. He reports that Opus produced a fully functional Kanban board with working drag-and-drop and clean state management on the first attempt, while Codex failed on authentication logic in the comparable build.

The tradeoff is cost. The deliberation phase consumes tokens, but it often buys you fewer bugs that only appear after you have shipped.

What Is GPT-5.3 Codex Best For?

If I were to answer that question in three words, it would be “The Speed Demon.”

GPT 5.3 Codex is strongest when the work has a tight feedback loop, and you want the loop to run without supervision.

It behaves more like an operator than a planner. You give it a concrete task, it tries something, it runs the command, it reads the error, then it tries again. That rhythm matters because a large share of day-to-day engineering is not design.

It is repeated compilation, failed tests, missing dependencies, and small fixes that only become obvious after you execute the code.

Source: OpenAI

Terminal Bench 2.0 captures this bias toward command line competence. Codex scores 77.3 percent on that evaluation, while Claude Opus 4.6 scores around 65.4 percent in Anthropic’s reported results. Treat that as a sign about where Codex spends its effort. It is built to act inside terminal-shaped work, not only to write a plausible patch.

Source: Anthropic

This creates a distinct momentum mode.

It feels like pair programming with someone who types much faster than you and keeps running the program while you are still reading the diff.

It will sometimes reach for a package or an import that is not in your stack, but the recovery is quick because it immediately hits the build, sees the failure, and corrects the attempt in the next pass.

That makes Codex a strong fit for brownfield work. Bug fixes, unit tests, small feature additions, and cleanup tasks reward speed over elegance. Claire Vo’s experiment is the clearest proof point. She reports shipping 44 pull requests in five days using these models, and her results show Codex behaving like the closer that turns loops into merged code.

Source: Claire Vo, “Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days,” Lenny's Newsletter

The Context Battle: 1M Tokens vs. Repo-RAG

Claude Opus 4.6 and GPT 5.3 Codex can look similar on the surface because both can edit a repository and both can produce working code. The difference is how each model forms knowledge about your codebase.

Opus leans on sheer context capacity.

Opus 4.6 supports very large prompts, with 200K tokens as the standard limit and a 1M token context window available in beta on the Claude Developer Platform.

When you load large slices of the repo, the model can carry a more continuous mental model across modules, conventions, and edge cases. That is valuable during major refactors because the risk is not writing code. The risk is breaking an assumption that lives in a different folder. Migration work like moving an app from React to Svelte is full of those buried assumptions.

Codex often reaches similar outcomes through retrieval.

Instead of holding the whole codebase in the prompt, it pulls the most relevant files and focuses effort there. This is faster and cheaper when the problem is local, but it can miss cross-file invariants because it only sees what it retrieved. The model edits the correct file, yet the change may conflict with a pattern set elsewhere.

Use a simple rule. When a rename or refactor touches dozens of files, use Opus. When a fix lives in a single function within a single file, use Codex.

Pricing & Economics: The $28 vs $0.12 Reality

Economics changes the decision faster than benchmarks.

You can admire Opus 4.6 for its deliberation and still choose not to run it on every small question. The model price is not a rounding error. Anthropic lists Opus 4.6 at 5 dollars per million input tokens and 25 dollars per million output tokens, so long outputs and multi-pass reasoning can add up quickly.

A recent thread on r/SlashClaudeAI made the gap concrete. A user named DutchesForKaioSama described a complex task that came out to 28.70 dollars on Opus, while a similar outcome cost 0.12 dollars on Codex.

Source: Reddit

Even if you treat those numbers as anecdotal, the ratio is the point. When you pay for deliberation, you pay for tokens and for time spent thinking.
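
To make the arithmetic concrete, here is a minimal sketch of a per-request cost estimate using the list prices quoted above (5 dollars per million input tokens, 25 dollars per million output tokens). The token counts in the example are hypothetical, and the sketch ignores caching, batching, and any other pricing tiers, so treat it as a back-of-the-envelope check rather than a billing calculator. It also computes the ratio between the two anecdotal Reddit figures.

```python
# List prices quoted in the article for Opus 4.6 (USD per million tokens).
OPUS_INPUT_USD_PER_MTOK = 5.00
OPUS_OUTPUT_USD_PER_MTOK = 25.00


def opus_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    input_cost = (input_tokens / 1_000_000) * OPUS_INPUT_USD_PER_MTOK
    output_cost = (output_tokens / 1_000_000) * OPUS_OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Hypothetical request: a 200K-token prompt with 20K tokens of output.
example_cost = opus_cost_usd(200_000, 20_000)
print(f"{example_cost:.2f}")  # 1.50

# Ratio between the two anecdotal figures from the Reddit thread.
reported_ratio = 28.70 / 0.12
print(f"{reported_ratio:.0f}x")  # 239x
```

A single large-context request is cheap on its own. The bill grows when an agent repeats that request across many passes, which is why the ratio matters more than either absolute number.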

This is why Opus is a poor default for casual chat. Use it like a contractor.

Bring it in when the task has architectural risk, repo-wide consequences, or requirements you cannot afford to get wrong. Keep it out of simple syntax questions, quick formatting, and routine unit test boilerplate.

Codex fits the always-on role because iteration is cheap. Let it run the loops. Save Opus for the moments where a careful plan prevents a week of cleanup.

The "Hybrid" Workflow: Manager & Intern

A clean way to use both models is to treat them as two roles in the same engineering loop.

One role produces a careful plan that reduces architectural risk.
The other role turns that plan into diffs and runs the feedback cycle until the work is shippable.

Start with Opus 4.6 for planning.

Give it the requirements, the constraints, and the acceptance criteria. Ask for a short spec, interface definitions, and an implementation plan that is broken into steps you can execute one at a time.

Opus is good at this because it enters a deliberate reasoning phase and maintains more global constraints throughout the design. You are paying for that deliberation, so use it where it changes the shape of the work.

Move to Codex for execution.

Paste the plan into Codex and constrain it to one step. Tell it to implement step one, run tests, fix failures, then stop and report.

Codex is designed for tool-using loops and fast iteration, so it is a strong fit for writing the code, running commands, and grinding through the errors without constant supervision.

Bring Opus back for review. Paste the final diff and ask for a logic and security audit. Focus it on auth flows, input validation, permission checks, and failure states. This is where a slower model can catch mismatched assumptions and corner cases.

Claire Vo describes using different models at different stages of the pull request lifecycle to maximize return on spend, and this workflow turns that idea into a repeatable routine you can adopt immediately.
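
The plan, execute, audit loop described above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not either vendor's API: `plan`, `execute`, and `audit` are hypothetical callables that you would wire to your own Opus and Codex integrations, and the stubs below exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    step: str
    diff: str
    tests_passed: bool


def run_hybrid_workflow(
    requirements: str,
    plan: Callable[[str], List[str]],      # planner role (Opus in the article)
    execute: Callable[[str], StepResult],  # executor role (Codex in the article)
    audit: Callable[[str], List[str]],     # reviewer role (Opus again)
) -> Dict[str, object]:
    """Plan once, execute one step at a time, then audit the combined diff."""
    steps = plan(requirements)
    results: List[StepResult] = []
    for step in steps:
        # The executor is expected to run tests and fix failures itself,
        # then stop and report for this single step.
        result = execute(step)
        results.append(result)
        if not result.tests_passed:
            return {"status": "stopped", "failed_step": step,
                    "results": results, "findings": []}
    findings = audit("\n".join(r.diff for r in results))
    status = "needs_changes" if findings else "ready"
    return {"status": status, "failed_step": None,
            "results": results, "findings": findings}


# Hypothetical stubs standing in for real model calls.
def fake_plan(requirements: str) -> List[str]:
    return ["define interfaces", "implement handler", "add tests"]


def fake_execute(step: str) -> StepResult:
    return StepResult(step=step, diff=f"+ {step}", tests_passed=True)


def fake_audit(diff: str) -> List[str]:
    return [] if "add tests" in diff else ["no tests in diff"]


outcome = run_hybrid_workflow("rotate session tokens",
                              fake_plan, fake_execute, fake_audit)
print(outcome["status"])  # ready
```

Stopping at the first step whose tests still fail keeps the expensive audit from running on a diff that is already known to be broken.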

Decision Matrix & Conclusion

Use this decision matrix when you want a fast answer without rethinking the tradeoffs.

Complex Logic and New App: Use Opus 4.6
Bug Fixing and Terminal Ops: Use GPT 5.3 Codex
Refactoring Legacy Code: Use Opus 4.6
Writing Tests: Use GPT 5.3 Codex

Note this: You are not choosing a winner. You are choosing a role.

Opus is the call when the work needs a stable design, and one correct pass matters more than speed.

Codex is the call when the work is a loop and the fastest path is to run commands, fix failures, and repeat until green.

The one-model strategy is not how teams will work in 2026. The winning setup is a router that assigns work to the right model based on risk, scope, and iteration cost.
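
As a rough illustration, a router of this kind can start as a lookup table plus a couple of overrides. Everything in the sketch below is an assumption made for the example: the task categories mirror the decision matrix, the model names are plain labels rather than real API model IDs, and the `MANY_FILES` threshold is one arbitrary reading of "dozens of files".

```python
from dataclasses import dataclass

# Labels only, not real API model identifiers.
OPUS = "opus-4.6"
CODEX = "codex-5.3"

# Decision matrix from the article, keyed by hypothetical task categories.
MATRIX = {
    "complex_logic": OPUS,
    "new_app": OPUS,
    "refactor": OPUS,
    "bug_fix": CODEX,
    "terminal_ops": CODEX,
    "tests": CODEX,
}

# Assumed threshold for "touches dozens of files"; tune for your own repo.
MANY_FILES = 24


@dataclass
class Task:
    kind: str
    files_touched: int = 1
    architectural_risk: bool = False


def route(task: Task) -> str:
    """Pick a model label based on risk and scope, then task category."""
    # Risk and scope override the category: wide or risky changes go to Opus.
    if task.architectural_risk or task.files_touched >= MANY_FILES:
        return OPUS
    # Unknown categories fall back to the cheaper, always-on role.
    return MATRIX.get(task.kind, CODEX)


print(route(Task("bug_fix")))                    # codex-5.3
print(route(Task("bug_fix", files_touched=40)))  # opus-4.6
print(route(Task("refactor")))                   # opus-4.6
```

Defaulting unknown work to Codex follows the cost argument earlier in the article: cheap iteration is the safe default, and the expensive model is reserved for tasks flagged as risky or wide in scope.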

Engineers who ship consistently do not take sides. They pick a roster.

Shipping Fast And Iterating At AI-Speed | Takeaways For Founders And Product Leaders

Arsh Shah Dilbagi — Wed, 11 Feb 2026 13:50:02 GMT

TLDR: Shipping Fast and Iterating at AI Speed explores why traditional startup speed advice fails in AI development. The blog argues that real AI speed isn't about moving faster than competitors, but about learning velocity—understanding what "good" looks like and adapting quickly. It covers how short-term velocity destroys long-term progress through technical debt, why correctness is subjective in AI products, and how sustainable speed requires informed restraint, clear ownership, and reversible decisions. Readers will learn concrete principles from industry leaders on building feedback loops, maintaining team confidence through transparency, and designing systems flexible enough to survive the AI ecosystem's rapid changes. The key insight: the fastest teams avoid premature bets and focus on preserving optionality while maintaining strong signals about what matters.

Introduction

Founder Intro: Shipping Fast and Iterating at AI Speed

“Move fast” has always been a startup mantra. In AI, that advice has become dangerously ambiguous.

Teams ship more often than ever. Demos come together in days. Iteration feels constant. And yet, many companies still feel stuck, slowed not by lack of activity but by a lack of clarity.

Panel 4 was designed to unpack that tension. Rather than asking, “How do we ship faster?”, we wanted to ask a more precise question: What does speed actually mean when you’re building AI products, and what quietly destroys it over time?

To explore that, we brought together operators who are shipping at the edge of what’s possible, across very different contexts:

Daksh Gupta, Co-founder and CEO at Greptile, is building AI systems where correctness and iteration speed must coexist.
Evan Owen, Co-founder and CEO at Glue, is navigating fast-moving AI workflows where trust and learning loops matter more than raw throughput.
Ray Jang, Co-founder and CEO at Atria, operates at the intersection of automation, experimentation, and reliability.
Yen Tan, Product Manager at 15Five, brings a product- and user-centered lens to shipping in high-trust environments.

What emerged was a clear reframing of AI speed. This panel wasn’t about shipping more features or chasing every new model release. It was about learning velocity: how quickly teams understand what good looks like, detect when something feels off, and correct course without eroding trust.

Across the conversation, a consistent theme surfaced:

“Speed without direction is just noise.”
“Sustainable speed comes from tight feedback loops, informed restraint, and organizations designed to learn.”

The sections that follow break down what that looks like in practice, from why dogfooding beats dashboards early, to how feature flags enable safe aggression, to why trust behaves like a finite resource.

If you’re building with AI and feel like you’re moving fast but not forward, this panel offers a grounded perspective on what real velocity actually requires.

1. AI Speed Is About Learning What “Good” Looks Like

The panel opened by dismantling a common misconception: AI speed does not simply mean shipping faster.

Shipping faster is easy. Learning faster is hard.

What separates teams that actually move quickly from those that just move often is how fast they develop a shared understanding of quality.

Speed Comes From Signal, Not Velocity

Across the discussion, speakers converged on a more precise definition of AI speed.

AI speed is defined by:

How quickly teams learn what “good” outputs look like.
How fast they can tell when something feels off.
How early they can course-correct without breaking trust.

As Daksh Gupta, Co-founder and CEO at Greptile, emphasized, most AI teams don’t slow down because they ship too little. They slow down because they don’t know what to aim for.

Without a clear target, iteration becomes noise.

Correctness Is Often Ambiguous in AI Products

In traditional software, correctness is binary. Something works, or it doesn’t.

In AI products, correctness is often subjective.

As Yen Tan, Product Manager at 15Five, described, this ambiguity shows up most clearly in:

Creative workflows,
Generative systems,
Judgment-based tasks,
Assistive experiences.

Outputs can be plausible without being good. They can be technically correct but emotionally wrong. They can pass automated checks and still fail user expectations.

This makes iteration fundamentally harder.

Without Quality Signals, Teams Thrash

Several speakers described a familiar failure mode:

Teams ship quickly,
Outputs look reasonable,
Feedback is vague,
Iteration continues blindly.

As Ray Jang, Co-founder and CEO at Atria, noted, without fast, reliable signals on quality, teams end up oscillating—changing prompts, models, or workflows without knowing whether they’re actually improving anything.

The result is activity without progress.

“Feels Off” Is an Important Signal

One of the more subtle insights from the panel was the importance of intuition early on.

As Evan Owen, Co-founder and CEO at Glue, explained, experienced teams learn to trust early discomfort. When outputs feel off—even if they technically pass—that’s often the first indicator that assumptions are wrong, or constraints are missing.

Teams that move fast don’t ignore that signal. They investigate it immediately.

Speed comes from shortening the gap between:

Noticing something feels wrong,
Understanding why,
Fixing the underlying cause.

Directional Clarity Beats Raw Throughput

The panel repeatedly returned to the idea that speed without direction is wasted motion.

AI makes it easy to:

Generate more outputs,
Try more variations,
Explore more options.

But without a shared definition of “good,” those options don’t converge.

As one speaker summarized:

The fastest teams aren’t the ones shipping the most changes.
They’re the ones learning what to keep.

The Practical Takeaway

AI speed isn’t about how fast you deploy. It’s about how fast you learn.

Teams that truly move quickly:

Define quality early,
Develop strong instincts for “wrong”,
Create tight feedback loops,
Correct course before problems compound.

In AI products, learning velocity beats shipping velocity.
Speed without clarity feels productive—until it isn’t.

2. Short-Term Velocity Can Destroy Long-Term Velocity

One of the most consistent warnings across the panel was a counterintuitive one:

The fastest way to slow down permanently is to optimize too aggressively for short-term speed.

In an ecosystem that rewards quick demos and rapid iteration, this is an easy trap to fall into, and a hard one to escape.

Early Momentum Often Comes From Fragile Choices

Several speakers described how teams often gain early momentum by making expedient decisions:

Choosing frameworks optimized for speed over control.
Hardcoding integrations instead of designing interfaces.
Building around temporary standards.
Overfitting workflows to current model capabilities.

These choices feel rational in the moment. They produce visible progress. They reduce upfront friction.

As Daksh Gupta, Co-founder & CEO of Greptile, explained, many of these decisions aren’t mistakes. They’re unexamined commitments that accumulate quietly.

The Hidden Cost of Expedience

What looks like speed early often shows up later as a constraint.

As products mature, those early shortcuts create:

Architectural lock-in.
Brittle abstractions.
Painful migrations.
Slow, risky changes.
Fear of touching core systems.

As Ray Jang, Co-founder & CEO of Atria, noted, teams often don’t realize they’ve slowed down until they’re already stuck. Every change requires workarounds. Every improvement risks regression. Momentum evaporates.

The system becomes fast to run, but slow to change.

AI Ecosystems Shift Faster Than Architecture

This problem is amplified in AI because the ecosystem itself is moving so quickly:

Models evolve.
APIs change.
Best practices shift.
Capabilities that felt stable six months ago suddenly aren’t.

As Evan Owen, Co-founder & CEO of Glue, pointed out, decisions that assume today’s model behavior will persist are especially dangerous. Overfitting to current capabilities may unlock speed now, but it creates fragility later, precisely when adaptation matters most.

Overfitting Is a Form of Technical Debt

The panel reframed overfitting in a broader sense.

It’s not just about data or prompts. It’s about designing systems that only work under narrow conditions.

Overfit systems:

Assume specific output formats.
Rely on implicit model behavior.
Break when context windows change.
Fail when reasoning patterns shift.

Each assumption tightens the system’s tolerance for change.

True Speed Requires Optionality

The teams that sustained velocity over time shared one trait: informed restraint.

They:

Moved fast where reversibility was high.
Slowed down where decisions were expensive to undo.
Avoided locking in assumptions prematurely.
Designed interfaces, not shortcuts.

As Yen Tan, Product Manager at 15Five, emphasized, speed isn’t just about shipping. It’s about preserving the ability to change direction without breaking everything.

Velocity Is a Function of Confidence

Another subtle insight from the panel was that teams slow down not just because systems are brittle, but because people lose confidence.

When:

Changes feel risky.
Behavior is hard to predict.
Regressions are costly.

Teams hesitate. Reviews drag on. Releases slow. Innovation stalls.

Short-term speed that undermines confidence eventually kills momentum.

The Practical Takeaway

AI speed isn’t about maximizing short-term output.

It’s about:

Making reversible decisions quickly.
Deferring irreversible ones thoughtfully.
Preserving optionality.
Designing for change, not permanence.

True AI speed requires restraint: not caution, but judgment.

Move fast.

Just don’t move fast into a corner.

3. Long-Term Speed Requires Informed Foresight

Several speakers emphasized that sustaining velocity over time requires more than execution discipline.

It requires informed foresight.

Not clairvoyance.

Not perfect prediction.

But the ability to make educated bets about where the ecosystem is heading, and where it isn’t.

Speed Over Time Is About Betting on What Endures

In fast-moving AI environments, it’s tempting to treat everything as temporary.

Frameworks change.

Models improve.

Tooling evolves monthly.

“Best practices” have short half-lives.

But as the panel made clear, some things do last longer than others, and knowing the difference is what separates teams that compound velocity from those that reset every six months.

As Daksh Gupta, Co-founder & CEO of Greptile, noted, long-term speed comes from investing in abstractions that survive churn, even when the layers above them change.

The Cost of Dead-End Bets

Several speakers shared examples of teams that moved quickly, but in the wrong direction.

These teams:

Adopted tooling that couldn’t evolve.
Built on standards that never stabilized.
Committed deeply to APIs that were clearly transitional.
Optimized for the current model generation.

Each decision felt reasonable at the time.

Together, they created dead ends.

As Ray Jang, Co-founder & CEO of Atria, explained, the problem isn’t making bets; it’s making bets without understanding their reversibility. Dead-end bets don’t just slow teams down. They force rewrites.

AI Makes Reactivity Expensive

Because the AI landscape changes so quickly, many teams default to being reactive.

New model? Switch immediately.

New framework? Rewrite.

New technique? Adopt everywhere.

The panel warned that this behavior creates motion, but not progress.

As Evan Owen, Co-founder & CEO of Glue, put it, reactive teams feel fast until they realize they’re constantly rebuilding the same system with slightly different parts.

Speed becomes cyclical instead of compounding.

Selective Proactivity Is the Real Advantage

The fastest teams described on the panel weren’t chasing every change.

They were selectively proactive.

They:

Tracked where standards were converging.
Waited for signal before committing.
Designed internal interfaces to absorb change.
Insulated core logic from external volatility.

As Yen Tan, Product Manager at 15Five, emphasized, foresight isn’t about predicting the future; it’s about limiting the blast radius when the future arrives.
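One way to picture an internal interface that absorbs change is a thin protocol between core logic and whichever model provider sits behind it. The sketch below is illustrative only: `TextModel`, `EchoModel`, `UppercaseModel`, and `summarize` are hypothetical names, and the two stand-in providers replace what would be real vendor adapters in practice.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface core logic is allowed to depend on (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for illustration; a real adapter would wrap a vendor SDK."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UppercaseModel:
    """A second stand-in, showing that providers can be swapped without touching callers."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(model: TextModel, text: str) -> str:
    """Core logic: knows about the interface, never about a specific vendor."""
    return model.complete(f"Summarize: {text}")
```

Swapping `EchoModel()` for `UppercaseModel()` changes the output but not `summarize` itself. When a provider changes, the blast radius stays inside one adapter class.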

Understanding Direction Beats Knowing Timing

Another important reframe from the panel was that timing matters less than direction.

You don’t need to know when a standard will win.

You need to know whether it’s likely to matter.

Teams that understood direction:

Avoided one-off integrations.
Favored open interfaces.
Resisted premature optimization.
Chose boring, stable layers where possible.

That restraint allowed them to move faster later, when clarity emerged.

Foresight Is a Team Skill

Importantly, foresight wasn’t described as a founder superpower.

It was treated as an organizational capability.

Teams built foresight by:

Discussing ecosystem trends openly.
Revisiting architectural assumptions regularly.
Questioning “why this now?”
Rewarding reversibility over cleverness.

Over time, this created shared intuition and faster decision-making.

The Practical Takeaway

Long-term AI speed isn’t about reacting faster than everyone else.

It’s about:

Understanding where the ecosystem is heading.
Avoiding bets that trap you.
Investing in abstractions that outlast hype.
Moving early only when it matters.

The fastest teams don’t chase change.

They position themselves so change can’t knock them off balance.

4. Dogfooding Is the Highest-Leverage Evaluation Mechanism

One of the most practical insights from the panel was also one of the simplest:

The best early evaluation system is lived experience.

Before metrics.
Before dashboards.
Before formal eval frameworks.

Teams need to feel their product.

Formal Evals Come Too Late for Early Learning

Several speakers cautioned against jumping too quickly into formal evaluation systems.

Evals are powerful, but only once teams already understand:

What good looks like.
Which failures matter.
Where nuance lives.

Before that understanding exists, evals tend to encode guesses rather than truth.

As Daksh Gupta, Co-founder & CEO of Greptile, emphasized, premature evals often give teams false confidence. They pass checks while the product quietly degrades in ways the metrics don’t capture.

“This Feels Wrong” Is a Real Signal

A recurring phrase on the panel was some version of:

“This feels wrong.”

That instinct, especially from domain experts, surfaced again and again as an early warning signal.

As Yen Tan, Product Manager at 15Five, explained, when people who understand the problem deeply start to hesitate, something important is usually off. The issue might not be obvious. It might not be measurable yet. But ignoring that signal almost always leads to larger failures later.

Early intuition isn’t noise.
It’s compressed experience.

Dogfooding Exposes What Metrics Miss

Dogfooding forces teams to confront the product as it actually behaves, not how they hope it behaves.

When teams use their own product daily:

Subtle regressions surface.
Quality decay becomes obvious.
Friction accumulates visibly.
Edge cases repeat.

As Ray Jang, Co-founder & CEO of Atria, noted, dashboards rarely capture the emotional texture of a product. Dogfooding does.

You notice when:

Outputs start to feel generic.
Responses drift off tone.
Latency becomes irritating.
Trust erodes slightly but consistently.

These are the signals that matter most early.

Shared Intuition Accelerates Teams

Another benefit the panel highlighted was alignment.

Dogfooding builds:

Shared intuition across engineering, product, and GTM.
Common language for quality.
Faster decision-making.

When everyone has felt the pain personally, debates get shorter. Teams don’t argue abstractly about metrics; they argue from experience.

As Evan Owen, Co-founder & CEO of Glue, put it, teams that dogfood aggressively don’t need long spec documents to explain why something needs fixing. Everyone already knows.

When Formal Evals Actually Help

The panel wasn’t dismissive of formal evaluation, just precise about timing.

Formal evals work best when:

Intuition is already strong.
Failure modes are known.
Quality criteria are shared.
The team agrees on tradeoffs.

At that point, evals scale understanding.
Before that point, they obscure it.

The Practical Takeaway

Dogfooding isn’t a culture perk.
It’s an evaluation strategy.

The teams that move fastest:

Live inside their product.
Trust early discomfort.
Use intuition to guide iteration.
Add formal evals once meaning exists.

In AI products, you can’t measure what you don’t yet understand.

Understanding comes first.
Automation follows.

5. Evals Prevent Regression — They Don’t Create Insight

The panel was clear, and notably aligned, on one point:

Evals are often introduced too early.

Not because evals are bad, but because teams frequently expect them to do the wrong job.

What Evals Are Actually Good At

When used correctly, evals are extremely effective.

They:

Prevent systems from getting worse.
Enforce known baselines.
Catch regressions early.
Scale judgment once patterns are understood.

As Ray Jang, Co-founder & CEO of Atria, described, evals are invaluable once a team already knows what quality looks like. At that point, they act as guardrails, ensuring progress doesn’t slip backward as systems evolve.

But guardrails don’t decide where you’re going.
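As a rough illustration of evals acting as guardrails, the sketch below compares current scores against a stored baseline and reports only what got worse. The function name, the per-case score dictionaries, and the 0.02 tolerance are assumptions made for the example. Nothing in it decides what “good” means; it only checks that known-good behavior has not slipped.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin; tune per eval
) -> List[str]:
    """Return eval cases that dropped more than `tolerance` below baseline or went missing."""
    regressions = []
    for case, base_score in baseline.items():
        score = current.get(case)
        # A missing case counts as a regression: the guardrail should not silently disappear.
        if score is None or score < base_score - tolerance:
            regressions.append(case)
    return sorted(regressions)
```

A release check could call `find_regressions` and block the change whenever the returned list is non-empty, leaving the question of what to add to the baseline to human review.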

The Risk of Introducing Evals Too Early

Several speakers warned that early-stage AI teams often reach for evals before they’ve earned them.

When evals are introduced prematurely, they tend to:

Cap quality too early.
Freeze incomplete assumptions.
Obscure creative exploration.
Incentivize optimization against the wrong signals.

As Daksh Gupta, Co-founder & CEO of Greptile, noted, early evals often reflect what teams think matters, not what actually does. Once encoded, those assumptions quietly shape every future decision.

What feels like rigor becomes constraint.

Insight Comes From Humans, Not Metrics

A recurring theme across the panel was that insight emerges from exposure, not automation.

Early-stage AI products benefit far more from:

Human review of outputs.
Direct customer conversations.
Qualitative feedback.
Rapid iteration driven by intuition.

As Yen Tan, Product Manager at 15Five, explained, insight requires context. It requires understanding why something feels wrong, not just that it failed a check. That depth simply can’t be automated early on.

Metrics without understanding are misleading.

Evals Encode Assumptions — Whether You Want Them To or Not

One of the most important cautions from the panel was that evals always encode values.

They define:

What “good” means.
Which failures matter.
Which tradeoffs are acceptable.

When those definitions are immature, evals lock teams into a narrow view of quality.

As Evan Owen, Co-founder & CEO of Glue, put it, once an eval exists, teams naturally optimize for it, even if it no longer reflects reality. Exploration slows. Creativity narrows. Learning stalls.

Guardrails, Not Steering Wheels

This led to one of the clearest metaphors of the panel:

Evals are guardrails, not steering wheels.

They prevent disaster.
They don’t choose a direction.

Teams that try to steer with evals early often end up driving confidently in the wrong direction.

The Practical Takeaway

The fastest AI teams sequence evaluation deliberately.

They:

Learn through humans first.
Build intuition around quality.
Identify stable patterns.
Then encode those patterns into evals.

Used this way, evals accelerate progress without freezing it.

In AI products, understanding precedes automation.

If you automate judgment before you’ve developed it, you don’t move faster; you just lock in ignorance.

6. Teams That Ship Fast Collapse Distance Between Thinking & Doing

A recurring operational insight from the panel was deceptively simple:

Communication is lossy — especially in fast-moving environments.

Every handoff introduces delay. Every translation risks distortion. Every layer adds friction.

The teams that ship fastest aren’t necessarily working harder. They’re working with less distance between thinking and doing.

Speed Comes From Collapsing the Loop

Across examples, the panel highlighted the same pattern:

Teams maximize velocity when:

The same person designs, builds, ships, and iterates.
Ownership spans the full lifecycle of a feature.
Feedback flows directly to the builder.

As Daksh Gupta, Co-founder & CEO of Greptile, emphasized, this collapse of roles doesn’t eliminate rigor — it eliminates delay. Decisions happen where context already lives.

Handoffs Are Hidden Taxes

In theory, specialization increases efficiency. In practice, handoffs impose invisible costs.

Each handoff requires:

Re-explaining intent.
Re-establishing context.
Re-interpreting feedback.

As Ray Jang, Co-founder & CEO of Atria, noted, even perfect documentation can’t fully transmit intuition. What gets lost isn’t just information — it’s judgment.

In AI products, where quality is often subjective and evolving, that loss is especially expensive.

Feedback Is Only Useful If It’s Immediate

Another theme that emerged was the importance of feedback proximity.

When feedback:

Reaches the builder quickly.
Arrives unfiltered.
Includes real user context.

Iteration accelerates.

As Yen Tan, Product Manager at 15Five, explained, teams slow down when feedback is delayed, summarized, or abstracted. By the time it reaches the person who can act on it, urgency — and insight — have faded.

Fast teams shorten that path aggressively.

Ownership Creates Judgment

The panel also emphasized that ownership isn’t just about accountability — it’s about learning.

When the same person:

Makes the decision.
Implements the solution.
Observes the outcome.
Feels the failure.

They develop judgment rapidly.

As Evan Owen, Co-founder & CEO of Glue, shared, teams that fragment ownership fragment understanding. No one fully knows why something works — or why it doesn’t.

Judgment accumulates fastest when responsibility is continuous.

Thinking and Execution Belong Together

One of the most resonant reframes of the section was this:

Speed increases not because people work harder — but because thinking and execution happen in the same head.

When design, implementation, and iteration are separated, speed decays. When they’re unified, momentum compounds.

This doesn’t mean eliminating collaboration. It means eliminating unnecessary translation.

The Practical Takeaway

Teams that move fast don’t optimize for efficiency on paper.

They optimize for:

Tight ownership loops.
Minimal handoffs.
Direct feedback.
Continuous learning.

In AI products, where quality signals are subtle and shifting, distance is the enemy of speed.

Collapse the distance — and speed follows.

7. Customer Obsession Beats Process Optimization

Despite the panel’s technical depth, the conversation kept circling back to a simple truth:

Customers are the fastest feedback system available.

No internal process, tool, or framework can compete with direct exposure to real usage.

Process Doesn’t Create Insight — Exposure Does

Many teams try to move faster by refining internal processes:

Better roadmaps.
Tighter sprint rituals.
More detailed specs.
More sophisticated tooling.

The panel was blunt about the limitations of this approach.

As Daksh Gupta, Co-founder & CEO of Greptile, noted, process can reduce chaos — but it doesn’t create understanding. Teams that rely too heavily on internal abstractions often end up optimizing for the wrong problems.

Speed comes from knowing what to build — not just how to build it efficiently.

High-Velocity Teams Stay Close to Users

The fastest teams described on the panel shared one defining habit: constant customer contact.

They:

Talk to users weekly — sometimes daily.
Onboard customers themselves.
Watch real usage in real contexts.
Feel confusion and delight firsthand.

As Evan Owen, Co-founder & CEO of Glue, explained, nothing accelerates learning like watching someone struggle with your product in real time. Feedback becomes concrete. Priorities become obvious.

Abstract Requests Hide Real Needs

Another recurring insight was that customer requests are often misleading.

Users ask for features. They describe symptoms. They propose solutions.

But as Yen Tan, Product Manager at 15Five, pointed out, the real work is understanding why they’re asking. That understanding rarely comes from tickets or surveys. It comes from observing behavior.

Teams that prioritize based on lived feedback move faster than those reacting to abstract input.

Proximity Collapses Feedback Loops

Customer proximity shortens feedback loops in ways no internal system can replicate.

When teams are close to users:

Misalignment is obvious immediately.
Incorrect assumptions are exposed early.
Course correction happens faster.
Iteration becomes confident.

As Ray Jang, Co-founder & CEO of Atria, noted, teams often underestimate how much time they lose by guessing instead of asking — or by interpreting instead of observing.

Obsession Is a Practical Choice

The panel was careful to separate customer obsession from performative empathy.

This isn’t about:

NPS slogans.
Empathy workshops.
Abstract personas.

It’s about:

Proximity.
Frequency.
Firsthand exposure.

Customer obsession isn’t a cultural value. It’s an operational strategy.

The Practical Takeaway

If speed is the goal, customer proximity is the lever.

The teams that ship fastest:

Stay close to real usage.
Trust lived feedback over speculation.
Let customers shape priorities directly.
Reduce internal debate by increasing external clarity.

In AI products, where quality is contextual and evolving, customers are the fastest way to find the truth.

No process can substitute for that.

8. Feature Flags Enable Safe Aggression

One of the most practical themes to emerge from the panel was that shipping fast does not mean shipping recklessly.

High-velocity teams don’t move carefully — they move contained.

Feature flags surfaced repeatedly as one of the most important tools for making that possible.

Speed Requires the Ability to Contain Risk

AI products introduce uncertainty by default.

Outputs vary.

Behavior shifts.

Edge cases surface unpredictably.

In that environment, shipping changes broadly and permanently is dangerous.

As Ray Jang, Co-founder & CEO of Atria, emphasized, teams that move fast sustainably all share one trait: they can limit blast radius. Feature flags give teams that control.

They allow teams to:

Isolate risk.
Control who sees what.
Roll out changes incrementally.
Pull back instantly if something breaks.

Speed without containment isn’t velocity — it’s gambling.
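The four controls listed above map onto a small amount of code. The following is a minimal sketch, not a production flag system: the `Flag` fields, the hash-based bucketing, and the allow-list are assumptions chosen to show a kill switch, targeted exposure, and incremental rollout in one place.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    name: str
    enabled: bool = True        # kill switch: set to False to pull back instantly
    rollout_percent: int = 0    # 0-100, share of users who see the change
    allow_users: set = field(default_factory=set)  # e.g. internal team or power users


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so exposure does not flip between requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    """Decide whether this user sees the flagged behavior."""
    if not flag.enabled:
        return False                      # the kill switch overrides everything
    if user_id in flag.allow_users:
        return True                       # control who sees what
    return bucket(flag.name, user_id) < flag.rollout_percent  # incremental rollout
```

Raising `rollout_percent` widens exposure gradually, and because bucketing is deterministic, a user who saw the feature at 10% still sees it at 50%. Setting `enabled` to `False` removes it for everyone, including allow-listed users.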

Flags Turn Experiments Into Reversible Decisions

A recurring insight was that reversibility is the foundation of speed.

Feature flags turn what would otherwise be hard commitments into reversible bets.

As Daksh Gupta, Co-founder & CEO of Greptile, noted, teams are far more willing to experiment aggressively when they know they can turn something off without damage. That psychological safety unlocks real momentum.

Without flags, every experiment feels existential.

With flags, experimentation becomes routine.

Early Adopters Are Not the Same as Everyone Else

Another key point was segmentation.

Not all users want — or tolerate — the same level of experimentation.

Feature flags allow teams to:

Expose new capabilities to power users.
Test with internal teams first.
Learn from early adopters.
Protect broader user trust.

As Yen Tan, Product Manager at 15Five, explained, trust is fragile in AI products. Once users lose confidence, it’s difficult to earn back. Flags allow teams to learn without burning that trust.

Reliability and Experimentation Are Not Opposites

The panel strongly rejected the idea that teams must choose between speed and reliability.

The fastest teams do both — by separating learning from exposure.

Feature flags make that separation explicit.

As Evan Owen, Co-founder & CEO of Glue, shared, flags allow teams to test bold ideas while keeping the core experience stable. Users experience consistency, while teams gain insight.

That balance is what allows iteration at AI speed without chaos.

Safe Aggression Is a Design Principle

What emerged was a broader principle:

Move aggressively — but only where failure is contained.

Feature flags operationalize that principle.

They:

Encourage experimentation.
Reduce fear of shipping.
Protect user trust.
Preserve optionality.

Without them, teams naturally become conservative.

With them, teams can be bold — responsibly.

The Practical Takeaway

Speed in AI products isn’t about recklessness.

It’s about controlled risk.

Teams that ship fast:

Isolate experiments.
Segment exposure.
Learn quickly.
Revert instantly.

Feature flags don’t slow teams down.

They make it safe to move faster.

In an AI-first world, aggression without containment is chaos — but aggression with guardrails is progress.

9. Trust Is a Battery — Spend It Carefully

Across multiple parts of the discussion, trust kept coming up — not as a vague brand concept, but as a finite operational resource.

The panel consistently framed it this way:

Trust behaves like a battery.
It charges slowly.
It drains quickly.

And once it’s depleted, speed collapses.

Early Products Must Earn Trust Before Spending It

The panel was clear that early-stage AI products don’t have the luxury of experimentation at scale.

Before teams can move aggressively, they must:

Nail table-stakes experiences.
Behave predictably.
Avoid surprising failures.
Demonstrate basic reliability.

As Yen Tan, Product Manager at 15Five, noted, users are far more sensitive early on. When trust hasn’t been established yet, even small inconsistencies feel disproportionate.

Early trust isn’t built by novelty.

It’s built on dependability.

Trust Decays Faster Than It Accumulates

Several speakers emphasized how asymmetrical trust really is.

It takes:

Repeated successful interactions.
Consistent behavior.
Clear boundaries.

to build trust.

But it takes:

One confusing output.
One silent failure.
One unexplained change.

to start draining it.

As Daksh Gupta, Co-founder & CEO of Greptile, pointed out, AI systems feel especially brittle because they present confident outputs even when they’re wrong. That makes trust loss sharper — and recovery harder.

Experimentation Is a Privilege, Not a Right

A recurring theme was that experimentation must be earned.

Once trust is established, teams gain:

Room to experiment.
Tolerance for occasional failure.
Forgiveness for iteration.
User patience during change.

As Ray Jang, Co-founder & CEO of Atria, explained, trusted products can ship imperfect updates and recover quickly. Untrusted products can’t survive even minor missteps.

Trust buys optionality.

Small Mistakes Compound When Trust Is Low

Without trust, every issue feels bigger than it is.

Minor bugs turn into reasons to churn.

Ambiguous behavior becomes incompetence.

Iteration feels like instability.

As Evan Owen, Co-founder & CEO of Glue, shared, teams often underestimate how much damage is caused not by catastrophic failures — but by frequent, low-grade disappointment.

Without trust, those moments stack up fast.

Spend Trust Where Learning Is Highest

The panel also emphasized that trust should be spent intentionally.

When teams do experiment, they should:

Do it where learning is maximized.
Isolate exposure carefully.
Communicate changes clearly.
Roll back quickly when needed.

As Daksh Gupta noted earlier, feature flags and segmentation aren’t just technical tools — they’re trust-management tools.

They allow teams to learn without draining the battery.

The Practical Takeaway

Trust isn’t an abstract virtue in AI products.

It’s fuel.

The fastest teams:

Build trust deliberately.
Protect it aggressively.
Spend it where learning is highest.
Replenish it through reliability.

In an AI-first world, trust determines how fast you’re allowed to move.

Spend it recklessly, and speed disappears.

Spend it wisely, and iteration compounds.


10. Customer Feedback Must Be Filtered, Not Obeyed

One of the final — and most important — clarifications from the panel was this:

Listening to customers is not the same as following them.

High-velocity teams do both — but they do them very differently.

Feedback Is Raw Data, Not Direction

The panel emphasized that customer feedback is inherently noisy.

Users:

Describe symptoms.
Articulate frustrations.
Suggest solutions.
React emotionally to outcomes.

But they rarely diagnose root causes accurately.

As Evan Owen, Co-founder & CEO of Glue, noted, treating every piece of feedback as a directive leads teams to chase surface-level fixes — and lose coherence over time.

Feedback is signal.

Direction requires judgment.

Caring Is Different From Complaining

A key distinction surfaced around how much users actually care.

Many users complain.

Very few are willing to change behavior.

Effective teams learn to distinguish:

Annoyance from urgency.
Requests from necessity.
Opinions from switching behavior.

As Daksh Gupta, Co-founder & CEO of Greptile, explained, the most valuable signals come from moments where users say, “I can’t do my job without this working.” Everything else requires scrutiny.

“Hell Yes” Outcomes Are Rare — and Precious

Several speakers emphasized the importance of identifying “hell yes” moments.

These are moments where:

Users light up.
Value is immediately obvious.
Behavior changes without prompting.
Adoption accelerates naturally.

As Ray Jang, Co-founder & CEO of Atria, shared, teams that optimize for lukewarm satisfaction move slowly. Teams that optimize for undeniable value move decisively.

Mediocre feedback leads to mediocre products.

Surveys Don’t Surface Tradeoffs — Conversations Do

Another clear takeaway was the limitation of surveys.

Surveys:

Flatten nuance.
Encourage safe answers.
Hide tradeoffs.

Tradeoff conversations, by contrast:

Force prioritization.
Surface real constraints.
Reveal what users would give up.

As Yen Tan, Product Manager at 15Five, noted, asking users to choose — not just react — exposes what truly matters.

Speed comes from clarity, not consensus.

Builders Own the Diagnosis

The panel repeatedly returned to a simple but powerful responsibility:

Customers describe symptoms.
Builders diagnose causes.

When teams outsource diagnosis to users, they lose control of the product’s direction.

The fastest teams:

Absorb feedback deeply.
Triangulate across users.
Test hypotheses quickly.
Make opinionated decisions.

They don’t abdicate judgment — they sharpen it.

The Practical Takeaway

Customer feedback is indispensable — and dangerous.

Used well, it:

Accelerates learning.
Validates direction.
Surfaces blind spots.

Used poorly, it:

Fragments focus.
Slows decision-making.
Erodes product coherence.

Especially in AI products, where complexity is high and quality is subtle, judgment is the bottleneck, not information.

Listen closely.

Filter aggressively.

Decide decisively.

That’s how teams ship fast — without losing their way.

11. “Minimum Lovable” Beats “Minimum Viable”

One of the most subtle — and powerful — reframings from the panel was this:

In AI products, “viable” is not enough.

What passes as acceptable in traditional software often fails immediately in AI.

AI Outputs Feel Personal — Whether You Intend Them To or Not

AI products don’t just execute instructions.

They respond.

They:

Speak in natural language.
Make suggestions.
Infer intent.
Appear confident.

As a result, users interpret outputs as judgment, not just functionality.

When an AI system gets something wrong, it doesn’t feel like a bug.

It feels like a misunderstanding.

As Yen Tan, Product Manager at 15Five, noted, this makes early impressions far more emotionally charged. Mistakes feel intelligent — and therefore scarier.

“Viable” Is a Low Bar for Trust-Heavy Systems

Minimum viable products are designed to answer one question:

Does this work at all?

In AI, that question is insufficient.

Because:

Trust is fragile.
Users don’t know system boundaries.
Failures feel personal.
Confidence amplifies error.

As Daksh Gupta, Co-founder & CEO of Greptile, explained, shipping something that technically works but feels careless or incoherent often does more damage than not shipping at all.

Users don’t wait for it to get better.

They leave.

Lovability Is About Respect, Not Polish

The panel was careful to distinguish lovable from polished.

Lovability doesn’t mean:

Perfect UX.
Flawless outputs.
Exhaustive feature sets.

It means the product feels:

Coherent.
Intentional.
Respectful of user intent.
Reliably useful in its core job.

As Ray Jang, Co-founder & CEO of Atria, shared, users forgive missing features. They don’t forgive feeling misunderstood or dismissed.

Lovability Creates Forgiveness

A recurring insight was that forgiveness is the real early-stage moat.

When a product feels lovable:

Users retry after failure.
They give feedback instead of churning.
They tolerate iteration.
They stay curious.

When a product feels merely viable:

Failures feel unacceptable.
Trust erodes quickly.
Churn accelerates.

As Evan Owen, Co-founder & CEO of Glue, noted, early-stage AI products live or die by whether users believe the team cares.

Lovability communicates care.

Minimum Lovable Sets the Right Floor

The panel ultimately reframed early-stage quality bars.

Instead of asking:

“Is this good enough to ship?”

High-velocity teams ask:

“Is this good enough to earn patience?”

That question leads to different decisions:

Tighter scope.
Clearer boundaries.
Fewer but better use cases.
More intentional defaults.

The Practical Takeaway

AI products don’t get graded like traditional software.

They’re judged as collaborators.

That raises the bar.

Minimum viable gets you tried.

Minimum lovable gets you trusted.

And in an AI-first world, trust is the only thing that lets you move fast without breaking everything that matters.

12. AI Speed Is Organizational, Not Just Technical

As the panel closed, one final theme became unmistakably clear: AI speed is not primarily a tooling problem. It’s an organizational one.

Models matter. Frameworks matter. Infrastructure matters. But none of them determines speed on its own.

Tools Don’t Learn — Teams Do

Throughout the discussion, speakers repeatedly returned to the same observation: Two teams can use the same models, the same frameworks, and the same tools — and move at radically different speeds. The difference isn’t technical sophistication. It’s how the organization learns.

AI speed is driven by:

Team structure.
Ownership models.
Cultural norms.
Decision-making velocity.
How feedback is interpreted and acted on.

As Daksh Gupta, Co-founder & CEO of Greptile, emphasized, teams don’t slow down because prompts are bad — they slow down because decisions get stuck.

Ownership Determines Learning Velocity

One of the strongest predictors of speed discussed on the panel was clear ownership.

Fast teams:

Know who decides.
Know who owns quality.
Know who responds to failure.
Don’t diffuse responsibility.

As Ray Jang, Co-founder & CEO of Atria, noted, ambiguity in ownership creates hesitation. And hesitation compounds quickly in fast-moving AI environments. When no one owns learning, learning slows.

Culture Shapes How Feedback Is Handled

Another recurring insight was that feedback is only as useful as the culture that processes it.

In slower organizations:

Feedback is debated endlessly.
Mistakes trigger defensiveness.
Learning is politicized.
Decisions wait for consensus.

In faster ones:

Feedback is welcome early.
Mistakes are treated as data.
Iteration is normalized.
Decisions move forward with imperfect information.

As Yen Tan, Product Manager at 15Five, explained, psychological safety isn’t just a people concept — it’s a speed multiplier. Teams that feel safe to surface problems do so earlier, when fixes are cheaper.

Decision Velocity Beats Decision Accuracy

The panel also reframed how teams should think about decision quality.

Perfect decisions are rare. Reversible decisions are common.

Fast AI teams:

Make decisions quickly.
Revisit them often.
Correct course early.
Avoid over-indexing on certainty.

As Evan Owen, Co-founder & CEO of Glue, put it, teams that wait for confidence rarely get it. Teams that act and observe learn faster. Speed comes from motion with feedback — not deliberation without data.

Learning Loops Are the Real Differentiator

Across all examples, one pattern dominated:

The fastest AI companies have the tightest learning loops.

They:

Ship small changes.
Observe real behavior.
Absorb feedback directly.
Adjust immediately.

Tooling supports this — but it doesn’t create it.

Learning loops are designed through:

Org structure.
Incentives.
Ownership.
Trust.

The Final Reframe

By the end of the panel, “AI speed” had been fully redefined.

It isn’t about:

Better prompts.
Faster GPUs.
Clever architectures.

It’s about:

Collapsing feedback loops.
Reducing organizational drag.
Empowering decision-makers.
Learning faster than competitors.

The Practical Takeaway

If your AI team feels slow, the bottleneck is rarely technical. It’s usually:

Unclear ownership.
Delayed decisions.
Filtered feedback.
Cultural friction.

The fastest teams don’t just build better systems. They build organizations designed to learn at AI speed. And in an ecosystem where technology converges quickly, learning speed is the only durable advantage left.

OpenClaw Is Not Magic; It's Just Good Architecture

Nilesh Barla — Sat, 07 Feb 2026 00:45:12 GMT

TLDR: OpenClaw feels alive, maybe near AGI, but it's not magic. It's event-driven architecture implemented correctly. This piece explains why triggers, queues, and persistent state create the illusion of intelligence, what makes agent assistants reliable in production, and where they fail. This blog is for engineers and builders who want to understand the machinery behind the hype, not just believe it.

Why The OpenClaw Hype Makes Sense

OpenClaw is easiest to understand as an always-on local assistant that can execute tools. It runs on a machine you control and listens for messages. It can also take actions such as reading files, running commands, or pulling information from services.

For engineers, that description translates cleanly. It is an event-driven runtime with persistent state.

Events arrive from a messaging surface or a schedule. The runtime turns those events into ordered work, calls models when needed, and persists what happened so the next event has context. That framing explains the excitement better than any claim about model intelligence.

Ben Goertzel’s “hands for a brain” metaphor makes sense because it points to the real differentiator.

Eurykosmotron

OpenClaw – Amazing Hands for a Brain That Doesn’t Yet Exist

A lot of people are excited about OpenClaw just now – and they should be. It’s a genuinely important piece of software -- an open-source, self-hosted agent runtime that lets AI systems reach out and touch the world through your laptop, connecting to file systems, browsers, APIs, shell commands, and a growing ecosystem of integrations. It’s language-mode…

Ben Goertzel

OpenClaw extends what a language model can do beyond generating text. It gives the model a set of practical hands, so the output is not only text. It is a changed file, a launched process, a completed check-in, or a scheduled action.

This is also why adoption is massive and still growing.

Many people do not need a system that writes better paragraphs. They need a system that handles life ops with low ceremony. A calendar change should not require three apps and ten taps. A reminder should not require re-explaining the same preferences each time.

Demos of OpenClaw in daily use tend to center on ordinary tasks like managing calendar items, controlling devices, checking in to flights, or handling small admin actions through a chat surface because those are repeatable and measurable.

One useful comparison for orientation is Claude Code.

If Claude Code is a familiar coding agent surface, OpenClaw is a life ops agent surface.

The rest of this article will stay on that systems lens. Execution, availability, and state are enough to produce the alive feeling, even when the underlying reasoning is ordinary.


Heartbeats And Triggers Create The Illusion Of Initiative

OpenClaw feels alive because it behaves like a running system rather than a chat window. The right term is reactive compute, which means work happens because events arrive, not because the assistant decides to be proactive.

Claire Vo’s framing is useful here. The system can have a heartbeat without having a brain.

The heartbeat is the machinery that keeps checking, waking up, and responding to new inputs.

Initiative, in this setup, is mostly scheduling and routing. A message comes in. A timer fires. A file changes. Something external updates. The runtime wakes up, runs a short sequence, and leaves behind state so the next event has context.

You can hold the whole behavior in one pipeline: inputs, then scheduler, then queue, then tools, and then state update.

The Gateway is the always-on intake layer that receives events from channels and integrations and routes them into the right session or workflow.

WhatsApp / Telegram / Slack / Discord / Google Chat / Signal / iMessage / BlueBubbles / Microsoft Teams / Matrix / Zalo / Zalo Personal / WebChat
               │
               ▼
┌───────────────────────────────┐
│            Gateway            │
│       (control plane)         │
│     ws://127.0.0.1:18789      │
└──────────────┬────────────────┘
               │
               ├─ Pi agent (RPC)
               ├─ CLI (openclaw …)
               ├─ WebChat UI
               ├─ macOS app
               └─ iOS / Android nodes

That pipeline is enough to explain why it looks like the system is taking initiative. It is not guessing what to do next. It is being triggered.
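
The pipeline can be sketched in a few lines of Python. This is a minimal illustration, not OpenClaw's actual code: the `Event` and `Runtime` names, the handler signature, and the in-memory `state` dict are all assumptions made for the example. What it shows is that nothing runs unless an event was enqueued, and that every run leaves a record behind for the next one.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    kind: str      # e.g. "timer", "message", "webhook", "file_change"
    payload: str


@dataclass
class Runtime:
    # Hypothetical structure for illustration; not OpenClaw's API.
    handlers: dict[str, Callable[[Event, dict], str]] = field(default_factory=dict)
    queue: deque = field(default_factory=deque)
    state: dict = field(default_factory=dict)

    def submit(self, event: Event) -> None:
        """Intake: events are only enqueued, never run inline."""
        self.queue.append(event)

    def run_pending(self) -> list[str]:
        """Drain the queue in arrival order, one event at a time."""
        results = []
        while self.queue:
            event = self.queue.popleft()
            handler = self.handlers.get(event.kind)
            if handler is None:
                continue  # no trigger registered for this kind, so nothing happens
            result = handler(event, self.state)
            # Persist what happened so the next event has context.
            self.state.setdefault("history", []).append((event.kind, result))
            results.append(result)
        return results


def on_message(event: Event, state: dict) -> str:
    """Stand-in for a model or tool call; it only counts messages."""
    seen = state.get("messages_seen", 0) + 1
    state["messages_seen"] = seen
    return f"reply #{seen} to {event.payload!r}"


runtime = Runtime(handlers={"message": on_message})
runtime.submit(Event("message", "hello"))
runtime.submit(Event("message", "status?"))
outputs = runtime.run_pending()
```

The second reply is numbered 2 only because the first run wrote to `state`. That small carry-over is the whole source of the apparent momentum.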

A few common trigger types cover most of the “alive” feeling:

Heartbeats that run on a timer, like every morning or every hour.
Inbound messages from a channel, like Telegram or Slack.
External events, like a calendar change or a webhook.
Local changes, like a file being updated in a watched folder.

Let’s look at two examples to better understand.

First, a daily briefing. A morning timer fires at 8 am. The runtime pulls calendar and reminder information, formats a brief, and stores it. The next day, it can compare and focus on what changed since yesterday rather than starting from scratch.

Second, a scheduled check. An hourly timer fires. The runtime checks one condition, sends an alert only if the condition flips, and records the last known value so it does not spam you. That record is the difference between a noisy bot and a useful assistant.
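
The scheduled check is small enough to sketch directly. The function name, the `state` dict standing in for durable storage, and the `disk_full` condition are hypothetical. Staying quiet on the very first observation is also an assumption of this sketch, since there is no previous value to compare against.

```python
from typing import Optional


def scheduled_check(state: dict, key: str, current: bool) -> Optional[str]:
    """Return an alert message only when the condition flips."""
    previous = state.get(key)
    state[key] = current  # record the last known value for the next run
    if previous is None or previous == current:
        return None  # first observation or no change: stay quiet
    return f"{key} changed: {previous} -> {current}"


# Five hourly runs against the same stored state.
state: dict = {}
readings = [False, False, True, True, False]
alerts = [scheduled_check(state, "disk_full", r) for r in readings]
```

Five runs produce two alerts, one per flip. Without the stored value, the same five runs would either alert on every `True` reading or never alert at all.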

This is also why “always on” matters. When events can wake the system, the system can appear to have momentum.

Queue-Based Execution Keeps Agent Workflows Reliable

Reliability is the difference between an agent demo and an agent you trust with real work. In a demo, the system runs one clean task in isolation. In real use, tasks overlap.

Messages keep arriving while a tool is running. A scheduled check fires while you are in the middle of a conversation. The runtime has to decide what runs now, what waits, and what is allowed to overlap.

The common failure mode is parallel tool calls without control. When two tasks run at once and both touch the same state, you get three kinds of damage.

Logs interleave, so you cannot tell which action produced which output.
Race conditions appear when two actions read and write the same files or external resources.
State drift creeps in when partial results land out of order, and the next step reads the wrong snapshot.

Queue-based execution is the simplest high-leverage fix.

Treat every requested action as a unit of work that must be scheduled. Give each session a boundary so one thread of work stays coherent. Make serial execution the default so ordering is predictable, then allow parallelism only for tasks you can prove are independent.

The Hesamation teardown describes this approach as lane-based command queues with per-session lanes, a concrete way to make serialization a first-class property rather than an afterthought.

A useful analogy is air traffic control. Planes can share airspace safely because takeoff and landing are sequenced. The system does not ban concurrency; it makes it explicit and governed. A queue does the same thing for tool calls.

A practical example is inbox work. One task is drafting a reply based on the latest thread. Another task is archiving old messages. If they run in parallel, the archiver can move the thread while the drafter is reading, or the drafter can quote content that is no longer in view. With a queue and session boundary, the system completes one coherent step, writes the result, and then moves to the next.
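
A rough sketch of per-session serialization, using Python's `asyncio`. It is not the lane implementation the teardown describes: here a per-session `asyncio.Lock` plays the role of a lane, and the `LaneQueue` class, session names, and job names are assumptions for illustration. Jobs in the same session run one after another, while jobs in different sessions may overlap.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable


class LaneQueue:
    """One lock per session: serial within a session, concurrent across sessions."""

    def __init__(self) -> None:
        # Locks are created lazily, the first time a session id is seen.
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def run(self, session_id: str, job: Callable[[], Awaitable[str]]) -> str:
        # asyncio.Lock hands the lock to waiters in the order they started waiting.
        async with self._locks[session_id]:
            return await job()


async def demo() -> list[str]:
    lanes = LaneQueue()
    log: list[str] = []

    def step(name: str) -> Callable[[], Awaitable[str]]:
        async def job() -> str:
            log.append(f"start {name}")
            await asyncio.sleep(0.01)  # stands in for a slow tool call
            log.append(f"end {name}")
            return name
        return job

    await asyncio.gather(
        lanes.run("inbox", step("draft-reply")),
        lanes.run("inbox", step("archive-old")),
        lanes.run("calendar", step("daily-brief")),
    )
    return log
```

Calling `asyncio.run(demo())` returns a log in which `archive-old` starts only after `draft-reply` has ended, while `daily-brief` starts without waiting for either. Serial is the default inside a session, and overlap happens only across sessions that do not share state.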

The architecture video frames the illusion of sentience as inputs, queues, and a loop that stays legible under load, which is exactly the reliability point.


Persistent Memory And Recall Create Continuity

Continuity is mostly persistent state plus retrieval, not human-like understanding. Personalization is often just statefulness. The system feels consistent because it can carry facts forward, not because it has a stable internal model of you.

Claire Vo’s point about a heartbeat without a brain fits here, too. A running assistant can appear attentive even when it is simply good at storing and reusing state over time.

Operationally, memory is not mystical. It is three boring components that work together.

Durable notes and preferences that outlive a single session.
Session history that records what happened and what was decided.
Recall that pulls the right fragments at the moment they matter.

Engineers can think of this as a read-and-write loop around a store. The write path captures decisions and stable preferences.

The read path retrieves relevant items when a new event arrives.

Summarization and compaction emerge as patterns as history grows large. This is similar to how Claude compacts a long conversation. The system compresses what mattered, so the next retrieval step still has signal.

Two examples make this concrete.

First, weekly updates. You tell the assistant that your status update should follow a specific format: three bullets for progress, two for blockers, and a short next week plan.

If that preference is stored durably, the assistant stops asking every time. It can draft the update in the same shape each week, and you only adjust the content.

Second, recurring constraints. You set a rule like do not send emails after 8 pm. If that constraint is written to durable storage, it becomes a guardrail that is applied whenever an email-related task is encountered. The assistant can draft at 9 pm, but schedule the send for the next morning and record that it followed the rule.
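
The recurring-constraint example can be sketched as a write path, a read path, and a pure function that applies the rule. Everything named here is hypothetical: the JSON file layout, the `PreferenceStore` class, the `email_quiet_hours` key, and especially the 08:00 resume time, which the rule as stated does not specify. The sketch also treats 8 pm itself as inside quiet hours.

```python
import json
import tempfile
from datetime import datetime, time, timedelta
from pathlib import Path
from typing import Any, Optional


class PreferenceStore:
    """Durable key-value notes backed by a JSON file (hypothetical layout)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, value: Any) -> None:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data))

    def read(self, key: str) -> Optional[Any]:
        return self._load().get(key)


def next_allowed_send(rule: Optional[dict], now: datetime) -> datetime:
    """Return the earliest time an email may go out under a quiet-hours rule."""
    if rule is None:
        return now  # no stored constraint: send immediately
    quiet_from = time.fromisoformat(rule["quiet_from"])
    resume_at = time.fromisoformat(rule["resume_at"])
    if now.time() >= quiet_from:
        # Evening: hold the email until the next morning.
        return datetime.combine(now.date() + timedelta(days=1), resume_at)
    if now.time() < resume_at:
        # Early morning: hold until the same day's resume time.
        return datetime.combine(now.date(), resume_at)
    return now


with tempfile.TemporaryDirectory() as tmp:
    store = PreferenceStore(Path(tmp) / "prefs.json")
    # Write path: the user states the rule once.
    store.write("email_quiet_hours", {"quiet_from": "20:00", "resume_at": "08:00"})
    # Read path: a later email task looks the rule up instead of asking again.
    rule = store.read("email_quiet_hours")
    drafted_at = datetime(2026, 2, 6, 21, 0)  # draft finished at 9 pm
    send_at = next_allowed_send(rule, drafted_at)
```

The draft still gets written at 9 pm. Only the send is deferred, and the rule survives because it lives in the store rather than in the conversation.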

Goertzel’s “hands for a brain” framing matters here because the hands are only useful when they are guided by stable context and preferences rather than ad hoc guessing.

But there is a tradeoff.

Memory without hygiene can become stale or risky.

Old preferences can outlive their usefulness. Sensitive details can linger longer than intended.

This is why good systems need user control, recency, and a way to inspect and edit what the assistant thinks it knows.

Event-Driven Agent Assistants Win On Clear Tasks And Guardrails

Event-driven agent assistants work best when the job can be specified in a way that a tool can verify. They are less reliable when the job is really a judgment call disguised as a task.

The architecture gives you reach and persistence, but it does not give you governance for free.

A simple rule holds. If you can define the inputs, the action, and the success check, these systems tend to behave well. If you cannot, they tend to behave poorly.

Good at:

Clear operational tasks with observable outputs, such as producing a daily brief, filing a note, or running a scheduled check.
Multi-step workflows where each step has a tool-backed result, like collecting context, drafting, and then saving to a known place or directory.
Repetitive life ops work where preferences stay stable, which is why creator demos focus on calendar, reminders, and admin tasks that recur daily or weekly.

Behind the Craft

How OpenClaw's Creator Uses AI to Run His Life (Full Demo) | Peter Steinberger

Peter Yang

Bad at:

Ambiguous goals where success is subjective, like deciding what you should prioritize this month.
High-stakes actions without a hard verification step, like sending money, deleting data, or making irreversible changes.
Situations where autonomy grows faster than the operator’s ability to inspect what happened and why.

Goertzel’s “hands for a brain” metaphor is a good mental boundary. Strong hands can still do the wrong thing if the instruction is underspecified or if the system lacks a disciplined way to pause and ask for confirmation.

This is where guardrails matter. Increase autonomy in steps. Start with approvals for any command that changes external state. Keep allowlists for routine safe operations. Treat risky actions as review required until the logs are boring.
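
One way to picture those guardrails is a small gate in front of every tool call. This is a sketch under assumptions, not a feature of OpenClaw: the `ToolCall` shape, the `mutates_external_state` flag, and the allowlist contents are invented for illustration. Anything that changes external state, or that is not on the allowlist, waits for explicit approval.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RUN = "run"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class ToolCall:
    name: str
    mutates_external_state: bool


# Hypothetical allowlist of routine operations considered safe to run unattended.
ALLOWLIST = {"read_calendar", "list_files", "draft_email"}


def gate(call: ToolCall, approved: bool = False) -> Decision:
    """Decide whether a tool call may run now or must wait for a human."""
    if call.name in ALLOWLIST and not call.mutates_external_state:
        return Decision.RUN  # routine, read-only, explicitly allowed
    if approved:
        return Decision.RUN  # a human reviewed this specific call
    return Decision.NEEDS_APPROVAL  # default for anything risky or unknown
```

For example, `gate(ToolCall("send_email", True))` returns `Decision.NEEDS_APPROVAL` until it is called with `approved=True`. Autonomy then grows by adding names to the allowlist one at a time, after the logs for that operation have been reviewed.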

Try OpenClaw, but start with low-risk workflows, watch how it behaves, and only then give it more reach.

Growth And Retention In An AI-first World | Takeaways For Founders And Product Leaders

Arsh Shah Dilbagi — Wed, 04 Feb 2026 13:50:24 GMT

TLDR: This blog explains why smart AI features alone don't create lasting user habits or growth. It covers three key insights: excitement doesn't become habit without intentional design; retention depends on how often users naturally need your product, not how smart it is; and products grow when they solve shared problems, not just individual ones. Readers will learn why forcing engagement backfires, why aligning with users' natural workflows matters, and how collaboration drives real stickiness. The main takeaway is simple: AI products succeed by becoming useful to groups, not by being brilliant alone.

Introduction

Founder Intro: Growth & Retention in an AI-First World

One of the most persistent misconceptions in AI right now is that intelligence alone drives growth.

Build something impressive enough, the thinking goes, and users will keep coming back. Retention will take care of itself. Distribution will follow naturally.

In practice, the opposite is happening.

AI products are getting better faster than teams are learning how to retain users, build trust, and compound growth over time. Excitement is abundant. Habit is rare.

Panel 3 was designed to confront that gap directly.

Rather than focusing on models or capabilities, we wanted to examine a harder set of questions:

What actually makes AI products stick?

What drives durable growth once the novelty wears off?

And how do retention dynamics change in an AI-first world?

To explore those questions, we brought together operators who have spent years studying — and living inside — growth systems at scale:

Aaron Cort, Growth & Marketing Partner at Craft Ventures, advising and operating across some of the fastest-growing AI and SaaS companies
Brian Balfour, Founder & CEO at Reforge, who has shaped how an entire generation of operators thinks about growth and retention
Bryce Hunt, Founding GTM at Cognition, working at the frontier of agent-native products and new go-to-market motions
Gaurav Vohra, Advisor and Head of Growth at Superhuman, where precision, trust, and habit are non-negotiable

What emerged was a clear reframing of growth in the AI era.

This panel wasn’t about hacks, channels, or short-term tactics. It was about fundamentals — how problem frequency governs retention, why trust is the real retention loop, how onboarding becomes more critical (not less), and why community and personal brand are increasingly powerful growth multipliers.

Perhaps most importantly, it surfaced a shared conviction:

AI doesn’t change the laws of growth.
It exposes when teams ignore them.

The sections that follow break down the core lessons from this conversation — from why hype fades faster than habit, to why motion choice constrains everything, to why the most defensible layer in many AI companies today is human trust.

If you’re building an AI product and wondering why early excitement isn’t translating into durable usage — or how to design growth systems that actually compound — this panel offers a grounded, experience-driven place to start.

1. Hype Is Easy — Habit Is the Hard Part

One of the most consistent themes across this panel was the widening gap between initial excitement and durable usage in AI products.

AI excels at creating “wow” moments.

New users are impressed by:

Instant results.
Intelligent-sounding outputs.
Dramatic productivity claims.
Novelty-driven breakthroughs.

As Brian Balfour, Founder & CEO of Reforge, pointed out, this has created a dangerous illusion in the market: teams mistake interest for retention.

The Illusion of Early Traction

AI products today are exceptionally good at:

Generating excitement.
Creating impressive first impressions.
Driving short-term spikes in usage.

These early signals feel like momentum. Dashboards light up. Activation looks strong. Engagement graphs climb.

But as multiple speakers emphasized, very few AI products convert that excitement into habit.

Usage drops sharply once novelty fades. Sessions become sporadic. Power users emerge — but the majority quietly churn.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, noted during the panel, this is one of the most common failure patterns he sees across AI companies: strong top-of-funnel interest paired with weak behavioral lock-in.

Habit Is Not a Side Effect of Intelligence

A critical distinction surfaced early in the discussion:

Habit does not emerge automatically from impressive capability.

Habit forms when a product:

Solves a recurring problem.
Delivers consistent value.
Reinforces usage at the right cadence.

AI often accelerates the first interaction — but it does not guarantee the second, third, or tenth.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, framed it, delight gets users to try. Reliability gets them to stay.

The Missing Link: Problem Frequency

Several speakers emphasized that many AI products fail because they misunderstand how often users naturally experience the problem being solved.

If a product:

Solves a weekly problem,
But is designed for daily engagement,
Or pushes frequent prompts to manufacture usage,

it creates friction, not habit.

This mismatch leads to:

Forced engagement.
Notification fatigue.
User resentment.
Eventual churn.

AI doesn’t change the natural frequency of a problem; it only exposes when teams ignore it.

Shallow Engagement Looks Like Growth (Until It Doesn’t)

One of the more subtle warnings from the panel was about engagement theater.

Short sessions, repeated trials, and sporadic experimentation can look like healthy usage in aggregate. But without a clear, repeatable value loop, that engagement is fragile.

As Bryce Hunt, Founding GTM at Cognition, described from the frontier of agent-native products, users will experiment enthusiastically — right up until they don’t trust the system to deliver reliably when it matters.

At that point, usage collapses.

From Hype to Habit Requires Intentional Design

The panel was clear that the transition from hype to habit is not accidental.

It requires:

A deep understanding of the underlying user problem.
Clarity on when and why users should return.
Consistent value delivery, not sporadic brilliance.
Reinforcement at a cadence that matches real behavior.

Without these elements, AI products experience:

Rapid churn.
Novelty decay.
Shallow engagement disguised as growth.

The Core Insight

AI makes it easier than ever to impress users once.

It does not make it easier to earn a place in their daily — or weekly — routine.

As this panel made clear, habit is not created by intelligence alone.

It’s created by relevance, consistency, and trust — delivered over time.

2. Retention Is Governed by the Natural Frequency of the Problem

Early in the panel, a foundational concept surfaced — and then kept resurfacing in different forms:

Retention is constrained by how often users naturally encounter the problem you solve.

No amount of AI sophistication can override that constraint.

You can improve how a problem is solved.

You can reduce friction.

You can increase quality.

But you cannot change:

How frequently the user feels the pain.
How urgent it is when it appears.
Whether it belongs in their daily, weekly, or occasional workflow.

As Brian Balfour, Founder & CEO of Reforge, emphasized, retention mechanics are downstream of reality — not product ambition.

AI Doesn’t Change Problem Frequency — It Reveals It

One of the traps AI companies fall into is assuming that intelligence increases usage frequency.

It doesn’t.

AI can:

Make a task faster.
Make a task easier.
Make a task more impressive.

But if the task only matters once a week, daily usage is artificial.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, noted during the discussion, many AI products feel pressure to justify venture-scale expectations by forcing daily engagement — even when the underlying problem doesn’t support it.

That pressure often leads to bad decisions.

Forced Engagement Backfires

When companies try to:

Force daily usage for a weekly problem.
Manufacture engagement through notifications.
Inflate frequency with alerts, nudges, or reminders.

they don’t create a habit.

They create:

Worse products.
User fatigue.
Eroded trust.
Eventual churn.

Users don’t interpret forced engagement as helpful.

They interpret it as noise.

AI amplifies this effect because the expectations are higher. If a system claims intelligence but interrupts users unnecessarily, the disappointment is sharper.

Criticality Matters as Much as Frequency

The panel also highlighted that frequency alone isn’t enough — criticality matters.

Some problems occur infrequently but are extremely important when they do. Others occur often but are low-stakes.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained, retention emerges when a product aligns with moments that matter. If users don’t feel meaningful relief or leverage when the problem appears, they won’t return — no matter how impressive the solution is.

AI products that misunderstand this often chase engagement metrics instead of solving meaningful pain.

Misaligned Cadence Creates Product Friction

A recurring failure pattern described on the panel looked like this:

The product solves a real problem.
The solution works well.
But the cadence of engagement doesn’t match the user’s life.

Daily prompts for a weekly task.

Constant nudges for occasional workflows.

Persistent reminders for low-urgency problems.

The result isn’t retention — it’s resistance.

As Bryce Hunt, Founding GTM at Cognition, pointed out from the edge of agent-driven products, users quickly disengage when a system feels like it’s working for itself, not for them.

AI Makes Violations More Obvious

One of the sharpest insights from the panel was this:

AI does not change the natural frequency law — it only amplifies violations of it.

Because AI systems are more visible, more interactive, and more assertive, misalignment shows up faster.

Users don’t quietly tolerate friction.

They disengage.

What might have taken months to surface in traditional software becomes obvious in weeks — sometimes days — in AI products.

The Practical Takeaway

Retention doesn’t come from intelligence alone.

It comes from alignment.

Teams that succeed:

Identify the natural cadence of the problem.
Design engagement around that cadence.
Resist the urge to force frequency.
Measure success by consistency, not volume.

In an AI-first world, respecting user reality is the fastest path to durable retention.

3. Growth Comes From Solving Shared Problems, Not Isolated Ones

Another strong theme that emerged from the panel was the importance of multi-user relevance.

Many AI products begin by delivering clear value to an individual user. That’s often the right starting point. It simplifies onboarding, shortens time-to-value, and helps teams validate core utility quickly.

But as the panel made clear, durable growth rarely stops at the individual.

Individual Value Is Necessary — But Not Sufficient

AI products are especially good at creating powerful single-player experiences.

They help users:

Think faster.
Produce better outputs.
Automate personal workflows.
Feel individually empowered.

This often leads to strong early adoption.

But as Brian Balfour, Founder & CEO of Reforge, emphasized, products that remain purely individual struggle to compound. They grow linearly, not exponentially. Each new user must be acquired independently, and retention alone has to carry the entire growth story.

That’s a hard ceiling.

Shared Problems Unlock Compounding Growth

The most durable products discussed on the panel followed a different arc.

They:

Start with individual value: Solving a clear, personal pain point.
Expand into shared contexts: Teams, organizations, or communities.
Embed themselves into collaboration: Where work is coordinated, reviewed, or acted upon together.

This transition unlocks:

Natural network effects.
Lock-in through shared workflows.
Organic distribution via collaboration.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained, once a product becomes part of how people work together, switching costs become emotional and operational — not just technical.

AI Amplifies Collaboration — When Designed For It

AI has the potential to accelerate this transition, but only if products are designed intentionally.

When AI outputs are:

Easily shareable.
Reviewable by others.
Editable collaboratively.
Embedded in team workflows.

They create natural reasons for expansion.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, noted during the panel, many of the strongest AI companies see growth inflection not when the product gets smarter, but when it becomes socially necessary within a team.

Single-Player Products Hit a Wall

The panel was also clear about the risks of staying single-player for too long.

AI products that remain isolated experiences often:

Depend heavily on paid acquisition.
Struggle to create organic loops.
Face high churn when usage is optional.
Fail to embed themselves into daily work.

Even when the individual experience is strong, growth plateaus.

As Bryce Hunt, Founding GTM at Cognition, shared from the perspective of agent-native products, the moment AI systems begin influencing shared outcomes — codebases, decisions, deliverables — adoption dynamics change dramatically. Teams care. Conversations start. Distribution accelerates.

Collaboration Creates Accountability — and Stickiness

Another subtle benefit of shared problems is accountability.

When work is:

Visible to others.
Reviewed collaboratively.
Dependent on multiple stakeholders.

Usage becomes harder to abandon quietly.

Products that live inside shared workflows benefit from:

Social reinforcement.
Collective habit formation.
Stronger norms around usage.

This doesn’t require viral mechanics.

It requires relevance to how people already work together.

The Practical Takeaway

AI products don’t compound by being smarter alone.

They compound by becoming collectively useful.

The most durable growth comes from:

Starting with individual value.
Expanding into shared problems.
Embedding into collaboration.
Letting distribution emerge naturally.

In an AI-first world, growth follows shared utility — not isolated brilliance.

4. AI Raises the Bar for Onboarding — Not Lowers It

One of the more counterintuitive conclusions from the panel was this:

AI does not make products easier to adopt.
It often makes them harder.

Despite early expectations that intelligence would reduce friction, the opposite pattern has emerged in practice — especially once products reach real users.

AI Introduces New Kinds of Friction

Traditional software is predictable.

AI systems are not.

AI products introduce:

Nondeterministic behavior.
Unfamiliar mental models.
Probabilistic outcomes.
Workflows users haven’t seen before.

Even when the product is powerful, users often don’t know:

What to expect.
How to judge success.
When the system is confident.
When they should intervene.

As Brian Balfour, Founder & CEO of Reforge, emphasized, this creates a gap between capability and confidence. And without confidence, users don’t stick.

Self-Serve Onboarding Breaks Earlier Than Teams Expect

A recurring theme across the panel was that self-serve onboarding fails much earlier in AI products than in traditional SaaS.

Many teams assume that:

Users will experiment.
Value will reveal itself.
Intelligence will “sell” the product.

In reality, users often stall immediately.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, noted, AI products place a higher cognitive burden on users. When people don’t understand how to succeed quickly, they disengage — even if the product is technically impressive.

The failure isn’t loud.

It’s silent.

Early Handholding Accelerates Learning

The fastest-learning companies described on the panel didn’t avoid human involvement — they leaned into it.

They:

Onboarded users personally.
Walked them through first successes.
Observed where confusion emerged.
Adjusted workflows based on real behavior.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained, early handholding isn’t a scaling failure — it’s a learning accelerator. It shortens the feedback loop between what teams think users understand and what users actually experience.

Onboarding Is About Education, Not Explanation

A subtle but important distinction emerged around onboarding intent.

Onboarding isn’t about:

Explaining features.
Listing capabilities.
Documenting everything the system can do.

It’s about teaching users how to think with the product.

That means:

Showing what good usage looks like.
Defining boundaries clearly.
Guiding users through successful outcomes.
Correcting misuse early.

As Bryce Hunt, Founding GTM at Cognition, pointed out from the frontier of agent-based products, onboarding is often the moment where trust is either established or permanently lost.

Learning Speed Beats Go-To-Market Speed

Perhaps the most important reframe from this part of the discussion was this:

The fastest-growing AI companies prioritize learning speed over go-to-market speed.

They don’t rush to scale acquisition before:

Understanding user confusion.
Clarifying workflows.
Stabilizing outcomes.

They accept slower early growth in exchange for:

Stronger retention.
Clearer value propositions.
More predictable expansion later.

In AI, onboarding is not a cost center.

It’s where product truth is discovered.

The Practical Takeaway

AI raises expectations — and uncertainty — at the same time.

That makes onboarding more important, not less.

Teams that succeed:

Invest heavily in early education.
Embrace guided experiences.
Treat onboarding as a product system.
Learn from confusion instead of ignoring it.

In an AI-first world, great onboarding isn’t about removing friction — it’s about removing uncertainty.

5. Product-Led ≠ Hands-Off

One of the clearest misconceptions surfaced on the panel was around what product-led actually means in an AI-first world.

Too often, product-led growth is interpreted as:

Zero human involvement.
Fully self-serve from day one.
No guidance or intervention.
No opinionated direction.

The panel was unequivocal: this interpretation breaks down quickly in AI products.

Product-Led Is About Where Value Is Created — Not Who’s Involved

At its core, product-led growth means that the product is the primary driver of value realization.

It does not mean:

Users are left alone to figure things out.
Teams remove themselves from the learning loop.
Human touch is a failure mode.

As Brian Balfour, Founder & CEO of Reforge, emphasized, product-led growth (PLG) is about value delivery, not absence of people. Confusing the two leads teams to optimize for scale before they’ve learned what actually works.

AI Products Need Human Scaffolding Early

AI introduces uncertainty in ways traditional software does not.

Users often:

Don’t know what’s possible.
Don’t know how to judge outputs.
Don’t know when they’re using the product “correctly.”

In this context, early human involvement is not optional.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, explained, the most effective AI companies use concierge onboarding early — not to sell, but to observe. Watching how users struggle, succeed, and misunderstand the product surfaces insights that no dashboard ever will.

Human Feedback Accelerates Product Discovery

Several speakers described how early human touch dramatically shortened product discovery cycles.

By staying close to users, teams were able to:

Identify confusing workflows quickly.
Understand which outputs actually mattered.
Separate novelty from real value.
Refine positioning before scaling acquisition.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, noted, this human feedback loop is often the difference between a product that feels impressive and one that earns trust.

Learning Cycles Matter More Than Scale at the Start

A recurring warning from the panel was about premature scaling.

AI products that rush to:

Remove human touch,
Automate everything, and
Maximize self-serve acquisition,

often do so before they’ve stabilized value delivery.

As Bryce Hunt, Founding GTM at Cognition, shared from the frontier of agent-based products, early scale amplifies misunderstanding just as fast as it amplifies success. If users are confused at small scale, they’ll be lost at large scale.

The Strategic Use of Human Touch

The panel offered a more nuanced model for PLG in AI:

Use human involvement intentionally early
- To teach.
- To observe.
- To learn.
Identify repeatable patterns of value
- Where users succeed without help.
- Where workflows stabilize.
- Where trust is earned.
Replace human touch deliberately
- With product affordances.
- With opinionated flows.
- With automation that reflects real usage.

The goal is not to avoid human touch — it’s to earn the right to remove it.

The Practical Takeaway

In AI products, product-led does not mean hands-off.

It means:

The product leads value creation.
Humans accelerate learning.
Automation follows understanding.

Teams that treat PLG as an excuse to disengage learn slowly.

Teams that treat PLG as a system — with humans embedded early — learn fast.

In an AI-first world, strategic human involvement is not a growth liability. It’s a competitive advantage.

6. Motion Choice Is a Strategic Constraint, Not a Tactic

One of the most direct — and least hedged — messages from the panel was about go-to-market motion.

Being “in the middle” is the worst place to be.

This wasn’t framed as a tactical mistake.

It was framed as a structural one.

GTM Motion Shapes Everything That Follows

The panel was clear that GTM motion is not something you “optimize later.”

It determines:

How products are built.
How onboarding works.
How trust is earned.
How quickly deals close.
How economics scale.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, emphasized, motion choice constrains what’s possible long before it shows up in metrics. Teams that delay this decision often find themselves stuck with a product that doesn’t cleanly support any motion well.

Why the Middle Collapses

Several speakers described the same failure pattern:

The product isn’t self-serve enough to convert quickly.
It isn’t enterprise-ready enough to close confidently.
Sales cycles stretch.
Security reviews stall.
Legal friction increases.
Economics break.

This “hybrid by default” approach sounds flexible. In practice, it creates friction everywhere.

As Brian Balfour, Founder & CEO of Reforge, noted, ambiguity in motion leads to ambiguity in execution. Teams don’t know whether to optimize for conversion speed or deal size — and end up doing neither effectively.

Pure Sales-Led Is Fragile in AI

The panel was equally candid about the limits of traditional sales-led motions in AI.

Pure sales-led AI companies often struggle because:

Products evolve too quickly for long sales cycles.
Value is hard to fully demonstrate upfront.
Buyers want proof through usage, not promises.
Model behavior can’t be perfectly specified in contracts.

This doesn’t make sales irrelevant — but it makes sales-first strategies fragile, especially early.

Hybrid Motions Hit Real-World Friction

Hybrid motions — product-led entry with early sales involvement — sound attractive in theory.

In practice, the panel noted that they often collapse under:

Security reviews.
Legal scrutiny.
IT procurement processes.
Unclear ownership.

Without a clear product-led wedge or a true enterprise motion, teams get stuck negotiating before value is experienced.

The Two Motions That Actually Work

Across the discussion, the panel converged on two viable extremes:

1. Product-Led (Sales Layered Later)

Clear self-serve value.
Fast time-to-first-success.
Minimal friction to try.
Sales introduced after usage and trust are established.

2. Forward-Deployed Engineering

Deep customer involvement.
Hands-on implementation.
High-touch workflows.
Clear value before scale.

As Bryce Hunt, Founding GTM at Cognition, explained from the frontier of agent-native products, forward-deployed work isn’t a fallback — it’s often the fastest way to learn when problems are complex and trust is critical.

Ambiguity Is the Real Enemy

What failed consistently were companies that tried to keep all options open.

Ambiguous motion leads to:

Slow deals.
Broken economics.
Unclear product priorities.
Stalled growth.

Teams hesitate. Buyers hesitate. Momentum dies quietly.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, put it earlier in the panel, clarity — even when it limits options — is what enables speed.

The Practical Takeaway

GTM motion is not a growth hack. It’s a strategic constraint.

The companies that win:

Choose a clear motion early.
Design the product around it.
Accept the tradeoffs.
Execute decisively.

In an AI-first world, clarity beats flexibility.

Choosing the right motion doesn’t guarantee success — but avoiding the decision almost guarantees failure.

7. Trust Is the New Retention Loop

Across multiple threads of the conversation, one idea kept surfacing in different forms:

Trust is the real retention mechanism in AI products.

Not novelty.

Not intelligence.

Not even habit on its own.

Users return when they trust the system.

Trust Is Built on Predictability, Not Perfection

The panel was clear that users don’t expect AI systems to be perfect.

They expect them to be understandable.

Users return when:

Outputs are predictable.
Behavior is consistent.
Failure modes make sense.
The system feels aligned with their intent.

As Brian Balfour, Founder & CEO of Reforge, emphasized, predictability is what allows users to form mental models. Without a mental model, there is no habit — only hesitation.

Randomness Destroys Confidence Faster Than Errors

Several speakers noted that randomness is more damaging than being wrong.

AI systems lose users when:

Results feel inconsistent.
Success feels accidental.
Similar inputs produce wildly different outcomes.
Behavior changes without explanation.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, explained, users can forgive known limitations. What they can’t tolerate is uncertainty about whether the product will work this time.

In AI products, confusion doesn’t just slow adoption — it actively repels it.

Opaque Systems Feel Unaligned

A recurring theme was alignment.

Users trust systems that feel like they’re:

Working with them.
Respecting their intent.
Operating within understood boundaries.

When behavior is opaque, users assume misalignment — even if none exists.

As Bryce Hunt, Founding GTM at Cognition, described from the frontier of agent-based systems, trust collapses quickly when users don’t understand why the system acted the way it did. At that point, even good outcomes feel suspect.

Failure Modes Matter More Than Success Cases

One subtle but important insight from the panel was that users judge AI products by how they fail, not how they succeed.

When failure modes are:

Explainable,
Constrained, and
Recoverable,

trust grows.

When failures are:

Surprising,
Silent, and
Inconsistent,

users disengage.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, pointed out, trust isn’t built by eliminating failure — it’s built by making failure legible.

Trust Compounds Over Time

The panel repeatedly emphasized that trust behaves like a compounding asset.

Each predictable interaction:

Reinforces confidence.
Lowers cognitive load.
Increases willingness to rely on the system.

Over time, trust becomes the reason users return — even when alternatives exist.

Conversely, confusion compounds just as quickly.

Each unclear outcome:

Introduces doubt.
Raises friction.
Shortens patience.

Churn doesn’t usually happen after one bad experience.

It happens after several confusing ones.

The Core Retention Loop in AI

The panel implicitly described a new retention loop for AI products:

Predictability → Trust → Reuse → Deeper Reliance

Break that loop anywhere, and retention collapses.

As one speaker summarized succinctly:

Trust compounds.
Confusion churns.

The Practical Takeaway

In AI products, retention is not driven by how impressive the system is.

It’s driven by how safe it feels to rely on.

Teams that win:

Prioritize predictable behavior.
Surface boundaries clearly.
Design for understandable failure.
Align outputs with user intent.

In an AI-first world, trust isn’t a brand attribute — it’s a product property.

8. Onboarding Must Be Opinionated, Interruptive, and Interactive

One of the sharpest insights from the panel was that great AI onboarding behaves more like a game than a tutorial.

It doesn’t politely explain everything and hope users figure it out.

It actively guides behavior.

Neutral Onboarding Is a Silent Failure Mode

Many AI products default to neutral onboarding:

Feature tours.
Passive documentation.
Optional walkthroughs.
“Explore on your own” prompts.

The panel was blunt about the outcome: users fail silently.

As Brian Balfour, Founder & CEO of Reforge, noted, neutral onboarding shifts responsibility onto users at the exact moment they are least equipped to succeed. When users don’t know what “good usage” looks like, they hesitate — and hesitation kills momentum.

Opinionation Reduces Anxiety

Effective AI onboarding tells users exactly what to do.

It:

Prescribes the first action.
Narrows choices intentionally.
Removes ambiguity.
Defines success clearly.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, emphasized, opinionation reduces cognitive load. When users don’t have to decide how to start, they’re more likely to start at all.

In AI products especially, clarity feels like competence.

Interruption Is a Feature, Not a Bug

The panel also reframed interruption as a positive design choice.

Great onboarding:

Interrupts users at the right moments.
Stops them before misconfiguration.
Corrects behavior early.
Enforces setup steps.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained, early interruption prevents downstream confusion. Fixing misuse later is far more expensive — and often impossible once trust is lost.

Interrupting early is an act of respect.

Interaction Beats Explanation

Another recurring theme was that users don’t learn AI products by reading about them.

They learn by:

Doing.
Seeing outcomes.
Correcting mistakes.
Receiving immediate feedback.

As Bryce Hunt, Founding GTM at Cognition, shared from agent-native systems, onboarding that rewards interaction — not passive consumption — accelerates understanding dramatically.

A successful first experience is worth more than a complete explanation.

Forcing Correct Setup Early Pays Dividends

Several speakers emphasized the importance of enforcing correct setup early — even if it feels restrictive.

Opinionated onboarding:

Blocks users from skipping critical steps.
Validates inputs.
Ensures prerequisites are met.
Prevents false negatives.

This reduces:

Misuse.
Early failure.
Frustration blamed on the product.

As the panel made clear, letting users “explore freely” often leads to bad conclusions about the product’s value.

Why Neutrality Fails in AI

Neutral onboarding assumes users:

Know what they want.
Understand system boundaries.
Can evaluate outputs accurately.

In AI products, these assumptions are almost always wrong.

Neutrality pushes responsibility onto users — and users fail silently.

Opinionation keeps responsibility where it belongs: with the product.

The Practical Takeaway

In AI onboarding:

Politeness is overrated.
Clarity is everything.
Guidance beats freedom early.

The best onboarding:

Tells users what to do.
Interrupts them when necessary.
Rewards correct interaction.
Teaches success through action.

In an AI-first world, great onboarding doesn’t wait for users to understand — it actively teaches them how to win.

9. Moats Are Shifting — Stacking Matters More Than Strength

One of the clearest rejections from the panel was the idea that AI companies can rely on a single, permanent moat.

That framing no longer holds.

Instead, the panel converged on a more nuanced — and more practical — view:

AI moats are time-bound.
They strengthen and weaken at different phases.
Durability comes from stacking and sequencing them.

The Myth of the Singular AI Moat

Early AI discourse often revolves around finding the moat:

Proprietary models.
Unique data.
Technical sophistication.
Speed of execution.

The panel was direct: no single advantage remains dominant for long.

As Brian Balfour, Founder & CEO of Reforge, noted, AI compresses competitive cycles. What feels defensible today often becomes baseline tomorrow. Teams that bet everything on one advantage eventually find themselves exposed.

Different Moats Peak at Different Times

Rather than dismissing moats entirely, the panel reframed them as phase-dependent.

Examples discussed included:

Data moats
- Extremely strong once established.
- Slow to build.
- Often unusable early.
- Most powerful after scale and repetition.
Brand moats
- Can accelerate trust and adoption.
- Fragile if product quality lags.
- Difficult to repair once broken.
Distribution windows
- Temporary but decisive.
- Often tied to timing, channels, or platforms.
- Missed windows rarely reopen.
Speed
- No longer a differentiator.
- Table stakes in AI.
- Necessary but insufficient.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, emphasized, many AI companies fail not because they lack moats — but because they rely on the wrong one at the wrong time.

Stacking Creates Durability

The companies that endure don’t search for a silver bullet.

They stack advantages:

Speed early.
Distribution when available.
Brand as trust compounds.
Data as usage accumulates.

Each moat reinforces the others.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained, durability comes from overlap. When one advantage weakens, others compensate. This redundancy is what allows companies to survive competitive shocks.

Sequencing Matters as Much as Strength

Another subtle but important insight was that moats must be sequenced intentionally.

Building a data moat before you have distribution is pointless.

Pushing brand before reliability backfires.

Optimizing for speed without retention burns credibility.

As Bryce Hunt, Founding GTM at Cognition, shared, many AI startups mistake early momentum for defensibility — only to realize later that nothing was reinforcing it.

Momentum without structure decays quickly.

The Competitive Reality of AI

AI lowers the cost of imitation.

Features are copied faster.

Capabilities converge.

Execution gaps narrow.

In that environment, durability doesn’t come from being the strongest in one dimension.

It comes from being good enough across many — at the right times.

The Practical Takeaway

There is no permanent AI moat.

There are:

Temporary advantages.
Shifting strengths.
Strategic windows.
Compounding combinations.

The companies that win don’t chase the perfect moat.

They build a system of advantages that evolve as the market evolves.

In an AI-first world, stacking beats strength — and sequencing beats brilliance.

10. Brand Is Becoming Personal Again

One of the most striking themes to emerge near the end of the panel was a shift that’s easy to underestimate:

Brand is becoming personal again.

Not nostalgic.

Not performative.

Personal in a way that materially affects growth, trust, and retention.

Logos Don’t Carry Trust the Way They Used To

The panel noted that the environment around buyers and users has fundamentally changed.

Today:

Search is fragmented.
Feeds are noisy.
Information is overwhelming.
AI-generated content is everywhere.

In that world, traditional brand signals — logos, taglines, even company-level messaging — carry less weight than they used to.

Users don’t trust abstractions.

They trust people.

Trust Attaches to Opinionated Individuals

Across multiple threads, speakers pointed to the same pattern:

Users increasingly trust:

Individuals with clear points of view.
Builders who explain how they think.
Leaders who show up consistently over time.
People willing to be specific, not neutral.

As Brian Balfour, Founder & CEO of Reforge, noted, trust now accrues to those who reduce ambiguity. In a world of infinite answers, conviction becomes a signal.

Founder-Led Brand as a Growth Channel

The panel reframed founder-led (or leader-led) brand not as marketing — but as infrastructure.

When done well, personal brand becomes:

A distribution channel.
A trust shortcut.
A wedge into new audiences.
A retention lever for existing users.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, explained, many of the strongest AI companies today see disproportionate leverage from founders and leaders who actively articulate the product’s philosophy in public.

People don’t just buy the product — they buy the worldview.

Explanation Is the New Differentiator

AI products often struggle because users don’t understand why they work.

Founder-led brand helps close that gap.

When leaders:

Explain tradeoffs,
Share decisions,
Talk openly about constraints, and
Narrate progress and failure,

they make the product legible.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, pointed out earlier in the panel, explanation builds trust faster than polish. Users forgive imperfection when they understand intent.

Consistency Beats Virality

The panel was careful to separate personal brand from social performance.

This isn’t about:

Going viral.
Hot takes.
Constant posting.

It’s about:

Consistency.
Clarity.
Coherence over time.

As Bryce Hunt, Founding GTM at Cognition, noted, users don’t need constant visibility — they need repeated signals of alignment. Over time, that consistency compounds into trust.

The Most Defensible Layer Available

Perhaps the most important reframe from this part of the discussion was this:

In an AI world where:

Features are copied.
Capabilities converge.
Moats shift quickly.

human trust compounds slowly — and decays slowly.

For many AI companies, especially early, a founder- or leader-led brand may be the most defensible layer available.

Not because it’s impossible to copy — but because it’s impossible to fake sustainably.

The Practical Takeaway

Brand is no longer just how a company looks.

It’s:

How clearly its leaders think.
How openly they explain.
How consistently they show up.

In an AI-first world flooded with answers, people follow judgment.

And judgment, increasingly, wears a human face.

11. Community Is a Growth Multiplier, Not a Feature

As the panel wrapped, one final idea brought many of the earlier themes together:

Community is not a feature.

It’s a growth multiplier.

And like any multiplier, it only works when the underlying system is sound.

Community Is Not a Container

The panel was explicit about what community is not.

It is not:

A Slack group.
A Discord server.
A forum.
A channel you “launch.”

Those are containers.

Community is what happens inside them — if anything happens at all.

Too many AI companies mistake presence for participation and confuse access with value.

Real Community Is Shared Learning

What actually worked, according to the panel, was community built around learning.

The strongest communities shared:

How people were using the product.
What worked and what didn’t.
Failure modes and recovery patterns.
Evolving best practices.

As Brian Balfour, Founder & CEO of Reforge, noted, learning compounds when users can see each other thinking. In AI products especially, this shared sensemaking reduces fear and accelerates confidence.

Identity Drives Contribution

Another recurring insight was that community only works when contribution is rewarded.

Healthy communities give members:

Status through insight.
Recognition through contribution.
Identity through participation.

As Aaron Cort, Growth & Marketing Partner at Craft Ventures, emphasized, community isn’t about broadcasting updates — it’s about creating a place where users feel ownership over collective progress.

When contribution is visible, learning accelerates.

The Product Must Reinforce Belonging

The panel also stressed that community cannot live outside the product.

The most effective AI communities were reinforced by:

Product language.
Shared workflows.
Common artifacts.
Visible usage patterns.

As Bryce Hunt, Founding GTM at Cognition, shared, when users see themselves reflected in how a product is built — not just how it’s marketed — community becomes self-sustaining.

Belonging doesn’t come from access.

It comes from relevance.

Why Community Matters More in AI

AI products introduce uncertainty by default.

Users often ask:

“Am I using this correctly?”
“Is this result trustworthy?”
“Is everyone else confused, too?”

Community normalizes that uncertainty.

As Gaurav Vohra, Advisor and Head of Growth at Superhuman, explained earlier, seeing others wrestle with the same questions reduces anxiety and builds confidence faster than documentation ever could.

Community turns uncertainty into momentum.

Organic Growth Emerges From Shared Progress

When done well, community quietly powers growth.

It:

Spreads best practices.
Accelerates onboarding.
Reinforces habit.
Drives organic distribution.

Users don’t just adopt the product — they advocate for it, teach it, and evolve with it.

That’s not a feature. That’s leverage.

The Final Takeaway

Community doesn’t create growth by itself.

But when paired with:

Trust,
Clarity,
Shared learning, and
Visible contribution,

it multiplies everything else.

In AI products, where understanding is as important as capability, community becomes the fastest way to scale trust.

Not by telling users what to do — but by letting them learn together.

When Everyone Can Build: Redesigning Product Work for the AI Era in 2026

Nilesh Barla — Sat, 31 Jan 2026 00:55:13 GMT

TLDR: AI tools are collapsing traditional role boundaries—PMs build dashboards, engineers write copy, designers produce specs. This creates a "Mexican standoff" where old lanes no longer match daily work. The real risk isn't job loss; it's chaos from faster output without coherence. This post shows how to redesign roles around decision rights instead of job titles. You’ll learn a practical framework (Doer/Decider/Reviewer), four collaboration artifacts that prevent drift, and what PMs specifically should become. Read this if your team is shipping faster but feels misaligned, or if you're unsure how AI changes product management fundamentally.

The Collapse Is Real, And It’s Not About Job Loss

Picture a normal week. A PM uses Claude Code to produce a working internal dashboard that had been sitting in an engineering backlog. A designer ships a prototype that already includes the awkward states users hit in production. An engineer writes three copy variants while adjusting the UI component.

Marc Andreessen describes this friction as a “Mexican standoff” among PMs, designers, and engineers—the old lanes no longer align with the actual work.

This is boundary collapse. Execution capability spreads across roles because AI tooling makes credible artifacts cheap to produce. Anyone can now build things that used to require specialized skills. The constraint shifts from who can make something to who can decide what should exist.

Which is why the term “job loss” misses what’s happening week to week. The near-term change is “task reshuffling.” Some tasks disappear, many get faster, and most get rebundled into new workflows and expectations. Work reorganizes around different constraints: decision rights, coherence, and accountability.

Source: Lenny's Newsletter, “Marc Andreessen: The real AI boom hasn’t even started yet” (Lenny Rachitsky).

More output arrives first. Misalignment follows quietly.

This post shows how to redesign roles around outcomes and decision rights.

What Exactly Is Collapsing

This section exists to make the blur legible. Without a simple map, teams argue about identity and titles. A clear map keeps the conversation on work, ownership, and outcomes. Here are the three areas where boundaries are collapsing.

PM and Engineer

Drafting specs that include edge cases and acceptance criteria so engineers can execute with less back-and-forth.
Producing a clickable demo or internal proof that narrows the scope before a build starts.
Turning raw customer feedback into a structured backlog that encodes tradeoffs and sequencing.

The PM now produces technical artifacts. The engineer now shapes product scope.

Engineer and Designer

Writing multiple copy directions and microcopy variants while iterating on UI behavior.
Exploring interaction options and edge states fast enough that design intent and feasibility converge earlier.

The engineer now makes content decisions. The designer now navigates technical constraints.

Designer and PM

Synthesizing research notes into themes, risks, and decision-ready narratives.
Writing onboarding language and positioning that stays consistent with the product’s mental model.

The designer now structures strategic inputs. The PM now produces user-facing language.

The Hidden Failure Mode: More Throughput, Less Coherence

Here is how it breaks. Your onboarding flow promises "get set up in 5 minutes." Your pricing page emphasizes "enterprise-grade control." Your settings screen adopts a new toggle pattern in one section while keeping old dropdowns in another. Each decision was reasonable. But the composite is confusing.

Output rises immediately. Coherence degrades quietly. Coherence is consistency across:

UX patterns and interaction language,
Positioning and copy,
Metrics definitions and measurement,
Decision logic and constraints.

When more people can produce product artifacts quickly, local changes accumulate. You get onboarding that promises one thing while pricing implies another. You get settings that adopt new patterns in one area and old patterns in the rest. None of these are “bugs.” They are coordination debt.

Teresa Torres has been explicit about a related trap: delivery pressure can crowd out discovery discipline, and faster building can pull teams toward what is easy to produce rather than what is valuable to learn.

You can also predict the coordination tax with old software lessons: as contributors increase, communication paths multiply, and misalignment becomes a first-order cost.
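
One common way to make that lesson concrete is the pairwise-channel count often associated with Brooks's The Mythical Man-Month: n contributors have n(n-1)/2 possible one-to-one communication paths. The sketch below is illustrative only. The function name is hypothetical, and real coordination cost depends on far more than headcount.

```python
def communication_paths(contributors: int) -> int:
    """Count distinct one-to-one communication paths among contributors.

    Uses the pairwise formula n * (n - 1) / 2. This is a rough proxy for
    coordination load, not a measurement of it.
    """
    if contributors < 0:
        raise ValueError("contributors must be non-negative")
    return contributors * (contributors - 1) // 2


if __name__ == "__main__":
    for n in (3, 6, 12):
        print(n, "contributors ->", communication_paths(n), "paths")
```

Under this simplification, going from 6 to 12 people who can ship product artifacts takes the number of pairwise paths from 15 to 66. Headcount doubled; the paths more than quadrupled.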

So the scarce resource shifts. Execution gets cheaper. Decision clarity and product coherence become the limiting factors.

If coherence is not owned, the product becomes a collage.

Redesign The Contract: Outcomes, Interfaces, Accountability

AI tools let more people create product artifacts—specs, prototypes, demos, copy. This only works if your org has clear rules for outcomes, decision rights, and accountability. Without that clarity, your product becomes whatever the last person to touch it decided it should be.

Let’s make this concrete first.

Say your outcome is this: activation for new workspace admins moves from 32% to 42% in 8 weeks.

Your constraints:

You cannot change core navigation.
You cannot add more than 2 new backend calls.
You must keep the existing brand voice.

Decision rights:

PM decides scope cuts if the timeline slips.
Designer decides which of the three onboarding approaches to emphasize.
Engineer decides whether to instrument activation via backend events or client-side tracking.

Artifacts you’ll produce:

Decision log entry explaining why you chose tutorial overlay over empty states.
System map showing the three critical activation moments where users drop off.
Tradeoff table showing you prioritized speed over visual polish.
Launch narrative positioning this as “faster time to value.”

That’s the contract in action. Here’s how to structure it for any outcome.

Start with outcomes. Outcomes are measurable, scoped, and tied to a segment. Keep them dual-sided with a user signal and a business signal.

Outcome examples:

Activation for new workspace admins moves from 32% to 42% in 8 weeks. User: Faster value realization | Business: Higher retention cohort.
Trial to paid conversion for the SMB persona improves by 3 points this quarter. User: Clearer value prop | Business: Revenue growth.
Support tickets tagged “confusing pricing” drop by 25% in 6 weeks. User: Better understanding | Business: Lower support costs.

Define interfaces next. Interfaces describe artifacts, not activities. This prevents “I’m doing discovery work” from becoming a catch-all for undefined contributions.

Interface by function:

PM produces a decision narrative, sequencing, and a tradeoff record.
Design produces an interaction spec, principles for consistency, and edge state intent.
Engineering produces constraints, system invariants, and feasibility boundaries.

These are the artifacts each role is uniquely positioned to create. When contribution happens outside these lanes, decision rights determine who approves the work.

Then set accountability. Each outcome needs one DRI (Directly Responsible Individual). Contributors can be many. Reviewers should be explicit.

The DRI model works because it separates decision authority from contribution. Many people can produce artifacts. Many people can provide input. But only one person decides whether the outcome is achieved and makes the final call when trade-offs conflict.

Use a simple contract template:

Outcome
[Metric, segment, timeframe]

Constraints
[UX invariants, technical invariants, brand and compliance constraints]

Decision rights
[Who decides scope, who decides UX tradeoffs, who decides system tradeoffs]

Artifacts
[Decision log entry, system map update, tradeoff table row, launch narrative]

Everyone can build; not everyone can decide.

Copyable contract statement

We will optimize for [OUTCOME: metric, segment, timeframe], within [CONSTRAINTS: UX/tech/brand limits]. The DRI [NAME] owns the decision. Contributors [NAMES] produce [ARTIFACTS: list]. Reviewers [NAMES] validate coherence and constraints before the product surface moves.
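
If you want the contract to live next to the work rather than in a slide, one option is to keep it as structured data and generate the statement from it. The sketch below is a minimal illustration, not a prescribed tool: the `OutcomeContract` class, its field names, and the validation rules are all hypothetical, and the role names stand in for real people.

```python
from dataclasses import dataclass


@dataclass
class OutcomeContract:
    """Hypothetical container for the contract template fields."""
    outcome: str              # metric, segment, timeframe
    constraints: list[str]    # UX, technical, brand limits
    dri: str                  # the single Directly Responsible Individual
    contributors: list[str]
    artifacts: list[str]
    reviewers: list[str]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the contract is complete."""
        problems = []
        if not self.outcome.strip():
            problems.append("outcome is required")
        if not self.dri.strip():
            problems.append("exactly one DRI is required")
        if not self.artifacts:
            problems.append("at least one artifact is required")
        if not self.reviewers:
            problems.append("reviewers must be explicit")
        return problems

    def statement(self) -> str:
        """Render the copyable contract statement from the fields."""
        return (
            f"We will optimize for {self.outcome}, "
            f"within {'; '.join(self.constraints)}. "
            f"The DRI {self.dri} owns the decision. "
            f"Contributors {', '.join(self.contributors)} "
            f"produce {', '.join(self.artifacts)}. "
            f"Reviewers {', '.join(self.reviewers)} validate coherence and "
            "constraints before the product surface moves."
        )


if __name__ == "__main__":
    contract = OutcomeContract(
        outcome="activation for new workspace admins from 32% to 42% in 8 weeks",
        constraints=["no core navigation changes", "at most 2 new backend calls"],
        dri="PM",  # placeholder role name
        contributors=["Designer", "Engineer"],
        artifacts=["decision log entry", "system map update"],
        reviewers=["Design lead"],
    )
    print(contract.validate() or contract.statement())
```

The useful part is the validation, not the formatting: a contract with no DRI or no named reviewers fails before anyone starts building.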

This contract only holds if it is enforced in writing. Verbal agreements decay the moment priorities shift or new people join. The artifacts from the next section make the contract real.

What PMs Should Become In This New World

The boundary collapse forces a career clarification. Execution is no longer the moat. The moat is judgment that other people can see and follow.

The PM job does not disappear. It gets narrower in definition and harder in standards. Here is what that means in practice.

PM as Coherence Architect

You own sequencing, narrative, and tradeoffs. This means you make parallel work feel like one release. When engineering ships three features in parallel, you write the launch narrative that ties them together. When design proposes a new interaction pattern, you check whether it conflicts with existing patterns. When someone updates copy, you verify it matches the product’s voice and mental model. You are the person who says “that does not belong here” and can explain why.

Example: Engineering ships a dashboard feature, a new API endpoint, and a billing update in the same sprint. Separately, they look like infrastructure work. Your job is to write the release narrative that positions them as “Enterprise-ready workspace controls” so customers see one coherent capability, not three unrelated updates.

PM as Systems Thinker

You understand constraints well enough to make realistic decisions. You should be able to read a system map and spot where tight coupling will create drag. You should be able to ask “what breaks if we add this?” and understand the answer. You should know the difference between a database constraint and a business rule well enough to know which one can flex under pressure.

Example: Design proposes a “duplicate workspace” feature. You check the system map and notice workspace creation is tied to billing events, which means duplication would trigger unexpected charges. You spot this before engineering starts building.

PM as Leverage Designer

You design workflows that scale decision-making across teams. You create decision templates that five teams can reuse without asking for clarification. You build artifact formats that capture why you decided X without requiring synchronous meetings. You set up review rituals that catch drift before users experience it.

The goal is not to make every decision yourself. The goal is to make good decisions repeatable and bad decisions impossible.

Example: You create a feature proposal template that includes outcome metric, constraints, three alternatives considered, and decision criteria. Now, when anyone proposes a feature, the conversation starts from tradeoffs rather than lobbying.

Capability Self-Check

Here is how to assess whether you are operating at this level:

I can write a one-paragraph tradeoff explanation that engineering, design, and execs accept as complete.
I can look at a system map and identify where coupling will create future drag.
I can design a decision artifact that five contributors can use without clarification.

The PM job is not disappearing. When execution is cheap, judgment becomes valuable—but only when it is visible, structured, and repeatable.

That is the new PM superpower: making decisions legible, coherence enforceable, and good judgment scalable across teams that move faster than org charts can keep up with.

Building AI Products, Not Prototypes | Takeaways For Founders and Product Leaders

Arsh Shah Dilbagi — Wed, 28 Jan 2026 14:02:44 GMT

TLDR: This blog explains how to turn AI demos into durable products by choosing opinionated workflows, controlling the environment, designing for user understanding, and planning for maintenance. It covers data reality, dual-system architecture, evals, framework tradeoffs, and task decomposition—helping teams ship more reliable, debuggable, scalable AI features.

Introduction

Founder Intro: Building AI Products, Not Prototypes

One of the motivations behind Adaline Applied was simple: there’s a growing gap between what AI can demo and what AI can actually sustain in the real world.

Every week, we see impressive prototypes. Agents that look magical. Systems that feel powerful in isolation. And yet, when you talk to founders and operators trying to ship these systems into production, you hear a very different story — one defined by edge cases, trust issues, brittle workflows, and products that stall after their first moment of excitement.

Panel 2 was designed to sit directly in that tension.

Rather than asking what’s possible, we wanted to ask a harder question: What does it actually take to turn an AI prototype into a real product?

To explore that, we brought together builders operating at very different layers of the stack:

Aiden Bai, Co-founder & CEO at Same, building AI-native products with speed and opinionation from day one
Josh Payne, Founder & CEO at Coframe, translating AI capability into measurable business outcomes
Linus Lee, Engineer, AI at Thrive Capital, thinking deeply about interfaces, cognition, and long-term product truth
Matthew Rastovac, Director of AgentForce at Salesforce, shipping AI systems inside large, high-trust enterprise environments

What emerged was not a checklist or a framework — but a shared set of hard-earned lessons.

Again and again, the conversation returned to the same idea:

Most AI failures aren’t caused by weak models.
They’re caused by weak product decisions.

The sections that follow unpack what that really means in practice — why generality creates fragility, why user understanding matters as much as accuracy, why maintenance dominates prototyping, and why the hardest problems are still hard.

This write-up isn’t meant to prescribe a single way to build AI products. It’s meant to surface the patterns that consistently separate demos from durable systems.

If you’re building with AI today — or planning to — my hope is that this panel helps you make better decisions about what to build, how to build it, and when not to ship yet.

1. Generality Is Expensive — Opinionated Workflows Win

One of the strongest points of alignment across the panel was that generality is not a free abstraction. It has real, compounding cost—and that cost shows up fastest once a product leaves the demo environment.

Early on, many AI teams are drawn to building systems that are:

Highly flexible.
Broadly applicable.
Capable of handling many use cases.
Impressive in demos.

This instinct makes sense. General systems feel powerful. They look future-proof. They suggest unlimited upside.

But as multiple speakers emphasized, that flexibility quickly becomes a liability once real users are involved.

When Systems Are Too General, the Model Becomes the Product Designer

As Aiden Bai pointed out, overly general systems force the model to make decisions the product team hasn’t made. The model must infer:

What the user actually wants.
Which constraints matter.
How to sequence actions.
What “correct” looks like.

At the same time, the user is left guessing how to use the product successfully. The result isn’t intelligence—it’s ambiguity.

General systems push cognitive load onto both sides:

The model gets too many degrees of freedom.
The user gets too little guidance.

Neither wins consistently.

In Production, Flexibility Turns Into Fragility

This tradeoff becomes even more pronounced at scale.

Matthew Rastovac, speaking from the perspective of shipping agent systems inside Salesforce, described how generality breaks down quickly in enterprise environments. The more freedom an agent has, the harder it becomes to guarantee predictable behavior—and predictability is non-negotiable when trust is on the line.

Even when a system is technically capable, inconsistent behavior erodes confidence fast. In enterprise settings, users don’t tolerate surprises—especially from software that claims intelligence.

Generality Also Hurts Monetization

Josh Payne highlighted a parallel failure mode from the commercial side.

At Coframe, systems designed to be flexible across many customer use cases became:

Harder to explain.
Harder to position.
Harder to tie to concrete metrics.

When outputs vary too widely, customers struggle to understand why the product is valuable. And if value can’t be explained, it can’t be measured—which makes it nearly impossible to sell or scale.

Generality, in this sense, doesn’t just hurt reliability. It hurts revenue.

Opinionation Is How Products Take Control Back

Across these anecdotes, a consistent pattern emerged:

The more general the system, the more responsibility is abdicated to the model—and the less control the product team retains.

By contrast, the AI products that successfully crossed from prototype to production looked very different. They were highly opinionated.

Aiden described how real progress came not from adding flexibility, but from removing it. Teams narrowed the scope. They encoded domain assumptions directly into workflows. They removed optionality. They chose depth over breadth.

Instead of asking the model to figure everything out, they asked a different question:

“What decisions should the product make so the model doesn’t have to?”

Opinionated Systems Teach Users How to Succeed

This idea surfaced again when Linus Lee spoke about interfaces and cognition. Every AI product teaches users how to think with it—whether intentionally or not:

General systems teach uncertainty.
Opinionated systems teach clarity.

When workflows are explicit:

Users learn faster.
Trust builds more quickly.
Success becomes repeatable.

The product becomes legible instead of mysterious.

Why Specificity Wins in the Real World

In practice, the difference is stark.

General systems tend to:

Produce unpredictable outputs.
Fail in subtle, hard-to-debug ways.
Create a UX that’s difficult to explain.
Erode trust through inconsistency.

Opinionated systems tend to:

Surface fewer failure modes.
Make success repeatable.
Clarify what “good usage” looks like.
Feel reliable even when the model isn’t perfect.

As one speaker noted during the session:

“Models don’t fail gracefully—products have to make them fail gracefully.”

That only happens when constraints are intentional.

Prototypes Need Breadth. Products Need Structure.

This led to one of the clearest takeaways of the panel:

Generality makes prototypes impressive.
Specificity makes products usable.

Prototypes exist to explore what’s possible. Products exist to work—repeatedly, for real users, in real conditions.

Opinionation isn’t premature optimization. It’s the mechanism by which AI systems become dependable.

Teams that delay opinionation often end up retrofitting guardrails onto systems that were never designed to support them. Teams that embrace it early build foundations that scale.

The Quiet Contrarian Insight

In a landscape obsessed with flexibility and “AI that can do everything,” this panel offered a quieter, more durable insight:

The path from prototype to product isn’t paved with more general intelligence. It’s paved with tighter workflows, clearer assumptions, and intentional constraints.

2. Control Over the Environment Determines Feasibility

Another sharp dividing line between prototypes and real products emerged quickly in this panel: how much control the system has over its environment.

Across very different domains—product workflows, enterprise systems, and creative tooling—the same pattern repeated:

“AI systems perform dramatically better when the environment is constrained.”

When teams control:

The inputs.
The structure of the task.
The available actions.
The shape of the output.

AI systems feel capable, reliable, and even “smart.” When they don’t, reliability collapses fast.

AI Thrives in Structured Worlds

Several speakers described how early success almost always happened in environments where:

Inputs were owned or normalized.
Patterns repeated frequently.
Constraints were known ahead of time.
Failure modes were visible and enumerable.

In these settings, models didn’t need to reason from first principles every time. They could operate within guardrails.

As Aiden Bai noted, many early prototypes feel magical precisely because they live in these controlled worlds. The system works on clean data, predictable tasks, and narrow problem definitions. It’s not that the model is unusually capable—it’s that the environment is unusually forgiving.

This creates a dangerous illusion.

Leaving the Sandbox Is Where Things Break

The moment these systems leave controlled environments, cracks appear.

Matthew Rastovac spoke directly to this from an enterprise perspective. Once AI systems interact with real customer data, real workflows, and real organizational complexity, unpredictability spikes. Inputs aren’t clean. Processes aren’t linear. Edge cases aren’t rare—they’re constant.

Enterprise systems introduce:

Inconsistent schemas.
Legacy workflows.
Partial permissions.
Conflicting sources of truth.
Human-driven exceptions.

In those conditions, even strong models struggle—not because they lack intelligence, but because they’re being asked to operate without a stable frame of reference.

Arbitrary Inputs Are the Enemy of Reliability

This challenge becomes even more pronounced in domains like code, content, and knowledge work.

Josh Payne described how systems that worked well on curated examples failed once exposed to the diversity of real customer data. What looked robust in testing collapsed under the weight of:

Messy inputs.
Inconsistent structure.
Unclear user intent.

These failures weren’t dramatic crashes. They were subtle. Outputs were “almost” right—just wrong enough to erode trust.

And because the failures were inconsistent, they were hard to debug and even harder to explain to users.

Toy Examples Hide Real Constraints

One of the most consistent failure modes discussed on the panel was over-reliance on toy examples.

Many impressive prototypes fail because:

They’re built on idealized data.
They assume cooperative users.
They ignore edge cases.
They avoid ambiguous scenarios.

These prototypes answer the question: “Can the model do this?”

Production systems must answer a harder one: “Can the system do this reliably, every day, for imperfect users, under imperfect conditions?”

That gap is where most AI products fail.

Environmental Control Beats Model Power

A key insight from the panel was that model capability is often the wrong lever to pull.

Teams instinctively respond to failures by:

Switching models.
Increasing context windows.
Tuning prompts.
Layering complexity.

But as multiple speakers emphasized, these changes rarely fix the root problem. The issue is not intelligence—it’s exposure.

Systems fail because:

They’re asked to handle arbitrary inputs.
They lack clear task boundaries.
They don’t know which failures matter.
They don’t control how work enters the system.

As Linus Lee framed it, feasibility is not just a modeling question—it’s a product and interface question. The more a system can shape the environment it operates in, the less it has to rely on raw reasoning.

Task Framing Is the Hidden Superpower

The most successful teams on the panel didn’t try to eliminate environmental complexity. They absorbed it into the product design.

They:

Pre-processed inputs.
Guided users into structured flows.
Constrained actions intentionally.
Limited surface area for failure.

By doing so, they reduced the cognitive burden on the model and increased consistency for users.
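
The four moves above can be sketched as a thin framing layer that sits between the user and the model. This is an assumption-heavy illustration rather than anything the panelists described: the action names, the character budget, and the `frame_task` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allowlist: the only actions this product supports.
ALLOWED_ACTIONS = {"summarize", "draft_reply", "tag_ticket"}
# Hypothetical input budget, in characters.
MAX_INPUT_CHARS = 2000


@dataclass(frozen=True)
class Task:
    """A request that has already been shaped by the product."""
    action: str
    text: str


def normalize_text(raw: str) -> str:
    """Pre-process input: collapse whitespace and trim to the budget."""
    collapsed = " ".join(raw.split())
    return collapsed[:MAX_INPUT_CHARS]


def frame_task(action: str, raw_text: str) -> Task:
    """Reject unsupported actions and empty input before any model call."""
    action = action.strip().lower()
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    text = normalize_text(raw_text)
    if not text:
        raise ValueError("input text is empty after normalization")
    return Task(action=action, text=text)


if __name__ == "__main__":
    print(frame_task(" Summarize ", "Customer  wrote:\n\n  refund please "))
```

Nothing here makes the model smarter. It only guarantees that whatever reaches the model is one of a few known task shapes, which is the point the panel was making.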

The takeaway was clear:

“Production success depends less on model power and more on environmental control and task framing.”

When teams own the environment, AI looks capable. When they don’t, even the best models look unreliable.

The Practical Implication

If Section 1 argued that opinionated workflows win, Section 2 explains why.

Opinionation isn’t just about UX. It’s about feasibility.

The fastest path from prototype to product is not giving AI more freedom—it’s deciding, deliberately, where freedom is dangerous and structure is necessary.

3. User Understanding Is as Important as Model Accuracy

Several of the failures described on this panel had nothing to do with model quality.

The systems technically worked. The outputs were often reasonable. The models were capable. And yet—users still failed.

This surfaced a critical distinction the panel kept returning to:

“Many AI product failures are UX failures, not technical ones.”

When Users Don’t Know How to Succeed, Accuracy Doesn’t Matter

Across multiple anecdotes, speakers described situations where:

Users didn’t know what to ask.
Users didn’t know how to phrase inputs.
Users didn’t understand what the system could or couldn’t do.
Users couldn’t tell whether an output was “good.”

Even when the system produced correct or useful responses, users lacked confidence in how to use it.

As Aiden Bai explained, this creates a subtle but fatal problem: users blame themselves. When they don’t know whether they’re using a product correctly, they stop experimenting. They hesitate. Eventually, they churn—not because the system failed, but because success felt accidental.

Hidden Mental Models Kill Adoption

A recurring theme was the danger of implicit mental models.

Many AI products assume users will intuitively understand:

What kinds of inputs work best.
How much context to provide.
When the model is confident versus guessing.
Where the system’s boundaries are.

But as Linus Lee emphasized, users don’t arrive with the product team’s mental model. When success depends on unspoken rules, only power users thrive—everyone else quietly fails.

This creates a false signal:

The product “works” for a small group.
Engagement looks healthy at the surface.
But learning doesn’t spread.

Without explicit guidance, the system becomes brittle outside of expert hands.

Affordances Matter More Than Capability

Several panelists stressed that capability is useless if affordances are unclear.

Matthew Rastovac described this tension in enterprise contexts. Even highly capable agent systems struggled when users couldn’t predict behavior or understand why certain actions were taken.

In those environments, confusion is indistinguishable from risk, and risk is unacceptable.

When affordances are unclear:

Users hesitate to rely on outputs.
Teams introduce manual checks.
Automation stalls.
Trust erodes.

The system doesn’t need to be perfect. It needs to be legible.

Trust Depends on Understanding, Not Just Accuracy

Josh Payne framed this from a commercial perspective. Customers don’t just want correct outputs. They want to understand why the product helps them.

If users can’t explain the value of a system to a colleague, adoption doesn’t spread, and renewal becomes fragile.

Trust, in this sense, isn’t about correctness alone. It’s about predictability, explanation, and confidence.

Users trust systems they can reason about—even if those systems are imperfect.

“Good Usage” Must Be Taught, Not Discovered

One of the clearest lessons from the panel was that good usage doesn’t emerge naturally in AI products.

If users must discover:

What to ask.
How to phrase inputs.
How to evaluate outputs.
When to intervene.

Most of them won’t.

Successful teams made good usage explicit. They:

Constrained inputs.
Provided examples.
Guided first actions.
Surfaced boundaries clearly.

They didn’t assume users would figure it out.

If Users Can’t Explain It, the Product Doesn’t Exist

This led to one of the most blunt conclusions of the panel:

“If users can’t explain how your product helps them, the product doesn’t exist.”

Accuracy alone doesn’t create understanding.

Understanding creates confidence. Confidence creates habit.

Without that chain, even technically impressive systems fail to become products.

The Practical Takeaway

Model accuracy matters, but user comprehension determines whether accuracy is ever experienced.

The teams that succeeded didn’t just build smarter systems. They built systems that taught users how to succeed.

In AI products, clarity is not UX polish. It’s a core capability.

4. Prototypes Are Cheap — Maintenance Is the Real Cost

One of the most sobering insights from the panel was that AI has made prototyping deceptively easy.

With modern models, teams can:

Stand up impressive demos in days.
Chain together workflows quickly.
Simulate “end-state” product behavior early.

This is a genuine gift. It dramatically lowers the barrier to exploration.

But as multiple speakers warned, it’s also a trap.

The Dangerous Pattern AI Enables

The panel described a pattern that has become increasingly common:

Teams prototype quickly.
Early demos look strong.
Features gain internal and external momentum.
The system gets shipped.
Long-term maintenance costs quietly explode.

Because AI prototypes look so close to finished products, teams often skip a crucial step: asking whether the system is worth maintaining.

As Aiden Bai noted, many teams now treat “we can build this” as sufficient justification to ship. But in AI, feasibility and sustainability are very different questions.

Shipping Is a Commitment, Not a Milestone

Once an AI feature ships, it stops being an experiment.

It becomes:

Something users rely on.
Something customers expect to improve.
Something that must remain stable.
Something that must adapt as models change.

Matthew Rastovac emphasized this from an enterprise perspective. In large organizations, every shipped capability creates an implicit contract. Even “experimental” features quickly become assumed infrastructure.

Removing or degrading them later is far harder than never shipping them at all.

The cost of reversal is high—both technically and politically.

AI Features Age Faster Than Traditional Software

Another key distinction surfaced on the panel: AI features don’t stay still.

Unlike traditional software, AI systems must evolve alongside:

Changing model behavior.
Shifting user expectations.
New failure modes.
Emerging best practices.

What worked six months ago may feel broken today—not because the system regressed, but because the surrounding ecosystem moved.

As Josh Payne pointed out, this makes AI features uniquely expensive to maintain. They require continuous reevaluation, not occasional updates.

Without active stewardship, quality decays silently.

Debugging Gets Harder Over Time, Not Easier

Several speakers also highlighted how maintenance cost compounds in non-obvious ways.

Early on:

Failures are obvious.
Edge cases are limited.
The system’s behavior is still well understood.

Over time:

Failures become subtle.
Behavior drifts.
Assumptions break.
No one fully remembers why decisions were made.

Debugging shifts from “what broke?” to “why does this behave like this at all?”

That transition is where many AI products stall.

The Question Teams Rarely Ask

All of this led to one of the most important reframes of the panel.

The real question isn’t:

“Can we build this?”

With modern AI, the answer is almost always yes.

The real question is:

“Are we willing to maintain this for years?”

That means being willing to:

Own its failures.
Evolve it as models change.
Explain it to users repeatedly.
Defend it internally.
Invest in its long-term quality.

If the answer is no, shipping the prototype is often a mistake—no matter how impressive it looks.

A More Disciplined Definition of Speed

This insight ties directly back to the panel’s broader theme: real speed is long-term speed.

Shipping something that creates drag six months later is not velocity. It’s debt.

The teams that succeed don’t ship fewer prototypes. They ship fewer commitments.

They explore aggressively, but commit selectively.

The Practical Takeaway

AI makes it easy to build things. It does not make owning them easy.

Teams that treat every prototype as a potential long-term system make different decisions:

They constrain the scope earlier.
They delay shipping until maintenance is understood.
They design for evolution, not just launch.

In an era where prototypes are cheap, judgment about what to ship becomes the real competitive advantage.

5. Data Reality Beats Synthetic Optimism

Many of the production failures discussed on this panel didn’t stem from model weakness.

They stemmed from a data mismatch.

Again and again, speakers described the same underlying issue: systems that looked impressive in controlled testing environments broke down almost immediately when exposed to real-world data.

The problem wasn’t intelligence. It was optimism.

The Comfort of Clean Data

AI prototypes are often built on data that is:

Clean.
Structured.
Well-labeled.
Internally generated.
Carefully curated.

This makes early progress feel smooth. Outputs look coherent. Failure rates appear low. The system feels “ready.”

But as Josh Payne noted, this creates a false sense of confidence. Clean data hides the very conditions that define production environments: ambiguity, inconsistency, and noise.

Synthetic data, in particular, tends to encode the assumptions of the team that generated it. That makes it useful for testing logic, but dangerous for validating feasibility.

Real Data Is Messy — And Honest

Once systems encounter real user data, the illusion collapses.

Matthew Rastovac described how quickly edge cases surface inside enterprise systems. Inputs arrive partially filled, inconsistently formatted, or shaped by legacy processes no one fully understands.

Data sources conflict. Human behavior introduces exceptions that no synthetic dataset anticipates.

In those conditions:

Models hallucinate more often.
Confidence signals break down.
Workflows fail silently.
Trust erodes.

These failures aren’t rare. They’re immediate.

Generalization Is Not Guaranteed

A critical mistake surfaced repeatedly in the panel: assuming that strong performance on one dataset implies strong performance everywhere.

As Aiden Bai pointed out, model behavior is highly sensitive to distribution shifts. What works well on curated inputs can fail dramatically when:

Vocabulary changes.
Structure degrades.
Context is incomplete.
User intent is unclear.

Generalization is not automatic, and in many cases, it never arrives without deliberate intervention.

Feasibility Must Be Proven Early

One of the strongest recommendations from the panel was simple, but uncomfortable:

“Use real data as early as possible.”

Not after the prototype. Not after the demo. Not after initial traction.

Early feasibility checks save enormous downstream cost. They reveal:

Whether the problem is actually solvable.
Where constraints need to be added.
How much preprocessing is required.
Which failure modes matter most.

Teams that delay real-data testing often spend months optimizing systems that were never viable in the first place.

Edge Cases Aren’t Edge Cases

Another subtle but important point: in production, edge cases stop being edges.

Once a system is deployed:

Rare inputs appear regularly.
Unexpected usage becomes normal.
Misuse becomes common.
Ambiguity becomes the default.

As Linus Lee framed it, production environments don’t just surface edge cases. They invert them.

What seemed unlikely in testing becomes inevitable in the wild.

This is why synthetic optimism fails so reliably. It prepares teams for best-case scenarios in a world dominated by worst-case inputs.

Garbage In Still Applies — Faster Than Ever

The panel returned to an old truth, with a modern twist:

“Garbage in, garbage out still applies. AI just makes the consequences arrive faster.”

Bad data doesn’t just degrade performance. It accelerates failure.

Because AI systems act confidently even when they’re wrong, poor inputs don’t produce obvious crashes. They produce plausible errors—the most dangerous kind.

The Practical Takeaway

Teams that succeed don’t avoid messy data. They confront it immediately.

They:

Test with real inputs early.
Design workflows to absorb noise.
Constrain what data is allowed in.
Surface uncertainty explicitly.
Build around failure, not perfection.
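
Two of these habits, constraining what data is allowed in and surfacing uncertainty explicitly, can be sketched in a few lines. This is a minimal Python illustration, not something the panel presented: the field names (`customer_id`, `message`), the length limit, and the `GateResult` type are all hypothetical. The idea is that a messy record yields an explicit list of problems instead of flowing silently into a model.

```python
from dataclasses import dataclass, field

# Hypothetical required fields and size limit; adjust to your own schema.
REQUIRED_FIELDS = ("customer_id", "message")
MAX_MESSAGE_CHARS = 2000


@dataclass
class GateResult:
    accepted: bool
    cleaned: dict = field(default_factory=dict)
    problems: list = field(default_factory=list)


def gate_input(record: dict) -> GateResult:
    """Normalize a raw record and report every reason it is not safe to use."""
    problems = []
    cleaned = {}
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # Real inputs arrive partially filled: treat None and blank strings alike.
        if value is None or not str(value).strip():
            problems.append(f"missing field: {name}")
        else:
            cleaned[name] = str(value).strip()
    if len(cleaned.get("message", "")) > MAX_MESSAGE_CHARS:
        problems.append("message too long")
    return GateResult(accepted=not problems, cleaned=cleaned, problems=problems)


ok = gate_input({"customer_id": " 42 ", "message": "Where is my order?"})
bad = gate_input({"customer_id": "", "message": None})
print(ok.accepted, ok.cleaned)
print(bad.accepted, bad.problems)
```

A rejected record keeps its list of problems, so the workflow can route it to a person or a fallback path rather than producing a plausible error.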

In AI products, optimism is expensive.

Reality is cheaper—if you face it early.

6. Building an AI Product Means Building Two Things

One of the most important conceptual frameworks to emerge from the panel was deceptively simple:

“When you ship an AI product, you are building two systems at once.”

Most teams only focus on the first.

The First System: The Product Users See

The first system is the obvious one:

The interface.
The workflows.
The outputs.
The features customers interact with.

This is the artifact teams demo, launch, and market. It’s where most effort is visibly spent, and where most AI conversations begin.

But as the panel made clear, this system alone is not enough.

The Second System: The One That Keeps the First Alive

The second system is quieter, less visible, and far more decisive.

It’s the organizational system that:

Observes how the product behaves in the wild.
Detects when outputs degrade.
Understands why failures occur.
Enables safe iteration.
Evolves as models and user expectations change.

This system doesn’t ship to customers, but without it, the customer-facing product inevitably decays.

As Matthew Rastovac emphasized from an enterprise standpoint, AI products don’t just require ongoing support. They require continuous interpretation.

Outputs need context. Failures need explanation. And teams need mechanisms to decide when a system is “good enough” versus when it’s quietly drifting.

Why AI Products Are Fundamentally Different

In traditional software, the rules are relatively stable:

Logic is deterministic.
Behavior changes only when engineers change it.
Best practices evolve slowly.

AI breaks all three assumptions.

As several speakers noted:

Model behavior can shift without code changes.
Upgrades introduce new capabilities and new regressions.
User expectations evolve as AI becomes more commonplace.
Yesterday’s “impressive” becomes today’s “table stakes.”

This means AI products don’t just age. They mutate.

Without a strong second system in place, teams lose the ability to reason about what’s happening inside their own product.

The Invisible Work That Actually Determines Success

When the panel discussed teams that successfully shipped AI products at scale, the conversation quickly moved away from prompts and models and toward internal processes.

Long-term success depended far more on:

Observability into real-world usage.
Fast feedback loops.
Clear ownership of failure modes.
Evaluation infrastructure that evolves over time.
Teams that actively learn from mistakes.

As Aiden Bai noted, teams that move quickly without these systems often appear productive, until suddenly they aren’t.

Progress stalls not because the product is bad, but because no one can confidently change it anymore.

Iteration Without Understanding Is Just Thrash

Another key insight was that iteration alone is not a virtue.

Teams can ship frequently and still move backward if they:

Don’t understand why changes help or hurt.
Lack signal on output quality.
Can’t trace failures to causes.
Don’t know which metrics actually matter.

This is where the second system earns its keep. It transforms iteration from guesswork into learning.

As Linus Lee framed it, the real challenge isn’t building intelligence. It’s building understanding around intelligence.

Without shared understanding inside the team, velocity collapses into churn.

Evaluation Is a Living System, Not a One-Time Setup

Evaluation came up repeatedly as a core part of this second system, but with an important caveat.

Evals are not something you “set and forget.”

They must:

Evolve as the product evolves.
Reflect real user behavior.
Adapt to new use cases.
Change as expectations change.

Static evals freeze assumptions in time. Living evals encode learning.

Teams that treated evaluation as infrastructure, not tooling, were better positioned to move fast without breaking trust.

The Real Competitive Advantage

By the end of the discussion, a clear pattern had emerged.

The most successful AI teams weren’t the ones with:

The biggest models.
The cleverest prompts.
The most impressive demos.

They were the ones with:

Tight learning loops.
Strong internal feedback.
Clear ownership.
The ability to change their product with confidence.

In other words, they built organizations that could evolve as quickly as their technology.

The Practical Takeaway

AI products are not static artifacts. They are living systems.

And living systems require:

Observation.
Care.
Feedback.
Adaptation.

If you only build the product users see, you will eventually lose control of it.

If you build the second system—the one that understands, evaluates, and evolves the first—you earn the right to ship AI into the real world.

7. Evals Are Automation — Not Truth

Evaluations came up repeatedly on the panel, but not in the way many teams expect.

Rather than positioning evals as a silver bullet, the speakers shared a more cautious, experience-earned view:

“Evals scale insight, but they reduce resolution.”

They are powerful tools. They are also blunt instruments.

Understanding that tradeoff is critical to building AI products that improve over time instead of calcifying prematurely.

What Evals Are Actually Good At

At their best, evals do three things extremely well:

They automate human judgment.
They enable iteration at scale.
They prevent regressions.

Several speakers described evals as essential guardrails. They make sure teams don’t move backward as systems evolve.

But guardrails are not maps.
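
The regression-prevention role is the easiest of the three to make concrete. The sketch below is a hypothetical Python example, not a tool the panel described: the case names, the scores, and the `tolerance` value are invented for illustration. It compares a candidate run against a stored baseline and reports which cases moved backward.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Return names of eval cases whose score dropped by more than `tolerance`."""
    regressions = []
    for case, old_score in baseline.items():
        # A case missing from the candidate run counts as a regression (score 0).
        new_score = candidate.get(case, 0.0)
        if old_score - new_score > tolerance:
            regressions.append(case)
    return sorted(regressions)


# Hypothetical scores in [0, 1] from two runs of the same eval suite.
baseline_scores = {"refund_policy": 0.90, "order_status": 0.80, "tone": 0.70}
candidate_scores = {"refund_policy": 0.91, "order_status": 0.60}

regressed = find_regressions(baseline_scores, candidate_scores)
print(regressed)
```

A check like this says only that something got worse on the cases you already wrote down. It says nothing about failures the suite never encoded.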

Where Evals Quietly Fail

The panel was equally clear about what evals don’t do well.

Evals:

Rely on proxy signals.
Encode assumptions that may be wrong.
Flatten nuance into binary scores.
Struggle with edge cases.
Fail to capture intent, context, or taste.

As Daksh Gupta noted elsewhere in the event, once an eval exists, teams tend to optimize for it, even when it no longer reflects reality.

What started as a helpful abstraction slowly becomes a constraint on thinking.

The risk isn’t that evals are inaccurate.

The risk is that they are confidently incomplete.

Resolution vs Scale Is a Real Tradeoff

A key mental model that emerged was the idea of resolution.

Human review has:

High resolution.
Strong intuition.
Deep contextual awareness.

But it doesn’t scale.

Evals, by contrast, have:

Massive scale.
Consistency.
Speed.

But low resolution.

As Linus Lee framed it during the discussion, evals compress complex judgment into simplified signals. That compression is useful, but it necessarily discards information.

The mistake teams make is assuming compression is harmless.

Evals Can Freeze Bad Assumptions

Several speakers warned about introducing evals too early.

When evals are created before:

Failure modes are understood.
Good usage is well defined.
The product has stabilized.

They tend to encode guesses, not knowledge.

From that point on:

The system optimizes toward the eval.
Exploration slows.
Unexpected behaviors are suppressed.
Real learning stalls.

What looks like progress is often just alignment with an incomplete metric.

How Great Teams Actually Use Evals

The most effective teams on the panel treated evals very differently.

They used evals as:

Learning accelerators, not arbiters of truth.
Ways to scale known insights, not discover new ones.
Safety nets, not steering mechanisms.

Human judgment remained central.

Teams continued to:

Review real outputs.
Talk to users directly.
Interrogate surprising behavior.
Revisit eval criteria frequently.

Evals didn’t replace judgment. They made judgments faster and more focused.

The Real Goal of Evaluation

This led to one of the cleanest reframes of the panel:

“The goal of evals isn’t perfection. It’s making humans faster at understanding where models fail.”

Perfection is a mirage. Understanding is durable.

When evals are used to surface where to look, not what to believe, they unlock speed without sacrificing insight.

The Practical Takeaway

Evals are infrastructure, not intelligence.

They are most powerful when:

Grounded in deep domain understanding.
Updated as products evolve.
Paired with continuous human review.
Treated as provisional, not absolute.

Teams that mistake evals for truth slow themselves down.

Teams that use evals to amplify learning move faster and with confidence.

8. Frameworks Encode Values — Choose Carefully

One of the quieter but most consequential insights from the panel was that framework choice is not a neutral technical decision.

It’s philosophical.

Frameworks don’t just provide abstractions. They encode:

Assumptions about how work should be done.
Values about speed versus safety.
Opinions about who the product is for.
Mental models about how systems should evolve.

When teams adopt a framework, they’re not just choosing tooling. They’re choosing a worldview.

Frameworks Optimize for Something — Always

Several speakers noted that most modern AI frameworks are optimized for a specific phase of development.

Common priorities include:

Speed of iteration.
Ease of onboarding.
Approachability for new users.
Rapid prototyping.

These are not bad goals. In fact, they’re often exactly what teams need early on.

But as the panel emphasized, those same values frequently come into conflict with what production systems require.

What Prototyping Frameworks Often Trade Away

Frameworks designed for speed and flexibility tend to de-emphasize:

Robustness.
Explicit control.
Debuggability.
Long-term evolvability.

Early on, these tradeoffs are invisible. Everything works. Changes are easy. Velocity feels high.

Over time, however, the costs surface.

As Linus Lee pointed out, abstraction layers that hide complexity also hide causality. When something goes wrong, teams struggle to understand why.

Behavior becomes emergent rather than intentional. Debugging shifts from reasoning to guesswork.

The framework didn’t break. It did exactly what it was designed to do.

Tooling Shapes How Teams Think

A subtle but important point emerged during the discussion: frameworks don’t just shape systems. They shape teams.

They influence:

How problems are framed.
Where teams look for solutions.
Which tradeoffs feel “normal.”
What kinds of questions get asked.

Frameworks optimized for rapid demos encourage experimentation and breadth. Frameworks optimized for production encourage constraint, observability, and discipline.

Neither is universally correct. But mismatches are costly.

Speed Now vs Speed Later

Several speakers highlighted a recurring mistake: optimizing for early velocity at the expense of future movement.

Frameworks that make it easy to ship quickly, but hard to change direction or painful to evolve, often impose invisible ceilings on long-term speed.

As Aiden Bai noted earlier in the panel, the fastest teams long-term are not the ones that move fastest on day one. They’re the ones that preserve optionality.

Framework choice plays a large role in whether that optionality exists.

Adoption Is a Commitment

Once a framework is deeply embedded:

Workflows form around it.
Team expertise concentrates within it.
Migration costs rise.
Architectural decisions harden.

At that point, changing frameworks is no longer a refactor. It’s a replatforming.

This is why the panel encouraged teams to treat framework adoption with the same seriousness as core architectural decisions.

The Practical Reframe

The panel offered a simple but powerful way to think about frameworks:

“Ask not just ‘What does this framework make easy?’ Ask ‘What does it make hard?’”

Every framework makes something difficult:

Introspection.
Control.
Customization.
Evolution.

Those tradeoffs only become painful when the product outgrows the framework’s original intent.

The Practical Takeaway

Frameworks are leverage, but leverage cuts both ways.

They can:

Accelerate learning.
Reduce boilerplate.
Enable fast exploration.

They can also:

Constrain thinking.
Obscure failure modes.
Slow adaptation over time.

Choosing a framework means choosing which problems you want to solve, and which problems you’re willing to inherit later.

In AI products, where change is constant and certainty is rare, that choice deserves more intention than it usually gets.

9. Task Decomposition Beats End-to-End Autonomy

One of the most consistent reframes on the panel challenged a question many AI teams instinctively ask:

“Can an agent do this end-to-end?”

The panel suggested a better one:

“How should this task be factored between the human and the model?”

That shift, from autonomy to decomposition, turned out to be decisive.

End-to-End Autonomy Is a Fragile Goal

Several speakers described early attempts to build fully autonomous agents that could:

Take a vague input.
Reason through a complex task.
Execute multiple steps.
Deliver a finished result.

These systems often looked impressive in demos. But they failed in production for predictable reasons.

End-to-end autonomy concentrates too much responsibility in one place:

Intent interpretation.
Decision-making.
Execution.
Error handling.

When something goes wrong, there’s no clear boundary for intervention. Failure becomes opaque. Trust collapses quickly.

As one panelist noted:

“Users don’t mind AI helping. They mind AI disappearing into a black box.”

Decomposition Creates Control

By contrast, the teams that found real success broke complex tasks into smaller, legible pieces.

Instead of a single autonomous flow, they designed systems with:

Intermediate artifacts.
Assistive steps.
Explicit checkpoints.
Explainable outputs.

Each step answered a narrower question. Each output gave the user something concrete to react to.

This approach didn’t just improve reliability. It improved collaboration.
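
One way to picture this structure is as a loop in which every step produces an artifact and pauses for review. The following is a minimal Python sketch under stated assumptions: `run_with_checkpoints`, the step names, and the lambda stand-ins for model calls are hypothetical, and a real system would collect the review from a person through a UI instead of a function.

```python
from typing import Callable, Optional


def run_with_checkpoints(task: str, steps: list,
                         review: Callable[[str, str], Optional[str]]) -> list:
    """Run steps in order; after each one, a reviewer may accept, edit, or stop.

    `review(step_name, artifact)` returns the (possibly edited) artifact to
    continue with, or None to stop the workflow at that checkpoint.
    """
    artifacts = []
    current = task
    for name, step in steps:
        draft = step(current)               # the model proposes
        approved = review(name, draft)      # the human judges
        if approved is None:
            break                           # explicit boundary for intervention
        artifacts.append((name, approved))  # intermediate artifact is kept
        current = approved
    return artifacts


# Stand-ins for model calls (hypothetical); real steps would call an LLM.
steps = [
    ("outline", lambda text: f"outline of: {text}"),
    ("draft", lambda text: f"draft from: {text}"),
    ("send", lambda text: f"sent: {text}"),
]


def reviewer(step_name: str, artifact: str) -> Optional[str]:
    # Approve everything except the irreversible final step.
    return None if step_name == "send" else artifact


result = run_with_checkpoints("Q3 report", steps, reviewer)
print(result)
```

Because the workflow stops at a named step, a failure has a location: the user can see which artifact was rejected and everything that was approved before it.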

Humans Are Better Judges Than Executors

A recurring insight was that humans and models excel at different parts of the workflow.

Models are strong at:

Pattern recognition.
Synthesis.
Drafting.
Proposing options.

Humans are strong at:

Judgment.
Context.
Prioritization.
Responsibility.

When tasks are decomposed intentionally, each party does what it does best.

As Aiden Bai pointed out earlier in the panel, systems that treat AI as a collaborator, not a replacement, tend to scale better.

Users feel in control. They understand where the system helps and where it defers.

Explainability Drives Trust

Matthew Rastovac emphasized that explainability isn’t a compliance requirement. It’s a usability requirement.

When users can see:

How outputs were generated.
What assumptions were made.
Where uncertainty exists.

They’re far more likely to trust the system, even when it makes mistakes.

Decomposed systems surface reasoning naturally, because each step has a purpose.

End-to-end systems hide reasoning, because there’s nowhere to expose it without breaking the illusion of autonomy.

Adoption Follows Legibility

Another theme that emerged was adoption speed.

Systems that relied on full autonomy:

Required more onboarding.
Triggered more skepticism.
Produced more hesitation.

Systems that offered assistance in steps:

Felt safer.
Were easier to learn.
Integrated more naturally into existing workflows.

As Josh Payne noted earlier in the panel, adoption isn’t about how powerful a system is.

It’s about how easily users can see themselves using it successfully.

Human + AI Beats AI Alone

Across anecdotes and domains, the conclusion was consistent:

“Human + AI systems outperform AI-only systems when boundaries are explicit.”

Explicit boundaries:

Clarify responsibility.
Reduce surprise.
Enable graceful failure.
Preserve user agency.

Autonomy can be added later, once trust, understanding, and structure exist.

The Practical Takeaway

The question isn’t whether AI can do something end-to-end.

It’s whether it should.

Teams that default to decomposition:

Ship faster.
Build trust earlier.
Adapt more easily.
Avoid catastrophic failure.

In AI products, autonomy is not the starting point. It’s the reward for getting everything else right.

10. The Hardest Problems Are Still Hard

The panel closed on a sobering but ultimately empowering truth: some problems are still genuinely hard.

Not hard because teams lack talent. Not hard because models aren’t improving.

Hard because the problems themselves sit at the edge of what today’s systems can reliably handle.

And acknowledging that reality turned out to be a strength, not a weakness.

Value and Difficulty Are Still Correlated

Several speakers noted a pattern that can be uncomfortable in an era of rapid progress.

The most valuable problems tend to be:

Deeply contextual.
Poorly structured.
Full of ambiguity.
Dependent on human judgment.
Embedded in messy real-world systems.

These are exactly the problems where AI looks promising, and where it most often breaks down in production.

As Linus Lee framed it earlier, intelligence alone doesn’t solve these problems.

They require alignment between models, interfaces, workflows, and human expectations, and that alignment is still hard to achieve.

Model Capability Has Limits — And That’s Okay

The panel was notably clear-eyed about current model limitations.

Despite dramatic improvements:

Reasoning degrades under uncertainty.
Long-horizon tasks remain fragile.
Edge cases dominate real usage.
Confidence often exceeds correctness.

Pretending these limits don’t exist leads teams to ship systems that fail silently, and damage trust in the process.

As Matthew Rastovac noted from an enterprise lens, realism about limitations is often what unlocks adoption.

Users are far more forgiving of systems that clearly communicate what they can’t do than systems that promise everything and fail unpredictably.

Exploration Is Not the Same as Production

A crucial distinction emerged between exploration and production.

Exploration:

Tolerates failure.
Values learning.
Embraces uncertainty.
Rewards ambition.

Production:

Demands reliability.
Requires accountability.
Exposes weaknesses.
Punishes overreach.

The panel emphasized that confusing these two modes is one of the most common causes of AI product failure.

Exploration is valuable. But production requires restraint.

Honest Feasibility Checks Save Time

Several speakers described projects that only succeeded once teams stopped asking “How do we make this work?” and started asking “Should this exist right now?”

That shift unlocked better decisions:

Reframing the problem.
Narrowing scope.
Delaying automation.
Changing the abstraction.
Temporarily walking away.

This wasn’t failure. It was judgment.

As Aiden Bai put it earlier in the session, knowing when not to ship is just as important as knowing how to ship quickly.

Walking Away Is Sometimes the Fastest Path Forward

One of the most counterintuitive takeaways of the panel was that walking away can be a form of progress.

Teams that succeed long-term:

Revisit problems as models evolve.
Reattempt challenges with better tools.
Recognize when timing is wrong.
Preserve optionality instead of forcing solutions.

Walking away doesn’t mean abandoning ambition.

It means sequencing it correctly.

The Mature View of AI Product Development

By the end of the discussion, a clear philosophy had emerged.

The best teams:

Push hard where leverage exists.
Design carefully where risk is high.
Accept limits without resignation.
Combine optimism with discipline.

They don’t mistake possibility for readiness.

The Final Takeaway

AI is expanding what’s possible, rapidly.

But not everything possible today is viable today.

The teams that win are not the ones who chase the hardest problems blindly.

They’re the ones who:

Understand the limits of current models.
Respect domain complexity.
Choose the right problems at the right time.

In an industry driven by acceleration, the panel offered a grounding reminder:

Progress comes not just from pushing forward, but from knowing when to pause, reframe, and return stronger later.

How to Ship Reliably With Claude Code When Your Engineers Are AI Agents

Nilesh Barla — Sat, 24 Jan 2026 01:00:20 GMT

TLDR: PMs don’t need “AI that codes”; they need a delivery protocol. This blog explains how PMs can ship reliably with Claude Code by using plan-first gates, guardrails, Claude Code subagents, and multi-model review to turn messy tickets into clean, reviewable PRs. You’ll learn how to lead the gates and quality system so Claude Code ships safely and consistently.

Why PMs Need a Delivery Protocol for Agentic Engineering

Let me start off with a scenario.

Let’s assume yesterday’s ticket is three lines long and slightly wrong. The AI agent grabs it anyway, starts coding immediately, and opens a PR that “looks” complete. Then you find that the diff is noisy, the intent is unclear, and the tests are either missing or irrelevant. Engineering does what engineering always does: they don’t trust it, they ask for a rewrite, and you spend your afternoon translating ambiguity into something reviewable.

That failure mode is not about capability. It is about leadership. As an AI PM, you do not have to be an expert coder, but you do have to be a good leader. If your team is still deciding which coding agent to standardize on, start with our Claude Code vs OpenAI Codex comparison.

In agentic engineering, you are no longer just managing people’s throughput. You are managing a delivery system’s production reliability, i.e., how predictable, governable, and reviewable work is under speed. The fix is not “AI that codes.” It is the PM Build Protocol: a plan-first shipping workflow that turns ambiguous intent into structured execution. Plan Mode in Claude Code exists specifically to force safe analysis and requirement clarification before changes begin.

To enable Plan Mode, start Claude Code with the following command:

claude --permission-mode plan

Plan mode in Claude Code.

If you recognize these symptoms, you need a protocol, not more prompts:

PRs are large, noisy, and hard to review.
Engineers say, “This doesn’t match the ticket,” even when it compiles.
AI code review becomes a vibes debate instead of a checklist.
Reliability issues show up late because verification is not enforced.
PM time shifts from product decisions to cleanup and re-explaining intent.

The workflow is: start from the ticket, pass through a plan gate, apply guardrails, run subagent review, run multi-model review, and then open the PR.

A pull request should be opened only after the AI has structured and aligned everything.

In the next sections, we will operationalize each gate—how to run Plan Mode as the approval boundary, how to encode guardrails, how to use Claude Code subagents for structured review, and how to add multi-model review so humans only see clean, trustworthy diffs.

Plan Mode With Claude Code to Turn a Ticket Into an Execution-Ready Plan

Plan Mode is the first place where agentic delivery becomes governable. It is your go/no-go gate: no code changes until the model can produce an execution-ready plan that a human can review and approve. Claude Code is explicitly designed to support plan-first behavior before taking actions.

In plain PM terms, plans are the unit of work. Tickets are intent. Diffs are output.

A plan is the path that makes intent legible and output reviewable. When you treat the plan as the artifact—especially when the input is a Linear issue—you stop “AI thrash” early, and you make engineering trust possible.

Plan Output Contract

Goals and non-goals, stated in one sentence each.
Scope boundaries that define what will not be changed.
Files or components likely to be touched and why.
Assumptions and open questions, labeled as blocking vs non-blocking.
Acceptance criteria rewritten as checkboxes that the PR must satisfy.
Test approach mapping each acceptance criterion to a test or verification step.
Rollout and rollback plan, including flags, monitoring, and safe failure behavior.
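
If the plan is captured as structured data, part of the go/no-go check can be automated before a human reads it. The sketch below is a hypothetical Python encoding of the contract, not a Claude Code feature: the section keys, the `blocking` flag, and the `review_plan` helper are assumptions made for illustration.

```python
# Section names are a hypothetical encoding of the seven contract items above.
REQUIRED_SECTIONS = (
    "goals_and_non_goals",
    "scope_boundaries",
    "files_to_touch",
    "assumptions_and_questions",
    "acceptance_criteria",
    "test_approach",
    "rollout_and_rollback",
)


def review_plan(plan: dict) -> list:
    """Return the reasons a plan is not yet approvable; an empty list means go."""
    issues = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in plan]
    # Any open question labeled blocking stops the gate.
    for question in plan.get("assumptions_and_questions", []):
        if question.get("blocking"):
            issues.append(f"blocking question: {question['text']}")
    # Every acceptance criterion needs a mapped test or verification step.
    tests = plan.get("test_approach", {})
    for criterion in plan.get("acceptance_criteria", []):
        if criterion not in tests:
            issues.append(f"no test for: {criterion}")
    return issues


plan = {
    "goals_and_non_goals": "Add CSV export; do not change the PDF export.",
    "scope_boundaries": "No schema changes.",
    "files_to_touch": ["export/csv.py"],
    "assumptions_and_questions": [{"text": "Which delimiter?", "blocking": True}],
    "acceptance_criteria": ["CSV downloads", "Header row present"],
    "test_approach": {"CSV downloads": "integration test"},
}

issues = review_plan(plan)
print(issues)
```

A mechanical check like this only confirms that the plan is complete. Whether the plan is right is still a human decision.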

Copy/paste prompt box:

You are in Plan Mode. Do not modify code.
Use the Plan Output Contract format exactly (7 bullets).
Input: .
Ask only blocking questions; if none, proceed to the plan.
Name files/components you expect to touch and why.
List tests and rollout/rollback steps tied to acceptance criteria.
For any Claude Code feature/workflow claim, cite an official Anthropic/Claude Code source.

Output: An execution-ready plan that can be approved like a spec, then handed to the agent to implement with guardrails.

Guardrails That Make AI Coding Reliable in Production

Guardrails are how you convert agent autonomy into production reliability. In practice, guardrails are concrete constraints: permissions, scoped access, allowed tools/commands, data-handling boundaries, and mandatory checks that must pass before work is considered done. Anthropic’s best practices for agentic coding are worth checking out.

Guardrails Ladder

Tier 1: Read-only and analysis.
Agent can inspect, explain, and plan, but not write files or run risky commands. This tier matters: I saw an issue about Claude Code on GitHub reporting that it ignored its instructions and went on to “…modify files that should be blocked.”
Tier 2: Controlled changes in scoped directories.
Agent can edit only within approved paths and use a pre-approved tool set, with prompts for anything outside the allowlist.
Tier 3: PR-ready changes with enforced checks.
Agent can produce a PR candidate only after automated checks run via hooks and the workflow produces evidence (tests, lint, and a clear diff narrative).

Non-negotiables:

Secrets are never committed; keys and tokens must be handled via environment variables or a secrets manager, not files in the repo.
Directory boundaries are explicit; sensitive paths are disallowed, and the agent’s working scope is narrowed to the minimum viable surface.
Safe commands are pre-approved through Claude Code’s permissions system and shared via project settings to standardize behavior.
Tests and lint are mandatory; hooks should run checks automatically and fail fast when standards are not met.
Logging discipline is enforced; hooks can record tool activity so reviews have an audit trail of what ran and why.
Rollback is expected; every change carries a safe failure path, whether that is a flag, a revert strategy, or a limited rollout plan.
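
Two of these non-negotiables, directory boundaries and secret handling, are simple enough to check mechanically. The following Python sketch shows the kind of check a hook or CI step could run on each changed file. It is illustrative only: the allowed paths, the secret regex, and the `check_change` helper are hypothetical, and it does not use or depict Claude Code’s own permissions or hooks configuration.

```python
import re
from pathlib import PurePosixPath

# Hypothetical project policy; the paths and pattern are illustrative only.
ALLOWED_DIRS = ("src/billing", "tests/billing")
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
    re.IGNORECASE,
)


def check_change(path: str, added_text: str) -> list:
    """Return guardrail violations for one changed file."""
    violations = []
    posix = PurePosixPath(path)
    # Reject absolute paths and ".." segments so a path cannot escape its scope.
    escapes = posix.is_absolute() or ".." in posix.parts
    in_scope = any(posix.is_relative_to(allowed) for allowed in ALLOWED_DIRS)
    if escapes or not in_scope:
        violations.append(f"outside allowed directories: {path}")
    # Flag quoted literals assigned to secret-looking names.
    if SECRET_PATTERN.search(added_text):
        violations.append(f"possible secret in: {path}")
    return violations


print(check_change("src/billing/invoice.py", "total = subtotal + tax"))
print(check_change("infra/deploy.yaml", "replicas: 2"))
print(check_change("src/billing/client.py", 'API_KEY = "sk_live_1234567890"'))
```

A pattern match like this will miss some secrets and flag some harmless strings, so treat it as a fail-fast signal that backs up a secrets manager, not as a replacement for one.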

Engineers trust diffs that are bounded and verifiable. Guardrails make the PR smaller, the intent clearer, and the failure modes testable—so review becomes a checklist, not a debate.

Opt for a reusable Guardrails Ladder that your team can adopt to standardize autonomy without sacrificing compliance or speed.

Subagents as Your Review Team for Spec Checks, Risk Discovery, and Test Design

A single general agent is good at momentum. It is bad at critique. When one model both proposes the approach and judges it, you get confident blind spots.

Claude Code subagents let you split “doing” and “reviewing” into specialized roles with narrow mandates, so critique becomes structured and repeatable instead of conversational. PMs can treat this like an AI review org chart: small teams, clear responsibilities, crisp outputs.

To create a subagent in Claude Code, use the “/agents” command.

Subagent Org Chart

Spec-to-AC (Acceptance Criteria) checker.
1. Mission: Verify the plan or diff satisfies every acceptance criterion and does not expand scope.
2. Output: A checklist of ACs marked Pass/Fail with one-line evidence per item.
3. Prompt snippet:
  1. Compare this plan/diff to the ticket acceptance criteria.
  2. Mark each AC Pass/Fail and cite the exact file/line or plan step.
  3. List any scope creep as bullets.

Risk and edge-case hunter.
1. Mission: Surface failure modes, regressions, and operational risks before humans review.
2. Output: Top 5 risks with severity and the test or guardrail that would catch each.
3. Prompt snippet:
  1. Enumerate edge cases and regression risks from this change.
  2. Rank by severity and likelihood.
  3. Propose one test per risk.

Test designer.
1. Mission: Translate acceptance criteria into a minimal test plan that proves behavior.
2. Output: A test matrix mapping each AC to a test type and a target location.
3. Prompt snippet:
  1. For each AC, propose the smallest test that would fail before this change.
  2. Name the test type and likely file location.
  3. Flag any gaps where behavior is untestable.

Security and privacy reviewer.
1. Mission: Identify risky data handling, injection surfaces, secrets exposure, and unsafe logging.
2. Output: Findings grouped by category with recommended mitigations.
3. Prompt snippet:
  1. Scan for data ingress/egress, auth, secrets, and logging changes.
  2. List issues by category and severity.
  3. Suggest the minimal mitigation per issue.

Sequencing note: run subagents on the plan first (before execution), then rerun on the diff before PR creation to reduce noisy iterations and human review load.
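To keep the org chart consistent across runs, each role can live as data and be rendered into a prompt on demand. The sketch below is one possible shape, not Claude Code's subagent format: the `Subagent` class and `render_prompt` method are hypothetical names, and only the role text is taken from the chart above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subagent:
    name: str
    mission: str
    output: str
    steps: tuple[str, ...]

    def render_prompt(self, artifact_kind: str) -> str:
        """Build the reviewer prompt for a 'plan' or a 'diff'."""
        if artifact_kind not in ("plan", "diff"):
            raise ValueError("artifact_kind must be 'plan' or 'diff'")
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(self.steps, 1))
        return (
            f"Role: {self.name}\n"
            f"Mission: {self.mission}\n"
            f"Review target: the {artifact_kind}\n"
            f"Steps:\n{numbered}\n"
            f"Expected output: {self.output}"
        )


RISK_HUNTER = Subagent(
    name="Risk and edge-case hunter",
    mission="Surface failure modes, regressions, and operational risks before humans review.",
    output="Top 5 risks with severity and the test or guardrail that would catch each.",
    steps=(
        "Enumerate edge cases and regression risks from this change.",
        "Rank by severity and likelihood.",
        "Propose one test per risk.",
    ),
)

# Sequencing: review the plan first, then the diff before PR creation.
prompts = [RISK_HUNTER.render_prompt(kind) for kind in ("plan", "diff")]
```

The same role definition produces both prompts, so the plan-stage and diff-stage reviews stay comparable.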

Output: A copyable Subagent Org Chart that turns “AI review” into an internal review pipeline your engineers can trust.

Multi-Model Review to Catch Logic Gaps and Regressions Before Human Review

Multi-model review is a practical QA layer, not a philosophical stance. Different models carry different blind spots, so cross-model critique is a cheap way to catch logic gaps and regressions before a human ever opens the diff.

To make this repeatable, you do not “ask for a review.” You assemble a packet with the same sections for every PR, so reviewers can audit it quickly and know what to expect.

Claude Code is well-suited to generating this packet because it operates with direct repo context and workflows rather than detached chat snippets.

Check out this podcast from Lenny Rachitsky, where he and Zevi Arnovitz talk a great deal about how to use Claude Code and how he uses it to review code.

Below are examples of what to include in a review packet:

Plan summary.
Acceptance criteria.
Diff summary.
Test results.
Edge-case list.
Rollout/rollback.
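A packet is easier to keep consistent when its sections are fixed in code and an incomplete packet is rejected before it reaches a reviewer. The following is a minimal sketch; the `ReviewPacket` class and its field names are assumptions that mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class ReviewPacket:
    plan_summary: str = ""
    acceptance_criteria: str = ""
    diff_summary: str = ""
    test_results: str = ""
    edge_cases: str = ""
    rollout_rollback: str = ""

    def missing_sections(self) -> list[str]:
        """Names of sections that are empty or whitespace-only, in packet order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_complete(self) -> bool:
        return not self.missing_sections()
```

Calling `missing_sections()` before sending the packet to a second model gives a concrete list of what still has to be filled in.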

Here are examples of reviewer questions:

Does the change align with every acceptance criterion without scope creep?
Is the core logic correct under normal and edge-case paths?
Is error handling explicit, safe, and consistent with existing patterns?
Are there any security or privacy risks in data handling, secrets, or logging?
Are there performance footguns such as N+1 calls, expensive loops, or unbounded retries?
Are tests adequate, minimal, and clearly mapped to acceptance criteria?
Is rollback safe, fast, and realistic under incident pressure?

When reviewers disagree, the rule is simple: tests plus spec win. If the packet shows AC alignment and passing tests, prefer the path that preserves correctness and rollback safety. If risk is high or the change touches sensitive surfaces, escalate to a human reviewer immediately and narrow the scope rather than debating model opinions.
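The disagreement rule can be written down as a small decision function so it is applied the same way every time. This is a sketch under assumptions: the `ReviewState` fields and the three outcome labels are hypothetical, and the "rework" branch for failing tests or unmet ACs is an inference from "tests plus spec win" rather than something stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewState:
    acs_all_pass: bool       # every acceptance criterion is marked Pass
    tests_pass: bool         # the automated test suite is green
    high_risk: bool          # a reviewer flagged a high-severity risk
    touches_sensitive: bool  # sensitive surface, by your team's definition


def resolve(state: ReviewState) -> str:
    """Settle reviewer disagreement using the packet, not model opinions."""
    # High risk or sensitive surfaces go to a human immediately.
    if state.high_risk or state.touches_sensitive:
        return "escalate_to_human"
    # Assumption: without AC alignment and passing tests, the change goes back.
    if not (state.acs_all_pass and state.tests_pass):
        return "rework"
    return "proceed"
```

Checking escalation first means a passing test suite never overrides a high-risk flag.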

Output: A paste-ready Review Packet checklist you can drop into your PR template to make AI code review faster, safer, and more predictable for production reliability.

Conclusion

Start with the ticket, pass it through a plan gate, apply guardrails, run subagent review, run multi-model review, and then open the PR.

This is what reliable agentic engineering looks like in practice: not more output, but more control. PMs lead the gates and the quality system. You own the plan gate that converts ambiguity into an execution-ready spec. You define guardrails that bound autonomy into safe, verifiable changes. You design the Claude Code subagent reviewers so critique is structured and repeatable. You run a lightweight multi-model audit so humans see clean diffs, not surprises.

Tomorrow:

Pick one small ticket with clear acceptance criteria.
Run plan-first, apply your guardrails, then run subagent review and multi-model review before the PR.
Measure one outcome such as review time, rework cycles, or regression risk.

Protocol is better than heroics.