Building AI Products, Not Prototypes | Takeaways For Founders and Product Leaders
A production-first guide to opinionated workflows, environmental control, and evals that keep AI features reliable.
TLDR: This blog explains how to turn AI demos into durable products by choosing opinionated workflows, controlling the environment, designing for user understanding, and planning for maintenance. It covers data reality, dual-system architecture, evals, framework tradeoffs, and task decomposition—helping teams ship more reliable, debuggable, scalable AI features.
Introduction
Founder Intro: Building AI Products, Not Prototypes
One of the motivations behind Adaline Applied was simple: there’s a growing gap between what AI can demo and what AI can actually sustain in the real world.
Every week, we see impressive prototypes. Agents that look magical. Systems that feel powerful in isolation. And yet, when you talk to founders and operators trying to ship these systems into production, you hear a very different story — one defined by edge cases, trust issues, brittle workflows, and products that stall after their first moment of excitement.
Panel 2 was designed to sit directly in that tension.
Rather than asking what’s possible, we wanted to ask a harder question: What does it actually take to turn an AI prototype into a real product?
To explore that, we brought together builders operating at very different layers of the stack:
Aiden Bai, Co-founder & CEO at Same, building AI-native products with speed and opinionation from day one
Josh Payne, Founder & CEO at Coframe, translating AI capability into measurable business outcomes
Linus Lee, Engineer, AI at Thrive Capital, thinking deeply about interfaces, cognition, and long-term product truth
Matthew Rastovac, Director of AgentForce at Salesforce, shipping AI systems inside large, high-trust enterprise environments
What emerged was not a checklist or a framework — but a shared set of hard-earned lessons.
Again and again, the conversation returned to the same idea:
Most AI failures aren’t caused by weak models.
They’re caused by weak product decisions.
The sections that follow unpack what that really means in practice — why generality creates fragility, why user understanding matters as much as accuracy, why maintenance dominates prototyping, and why the hardest problems are still hard.
This write-up isn’t meant to prescribe a single way to build AI products. It’s meant to surface the patterns that consistently separate demos from durable systems.
If you’re building with AI today — or planning to — my hope is that this panel helps you make better decisions about what to build, how to build it, and when not to ship yet.
1. Generality Is Expensive — Opinionated Workflows Win
One of the strongest points of alignment across the panel was that generality is not a free abstraction. It has real, compounding cost—and that cost shows up fastest once a product leaves the demo environment.
Early on, many AI teams are drawn to building systems that are:
Highly flexible.
Broadly applicable.
Capable of handling many use cases.
Impressive in demos.
This instinct makes sense. General systems feel powerful. They look future-proof. They suggest unlimited upside.
But as multiple speakers emphasized, that flexibility quickly becomes a liability once real users are involved.
When Systems Are Too General, the Model Becomes the Product Designer
As Aiden Bai pointed out, overly general systems force the model to make decisions the product team hasn’t made. The model must infer:
What does the user actually want?
Which constraints matter?
How should actions be sequenced?
What does “correct” look like?
At the same time, the user is left guessing how to use the product successfully. The result isn’t intelligence—it’s ambiguity.
General systems push cognitive load onto both sides:
The model gets too many degrees of freedom.
The user gets too little guidance.
Neither wins consistently.
In Production, Flexibility Turns Into Fragility
This tradeoff becomes even more pronounced at scale.
Matthew Rastovac, speaking from the perspective of shipping agent systems inside Salesforce, described how generality breaks down quickly in enterprise environments. The more freedom an agent has, the harder it becomes to guarantee predictable behavior—and predictability is non-negotiable when trust is on the line.
Even when a system is technically capable, inconsistent behavior erodes confidence fast. In enterprise settings, users don’t tolerate surprises—especially from software that claims intelligence.
Generality Also Hurts Monetization
Josh Payne highlighted a parallel failure mode from the commercial side.
At Coframe, systems designed to be flexible across many customer use cases became:
Harder to explain.
Harder to position.
Harder to tie to concrete metrics.
When outputs vary too widely, customers struggle to understand why the product is valuable. And if value can’t be explained, it can’t be measured—which makes it nearly impossible to sell or scale.
Generality, in this sense, doesn’t just hurt reliability. It hurts revenue.
Opinionation Is How Products Take Control Back
Across these anecdotes, a consistent pattern emerged:
The more general the system, the more responsibility is abdicated to the model—and the less control the product team retains.
By contrast, the AI products that successfully crossed from prototype to production looked very different. They were highly opinionated.
Aiden described how real progress came not from adding flexibility, but from removing it. Teams narrowed the scope. They encoded domain assumptions directly into workflows. They removed optionality. They chose depth over breadth.
Instead of asking the model to figure everything out, they asked a different question:
“What decisions should the product make so the model doesn’t have to?”
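To make that concrete, here is a minimal sketch of what “the product decides so the model doesn’t have to” can look like in code. The task, the allowed tones, and the helper names are illustrative assumptions for this write-up, not a system any panelist described.

```python
# A minimal sketch (not from the panel) contrasting an open-ended call with an
# opinionated workflow. `call_model` is a placeholder for whatever LLM client
# you use; the task, schema, and allowed tones are illustrative assumptions.
from dataclasses import dataclass

ALLOWED_TONES = {"neutral", "friendly", "formal"}  # a product decision, not a model decision


@dataclass
class DraftRequest:
    customer_name: str
    issue_summary: str
    tone: str = "neutral"  # constrained input with a sensible default


def call_model(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError


def generic_assistant(user_text: str) -> str:
    # Too general: the model must infer intent, constraints, sequencing, and
    # what "correct" even means.
    return call_model(f"Help the user with this: {user_text}")


def draft_support_reply(req: DraftRequest) -> str:
    # Opinionated: the product has already decided what the task is, which
    # inputs it needs, and what an acceptable output looks like.
    if req.tone not in ALLOWED_TONES:
        raise ValueError(f"tone must be one of {sorted(ALLOWED_TONES)}")
    prompt = (
        "Write a support reply.\n"
        f"Customer: {req.customer_name}\n"
        f"Issue: {req.issue_summary}\n"
        f"Tone: {req.tone}\n"
        "Constraints: under 120 words, no refund promises, end with a clear next step."
    )
    return call_model(prompt)
```

The specific prompt doesn’t matter. What matters is that scope, inputs, and acceptance criteria live in product code, where the team can see them, test them, and change them.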
Opinionated Systems Teach Users How to Succeed
This idea surfaced again when Linus Lee spoke about interfaces and cognition. Every AI product teaches users how to think with it—whether intentionally or not:
General systems teach uncertainty.
Opinionated systems teach clarity.
When workflows are explicit:
Users learn faster.
Trust builds more quickly.
Success becomes repeatable.
The product becomes legible instead of mysterious.
Why Specificity Wins in the Real World
In practice, the difference is stark.
General systems tend to:
Produce unpredictable outputs.
Fail in subtle, hard-to-debug ways.
Create a UX that’s difficult to explain.
Erode trust through inconsistency.
Opinionated systems tend to:
Surface fewer failure modes.
Make success repeatable.
Clarify what “good usage” looks like.
Feel reliable even when the model isn’t perfect.
As one speaker noted during the session:
“Models don’t fail gracefully—products have to make them fail gracefully.”
That only happens when constraints are intentional.
Prototypes Need Breadth. Products Need Structure.
This led to one of the clearest takeaways of the panel:
Generality makes prototypes impressive.
Specificity makes products usable.
Prototypes exist to explore what’s possible. Products exist to work—repeatedly, for real users, in real conditions.
Opinionation isn’t premature optimization. It’s the mechanism by which AI systems become dependable.
Teams that delay opinionation often end up retrofitting guardrails onto systems that were never designed to support them. Teams that embrace it early build foundations that scale.
The Quiet Contrarian Insight
In a landscape obsessed with flexibility and “AI that can do everything,” this panel offered a quieter, more durable insight:
The path from prototype to product isn’t paved with more general intelligence. It’s paved with tighter workflows, clearer assumptions, and intentional constraints.
2. Control Over the Environment Determines Feasibility
Another sharp dividing line between prototypes and real products emerged quickly in this panel: how much control the system has over its environment.
Across very different domains—product workflows, enterprise systems, and creative tooling—the same pattern repeated:
“AI systems perform dramatically better when the environment is constrained.”
When teams control:
The inputs.
The structure of the task.
The available actions.
The shape of the output.
AI systems feel capable, reliable, and even “smart.” When they don’t, reliability collapses fast.
AI Thrives in Structured Worlds
Several speakers described how early success almost always happened in environments where:
Inputs were owned or normalized.
Patterns repeated frequently.
Constraints were known ahead of time.
Failure modes were visible and enumerable.
In these settings, models didn’t need to reason from first principles every time. They could operate within guardrails.
As Aiden Bai noted, many early prototypes feel magical precisely because they live in these controlled worlds. The system works on clean data, predictable tasks, and narrow problem definitions. It’s not that the model is unusually capable—it’s that the environment is unusually forgiving.
This creates a dangerous illusion.
Leaving the Sandbox Is Where Things Break
The moment these systems leave controlled environments, cracks appear.
Matthew Rastovac spoke directly to this from an enterprise perspective. Once AI systems interact with real customer data, real workflows, and real organizational complexity, unpredictability spikes. Inputs aren’t clean. Processes aren’t linear. Edge cases aren’t rare—they’re constant.
Enterprise systems introduce:
Inconsistent schemas.
Legacy workflows.
Partial permissions.
Conflicting sources of truth.
Human-driven exceptions.
In those conditions, even strong models struggle—not because they lack intelligence, but because they’re being asked to operate without a stable frame of reference.
Arbitrary Inputs Are the Enemy of Reliability
This challenge becomes even more pronounced in domains like code, content, and knowledge work.
Josh Payne described how systems that worked well on curated examples failed once exposed to the diversity of real customer data. What looked robust in testing collapsed under the weight of:
Messy inputs.
Inconsistent structure.
Unclear user intent.
These failures weren’t dramatic crashes. They were subtle. Outputs were “almost” right—just wrong enough to erode trust.
And because the failures were inconsistent, they were hard to debug and even harder to explain to users.
Toy Examples Hide Real Constraints
One of the most consistent failure modes discussed on the panel was over-reliance on toy examples.
Many impressive prototypes fail because:
They’re built on idealized data.
They assume cooperative users.
They ignore edge cases.
They avoid ambiguous scenarios.
These prototypes answer the question: “Can the model do this?”
Production systems must answer a harder one: “Can the system do this reliably, every day, for imperfect users, under imperfect conditions?”
That gap is where most AI products fail.
Environmental Control Beats Model Power
A key insight from the panel was that model capability is often the wrong lever to pull.
Teams instinctively respond to failures by:
Switching models.
Increasing context windows.
Tuning prompts.
Layering complexity.
But as multiple speakers emphasized, these changes rarely fix the root problem. The issue is not intelligence—it’s exposure.
Systems fail because:
They’re asked to handle arbitrary inputs.
They lack clear task boundaries.
They don’t know which failures matter.
They don’t control how work enters the system.
As Linus Lee framed it, feasibility is not just a modeling question—it’s a product and interface question. The more a system can shape the environment it operates in, the less it has to rely on raw reasoning.
Task Framing Is the Hidden Superpower
The most successful teams on the panel didn’t try to eliminate environmental complexity. They absorbed it into the product design.
They:
Pre-processed inputs.
Guided users into structured flows.
Constrained actions intentionally.
Limited surface area for failure.
By doing so, they reduced the cognitive burden on the model and increased consistency for users.
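As a rough illustration (the normalization rules and the action list below are invented for this example, not any panelist’s system), absorbing environmental complexity often looks less like prompt engineering and more like ordinary input shaping plus a closed set of actions:

```python
# Sketch: the product owns the environment. Inputs are normalized before the
# model ever sees them, and the model can only choose from a closed action set.
# All names here are illustrative assumptions.
import re

ALLOWED_ACTIONS = {"summarize", "categorize", "escalate_to_human"}


def normalize_ticket(raw: dict) -> dict:
    """Reduce arbitrary inputs to a predictable shape before any model call."""
    body = re.sub(r"\s+", " ", (raw.get("body") or "")).strip()
    return {
        "body": body[:4000],  # bounded length
        "channel": raw.get("channel", "unknown"),
        "has_attachment": bool(raw.get("attachments")),
    }


def choose_action(model_output: str) -> str:
    """The model proposes; the product only accepts actions it knows about."""
    action = model_output.strip().lower()
    return action if action in ALLOWED_ACTIONS else "escalate_to_human"
```

Nothing here is clever, and that is the point: the model’s job shrinks because the product decides what can enter the system and what can come out of it.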
The takeaway was clear:
“Production success depends less on model power and more on environmental control and task framing.”
When teams own the environment, AI looks capable. When they don’t, even the best models look unreliable.
The Practical Implication
If Section 1 argued that opinionated workflows win, Section 2 explains why.
Opinionation isn’t just about UX. It’s about feasibility.
The fastest path from prototype to product is not giving AI more freedom—it’s deciding, deliberately, where freedom is dangerous and structure is necessary.
3. User Understanding Is as Important as Model Accuracy
Several of the failures described on this panel had nothing to do with model quality.
The systems technically worked. The outputs were often reasonable. The models were capable. And yet—users still failed.
This surfaced a critical distinction the panel kept returning to:
“Many AI product failures are UX failures, not technical ones.”
When Users Don’t Know How to Succeed, Accuracy Doesn’t Matter
Across multiple anecdotes, speakers described situations where:
Users didn’t know what to ask.
Users didn’t know how to phrase inputs.
Users didn’t understand what the system could or couldn’t do.
Users couldn’t tell whether an output was “good.”
Even when the system produced correct or useful responses, users lacked confidence in how to use it.
As Aiden Bai explained, this creates a subtle but fatal problem: users blame themselves. When they don’t know whether they’re using a product correctly, they stop experimenting. They hesitate. Eventually, they churn—not because the system failed, but because success felt accidental.
Hidden Mental Models Kill Adoption
A recurring theme was the danger of implicit mental models.
Many AI products assume users will intuitively understand:
What kinds of inputs work best.
How much context to provide.
When the model is confident versus guessing.
Where the system’s boundaries are.
But as Linus Lee emphasized, users don’t arrive with the product team’s mental model. When success depends on unspoken rules, only power users thrive—everyone else quietly fails.
This creates a false signal:
The product “works” for a small group.
Engagement looks healthy on the surface.
But learning doesn’t spread.
Without explicit guidance, the system becomes brittle outside of expert hands.
Affordances Matter More Than Capability
Several panelists stressed that capability is useless if affordances are unclear.
Matthew Rastovac described this tension in enterprise contexts. Even highly capable agent systems struggled when users couldn’t predict behavior or understand why certain actions were taken.
In those environments, confusion is indistinguishable from risk, and risk is unacceptable.
When affordances are unclear:
Users hesitate to rely on outputs.
Teams introduce manual checks.
Automation stalls.
Trust erodes.
The system doesn’t need to be perfect. It needs to be legible.
Trust Depends on Understanding, Not Just Accuracy
Josh Payne framed this from a commercial perspective. Customers don’t just want correct outputs. They want to understand why the product helps them.
If users can’t explain the value of a system to a colleague, adoption doesn’t spread, and renewal becomes fragile.
Trust, in this sense, isn’t about correctness alone. It’s about predictability, explanation, and confidence.
Users trust systems they can reason about—even if those systems are imperfect.
“Good Usage” Must Be Taught, Not Discovered
One of the clearest lessons from the panel was that good usage doesn’t emerge naturally in AI products.
If users must discover:
What to ask.
How to phrase inputs.
How to evaluate outputs.
When to intervene.
Most of them won’t.
Successful teams made good usage explicit. They:
Constrained inputs.
Provided examples.
Guided first actions.
Surfaced boundaries clearly.
They didn’t assume users would figure it out.
If Users Can’t Explain It, the Product Doesn’t Exist
This led to one of the bluntest conclusions of the panel:
“If users can’t explain how your product helps them, the product doesn’t exist.”
Accuracy alone doesn’t create understanding.
Understanding creates confidence. Confidence creates habit.
Without that chain, even technically impressive systems fail to become products.
The Practical Takeaway
Model accuracy matters, but user comprehension determines whether accuracy is ever experienced.
The teams that succeeded didn’t just build smarter systems. They built systems that taught users how to succeed.
In AI products, clarity is not a UX polish. It’s a core capability.
4. Prototypes Are Cheap — Maintenance Is the Real Cost
One of the most sobering insights from the panel was that AI has made prototyping deceptively easy.
With modern models, teams can:
Stand up impressive demos in days.
Chain together workflows quickly.
Simulate “end-state” product behavior early.
This is a genuine gift. It dramatically lowers the barrier to exploration.
But as multiple speakers warned, it’s also a trap.
The Dangerous Pattern AI Enables
The panel described a pattern that has become increasingly common:
Teams prototype quickly.
Early demos look strong.
Features gain internal and external momentum.
The system gets shipped.
Long-term maintenance costs quietly explode.
Because AI prototypes look so close to finished products, teams often skip a crucial step: asking whether the system is worth maintaining.
As Aiden Bai noted, many teams now treat “we can build this” as sufficient justification to ship. But in AI, feasibility and sustainability are very different questions.
Shipping Is a Commitment, Not a Milestone
Once an AI feature ships, it stops being an experiment.
It becomes:
Something users rely on.
Something customers expect to improve.
Something that must remain stable.
Something that must adapt as models change.
Matthew Rastovac emphasized this from an enterprise perspective. In large organizations, every shipped capability creates an implicit contract. Even “experimental” features quickly become assumed infrastructure.
Removing or degrading them later is far harder than never shipping them at all.
The cost of reversal is high—both technically and politically.
AI Features Age Faster Than Traditional Software
Another key distinction surfaced on the panel: AI features don’t stay still.
Unlike traditional software, AI systems must evolve alongside:
Changing model behavior.
Shifting user expectations.
New failure modes.
Emerging best practices.
What worked six months ago may feel broken today—not because the system regressed, but because the surrounding ecosystem moved.
As Josh Payne pointed out, this makes AI features uniquely expensive to maintain. They require continuous reevaluation, not occasional updates.
Without active stewardship, quality decays silently.
Debugging Gets Harder Over Time, Not Easier
Several speakers also highlighted how maintenance cost compounds in non-obvious ways.
Early on:
Failures are obvious.
Edge cases are limited.
The system’s behavior is still well understood.
Over time:
Failures become subtle.
Behavior drifts.
Assumptions break.
No one fully remembers why decisions were made.
Debugging shifts from “what broke?” to “why does this behave like this at all?”
That transition is where many AI products stall.
The Question Teams Rarely Ask
All of this led to one of the most important reframes of the panel.
The real question isn’t:
“Can we build this?”
With modern AI, the answer is almost always yes.
The real question is:
“Are we willing to maintain this for years?”
That means being willing to:
Own its failures.
Evolve it as models change.
Explain it to users repeatedly.
Defend it internally.
Invest in its long-term quality.
If the answer is no, shipping the prototype is often a mistake—no matter how impressive it looks.
A More Disciplined Definition of Speed
This insight ties directly back to the panel’s broader theme: real speed is long-term speed.
Shipping something that creates drag six months later is not velocity. It’s debt.
The teams that succeed don’t ship fewer prototypes. They ship fewer commitments.
They explore aggressively, but commit selectively.
The Practical Takeaway
AI makes it easy to build things. It does not make owning them easy.
Teams that treat every prototype as a potential long-term system make different decisions:
They constrain the scope earlier.
They delay shipping until maintenance is understood.
They design for evolution, not just launch.
In an era where prototypes are cheap, judgment about what to ship becomes the real competitive advantage.
5. Data Reality Beats Synthetic Optimism
Many of the production failures discussed on this panel didn’t stem from model weakness.
They stemmed from a data mismatch.
Again and again, speakers described the same underlying issue: systems that looked impressive in controlled testing environments broke down almost immediately when exposed to real-world data.
The problem wasn’t intelligence. It was optimism.
The Comfort of Clean Data
AI prototypes are often built on data that is:
Clean.
Structured.
Well-labeled.
Internally generated.
Carefully curated.
This makes early progress feel smooth. Outputs look coherent. Failure rates appear low. The system feels “ready.”
But as Josh Payne noted, this creates a false sense of confidence. Clean data hides the very conditions that define production environments: ambiguity, inconsistency, and noise.
Synthetic data, in particular, tends to encode the assumptions of the team that generated it. That makes it useful for testing logic, but dangerous for validating feasibility.
Real Data Is Messy — And Honest
Once systems encounter real user data, the illusion collapses.
Matthew Rastovac described how quickly edge cases surface inside enterprise systems. Inputs arrive partially filled, inconsistently formatted, or shaped by legacy processes no one fully understands.
Data sources conflict. Human behavior introduces exceptions that no synthetic dataset anticipates.
In those conditions:
Models hallucinate more often.
Confidence signals break down.
Workflows fail silently.
Trust erodes.
These failures aren’t rare. They’re immediate.
Generalization Is Not Guaranteed
A critical mistake surfaced repeatedly in the panel: assuming that strong performance on one dataset implies strong performance everywhere.
As Aiden Bai pointed out, model behavior is highly sensitive to distribution shifts. What works well on curated inputs can fail dramatically when:
Vocabulary changes.
Structure degrades.
Context is incomplete.
User intent is unclear.
Generalization is not automatic, and in many cases, it never arrives without deliberate intervention.
Feasibility Must Be Proven Early
One of the strongest recommendations from the panel was simple, but uncomfortable:
“Use real data as early as possible.”
Not after the prototype. Not after the demo. Not after initial traction.
Early feasibility checks save enormous downstream cost. They reveal:
Whether the problem is actually solvable.
Where constraints need to be added.
How much preprocessing is required.
Which failure modes matter most.
Teams that delay real-data testing often spend months optimizing systems that were never viable in the first place.
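One lightweight way to act on this, sketched below under assumed names rather than as a prescribed process, is a feasibility pass that runs the current prototype over a sample of real records and tallies how it fails before any roadmap commitment is made:

```python
# Sketch of an early feasibility pass over real (not synthetic) inputs.
# `run_pipeline` and the failure categories are assumptions for illustration.
from collections import Counter


def run_pipeline(record: dict) -> dict:
    """Placeholder for the prototype being evaluated."""
    raise NotImplementedError


def feasibility_report(real_records: list[dict]) -> Counter:
    outcomes = Counter()
    for record in real_records:
        try:
            result = run_pipeline(record)
        except Exception:
            outcomes["hard_failure"] += 1
            continue
        if not result.get("output"):
            outcomes["empty_output"] += 1
        elif result.get("needed_manual_fix"):
            outcomes["almost_right"] += 1  # the trust-eroding failure mode
        else:
            outcomes["acceptable"] += 1
    return outcomes
```

A few hundred real records and a failure histogram are usually enough to tell whether the problem needs more constraints, more preprocessing, or more time.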
Edge Cases Aren’t Edge Cases
Another subtle but important point: in production, edge cases stop being edges.
Once a system is deployed:
Rare inputs appear regularly.
Unexpected usage becomes normal.
Misuse becomes common.
Ambiguity becomes the default.
As Linus Lee framed it, production environments don’t just surface edge cases. They invert them.
What seemed unlikely in testing becomes inevitable in the wild.
This is why synthetic optimism fails so reliably. It prepares teams for best-case scenarios in a world dominated by worst-case inputs.
Garbage In Still Applies — Faster Than Ever
The panel returned to an old truth, with a modern twist:
“Garbage in, garbage out still applies. AI just makes the consequences arrive faster.”
Bad data doesn’t just degrade performance. It accelerates failure.
Because AI systems act confidently even when they’re wrong, poor inputs don’t produce obvious crashes. They produce plausible errors—the most dangerous kind.
The Practical Takeaway
Teams that succeed don’t avoid messy data. They confront it immediately.
They:
Test with real inputs early.
Design workflows to absorb noise.
Constrain what data is allowed in.
Surface uncertainty explicitly.
Build around failure, not perfection.
In AI products, optimism is expensive.
Reality is cheaper—if you face it early.
6. Building an AI Product Means Building Two Things
One of the most important conceptual frameworks to emerge from the panel was deceptively simple:
“When you ship an AI product, you are building two systems at once.”
Most teams only focus on the first.
The First System: The Product Users See
The first system is the obvious one:
The interface.
The workflows.
The outputs.
The features customers interact with.
This is the artifact teams demo, launch, and market. It’s where most effort is visibly spent, and where most AI conversations begin.
But as the panel made clear, this system alone is not enough.
The Second System: The One That Keeps the First Alive
The second system is quieter, less visible, and far more decisive.
It’s the organizational system that:
Observes how the product behaves in the wild.
Detects when outputs degrade.
Understands why failures occur.
Enables safe iteration.
Evolves as models and user expectations change.
This system doesn’t ship to customers, but without it, the customer-facing product inevitably decays.
As Matthew Rastovac emphasized from an enterprise standpoint, AI products don’t just require ongoing support. They require continuous interpretation.
Outputs need context. Failures need explanation. And teams need mechanisms to decide when a system is “good enough” versus when it’s quietly drifting.
Why AI Products Are Fundamentally Different
In traditional software, the rules are relatively stable:
Logic is deterministic.
Behavior changes only when engineers change it.
Best practices evolve slowly.
AI breaks all three assumptions.
As several speakers noted:
Model behavior can shift without code changes.
Upgrades introduce new capabilities and new regressions.
User expectations evolve as AI becomes more commonplace.
Yesterday’s “impressive” becomes today’s “table stakes.”
This means AI products don’t just age. They mutate.
Without a strong second system in place, teams lose the ability to reason about what’s happening inside their own product.
The Invisible Work That Actually Determines Success
When the panel discussed teams that successfully shipped AI products at scale, the conversation quickly moved away from prompts and models and toward internal processes.
Long-term success depended far more on:
Observability into real-world usage.
Fast feedback loops.
Clear ownership of failure modes.
Evaluation infrastructure that evolves over time.
Teams that actively learn from mistakes.
As Aiden Bai noted, teams that move quickly without these systems often appear productive, until suddenly they aren’t.
Progress stalls not because the product is bad, but because no one can confidently change it anymore.
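A small but concrete piece of that second system can be structured logging of every model interaction, so later questions like “when did quality drift?” or “which version produced this output?” have data behind them. The field names and the JSONL sink below are assumptions chosen for illustration:

```python
# Sketch: log every model interaction with enough context to reason about it
# later. The field names and the JSONL sink are illustrative assumptions.
import json
import time
import uuid


def log_interaction(path: str, *, prompt_version: str, model_id: str,
                    user_input: str, output: str, latency_s: float) -> str:
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,  # ties behavior changes to releases
        "model_id": model_id,              # model upgrades are a common source of drift
        "input": user_input,
        "output": output,
        "latency_s": latency_s,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```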
Iteration Without Understanding Is Just Thrash
Another key insight was that iteration alone is not a virtue.
Teams can ship frequently and still move backward if they:
Don’t understand why changes help or hurt.
Lack signal on output quality.
Can’t trace failures to causes.
Don’t know which metrics actually matter.
This is where the second system earns its keep. It transforms iteration from guesswork into learning.
As Linus Lee framed it, the real challenge isn’t building intelligence. It’s building understanding around intelligence.
Without shared understanding inside the team, velocity collapses into churn.
Evaluation Is a Living System, Not a One-Time Setup
Evaluation came up repeatedly as a core part of this second system, but with an important caveat.
Evals are not something you “set and forget.”
They must:
Evolve as the product evolves.
Reflect real user behavior.
Adapt to new use cases.
Change as expectations change.
Static evals freeze assumptions in time. Living evals encode learning.
Teams that treated evaluation as infrastructure, not tooling, were better positioned to move fast without breaking trust.
The Real Competitive Advantage
By the end of the discussion, a clear pattern had emerged.
The most successful AI teams weren’t the ones with:
The biggest models.
The cleverest prompts.
The most impressive demos.
They were the ones with:
Tight learning loops.
Strong internal feedback.
Clear ownership.
The ability to change their product with confidence.
In other words, they built organizations that could evolve as quickly as their technology.
The Practical Takeaway
AI products are not static artifacts. They are living systems.
And living systems require:
Observation.
Care.
Feedback.
Adaptation.
If you only build the product users see, you will eventually lose control of it.
If you build the second system—the one that understands, evaluates, and evolves the first—you earn the right to ship AI into the real world.
7. Evals Are Automation — Not Truth
Evaluations came up repeatedly on the panel, but not in the way many teams expect.
Rather than positioning evals as a silver bullet, the speakers shared a more cautious, experience-earned view:
“Evals scale insight, but they reduce resolution.”
They are powerful tools. They are also blunt instruments.
Understanding that tradeoff is critical to building AI products that improve over time instead of calcifying prematurely.
What Evals Are Actually Good At
At their best, evals do three things extremely well:
They automate human judgment.
They enable iteration at scale.
They prevent regressions.
Several speakers described evals as essential guardrails. They make sure teams don’t move backward as systems evolve.
But guardrails are not maps.
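As a sketch of the guardrail use (the proxy score and the 0.7 threshold below are assumptions, not a recommended metric), a regression eval compares the system against a small golden set, blocks silent backsliding, and routes borderline cases to a human instead of treating the score as truth:

```python
# Sketch: an eval as a regression guardrail, not a source of truth.
# `generate` is a placeholder; the proxy score and 0.7 threshold are assumptions.
def generate(prompt: str) -> str:
    """The system under test."""
    raise NotImplementedError


def proxy_score(output: str, reference: str) -> float:
    """Cheap, deliberately low-resolution score in [0, 1]."""
    ref_tokens = set(reference.lower().split())
    out_tokens = set(output.lower().split())
    return len(ref_tokens & out_tokens) / max(len(ref_tokens), 1)


def regression_eval(golden_set: list[dict], threshold: float = 0.7):
    failures, review_queue = [], []
    for case in golden_set:
        output = generate(case["prompt"])
        s = proxy_score(output, case["reference"])
        if s < threshold:
            failures.append(case["id"])
        if s < threshold + 0.1:  # failed or borderline: humans still look at it
            review_queue.append((case["id"], output))
    return failures, review_queue
```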
Where Evals Quietly Fail
The panel was equally clear about what evals don’t do well.
Evals:
Rely on proxy signals.
Encode assumptions that may be wrong.
Flatten nuance into binary scores.
Struggle with edge cases.
Fail to capture intent, context, or taste.
As Daksh Gupta noted elsewhere in the event, once an eval exists, teams tend to optimize for it, even when it no longer reflects reality.
What started as a helpful abstraction slowly becomes a constraint on thinking.
The risk isn’t that evals are inaccurate.
The risk is that they are confidently incomplete.
Resolution vs Scale Is a Real Tradeoff
A key mental model that emerged was the idea of resolution.
Human review has:
High resolution.
Strong intuition.
Deep contextual awareness.
But it doesn’t scale.
Evals, by contrast, have:
Massive scale.
Consistency.
Speed.
But low resolution.
As Linus Lee framed it during the discussion, evals compress complex judgment into simplified signals. That compression is useful, but it necessarily discards information.
The mistake teams make is assuming compression is harmless.
Evals Can Freeze Bad Assumptions
Several speakers warned about introducing evals too early.
When evals are created before:
Failure modes are understood.
Good usage is well defined.
The product has stabilized.
They tend to encode guesses, not knowledge.
From that point on:
The system optimizes toward the eval.
Exploration slows.
Unexpected behaviors are suppressed.
Real learning stalls.
What looks like progress is often just alignment with an incomplete metric.
How Great Teams Actually Use Evals
The most effective teams on the panel treated evals very differently.
They used evals as:
Learning accelerators, not arbiters of truth.
Ways to scale known insights, not discover new ones.
Safety nets, not steering mechanisms.
Human judgment remained central.
Teams continued to:
Review real outputs.
Talk to users directly.
Interrogate surprising behavior.
Revisit eval criteria frequently.
Evals didn’t replace judgment. They made judgment faster and more focused.
The Real Goal of Evaluation
This led to one of the cleanest reframes of the panel:
“The goal of evals isn’t perfection. It’s making humans faster at understanding where models fail.”
Perfection is a mirage. Understanding is durable.
When evals are used to surface where to look, not what to believe, they unlock speed without sacrificing insight.
The Practical Takeaway
Evals are infrastructure, not intelligence.
They are most powerful when:
Grounded in deep domain understanding.
Updated as products evolve.
Paired with continuous human review.
Treated as provisional, not absolute.
Teams that mistake evals for truth slow themselves down.
Teams that use evals to amplify learning move faster and with confidence.
8. Frameworks Encode Values — Choose Carefully
One of the quieter, but most consequential insights from the panel was that framework choice is not a neutral technical decision.
It’s philosophical.
Frameworks don’t just provide abstractions. They encode:
Assumptions about how work should be done.
Values about speed versus safety.
Opinions about who the product is for.
Mental models about how systems should evolve.
When teams adopt a framework, they’re not just choosing tooling. They’re choosing a worldview.
Frameworks Optimize for Something — Always
Several speakers noted that most modern AI frameworks are optimized for a specific phase of development.
Common priorities include:
Speed of iteration.
Ease of onboarding.
Approachability for new users.
Rapid prototyping.
These are not bad goals. In fact, they’re often exactly what teams need early on.
But as the panel emphasized, those same values frequently come into conflict with what production systems require.
What Prototyping Frameworks Often Trade Away
Frameworks designed for speed and flexibility tend to de-emphasize:
Robustness.
Explicit control.
Debuggability.
Long-term evolvability.
Early on, these tradeoffs are invisible. Everything works. Changes are easy. Velocity feels high.
Over time, however, the costs surface.
As Linus Lee pointed out, abstraction layers that hide complexity also hide causality. When something goes wrong, teams struggle to understand why.
Behavior becomes emergent rather than intentional. Debugging shifts from reasoning to guesswork.
The framework didn’t break. It did exactly what it was designed to do.
Tooling Shapes How Teams Think
A subtle, but important point emerged during the discussion: frameworks don’t just shape systems. They shape teams.
They influence:
How problems are framed.
Where teams look for solutions.
Which tradeoffs feel “normal.”
What kinds of questions get asked.
Frameworks optimized for rapid demos encourage experimentation and breadth. Frameworks optimized for production encourage constraint, observability, and discipline.
Neither is universally correct. But mismatches are costly.
Speed Now vs Speed Later
Several speakers highlighted a recurring mistake: optimizing for early velocity at the expense of future movement.
Frameworks that make it easy to ship quickly, but hard to change direction or painful to evolve, often impose invisible ceilings on long-term speed.
As Aiden Bai noted earlier in the panel, the fastest teams long-term are not the ones that move fastest on day one. They’re the ones that preserve optionality.
Framework choice plays a large role in whether that optionality exists.
Adoption Is a Commitment
Once a framework is deeply embedded:
Workflows form around it.
Team expertise concentrates within it.
Migration costs rise.
Architectural decisions harden.
At that point, changing frameworks is no longer a refactor. It’s a replatforming.
This is why the panel encouraged teams to treat framework adoption with the same seriousness as core architectural decisions.
The Practical Reframe
The panel offered a simple, but powerful way to think about frameworks:
“Ask not just ‘What does this framework make easy?’ Ask ‘What does it make hard?’”
Every framework makes something difficult:
Introspection.
Control.
Customization.
Evolution.
Those tradeoffs only become painful when the product outgrows the framework’s original intent.
The Practical Takeaway
Frameworks are leverage, but leverage cuts both ways.
They can:
Accelerate learning.
Reduce boilerplate.
Enable fast exploration.
They can also:
Constrain thinking.
Obscure failure modes.
Slow adaptation over time.
Choosing a framework means choosing which problems you want to solve, and which problems you’re willing to inherit later.
In AI products, where change is constant and certainty is rare, that choice deserves more intention than it usually gets.
9. Task Decomposition Beats End-to-End Autonomy
One of the most consistent reframes on the panel challenged a question many AI teams instinctively ask:
“Can an agent do this end-to-end?”
The panel suggested a better one:
“How should this task be factored between the human and the model?”
That shift, from autonomy to decomposition, turned out to be decisive.
End-to-End Autonomy Is a Fragile Goal
Several speakers described early attempts to build fully autonomous agents that could:
Take a vague input.
Reason through a complex task.
Execute multiple steps.
Deliver a finished result.
These systems often looked impressive in demos. But they failed in production for predictable reasons.
End-to-end autonomy concentrates too much responsibility in one place:
Intent interpretation.
Decision-making.
Execution.
Error handling.
When something goes wrong, there’s no clear boundary for intervention. Failure becomes opaque. Trust collapses quickly.
As one panelist noted:
“Users don’t mind AI helping. They mind AI disappearing into a black box.”
Decomposition Creates Control
By contrast, the teams that found real success broke complex tasks into smaller, legible pieces.
Instead of a single autonomous flow, they designed systems with:
Intermediate artifacts.
Assistive steps.
Explicit checkpoints.
Explainable outputs.
Each step answered a narrower question. Each output gave the user something concrete to react to.
This approach didn’t just improve reliability. It improved collaboration.
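To illustrate the shape of this, here is a minimal sketch of a decomposed flow with an explicit checkpoint; the step names, data structures, and approval hook are invented for the example, not any panelist’s product:

```python
# Sketch: decomposition with an explicit checkpoint instead of end-to-end autonomy.
# `call_model` and the approval hook are placeholders, not a real product's API.
from dataclasses import dataclass, field


def call_model(prompt: str) -> str:
    raise NotImplementedError


@dataclass
class Draft:
    outline: str = ""
    body: str = ""
    approvals: list = field(default_factory=list)  # audit trail of checkpoints


def propose_outline(topic: str) -> Draft:
    return Draft(outline=call_model(f"Propose a short outline for: {topic}"))


def human_approves(artifact: str) -> bool:
    """Checkpoint: in a real product this is a UI step, not a terminal prompt."""
    return input(f"Approve this?\n{artifact}\n[y/N] ").strip().lower() == "y"


def write_from_outline(draft: Draft) -> Draft:
    if not human_approves(draft.outline):
        raise RuntimeError("Outline rejected; stop before doing more work.")
    draft.approvals.append("outline")
    draft.body = call_model(f"Write the piece following this outline:\n{draft.outline}")
    return draft
```

Each step produces an artifact the user can inspect, and the trail of approvals makes it obvious where to intervene when something goes wrong.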
Humans Are Better Judges Than Executors
A recurring insight was that humans and models excel at different parts of the workflow.
Models are strong at:
Pattern recognition.
Synthesis.
Drafting.
Proposing options.
Humans are strong at:
Judgment.
Context.
Prioritization.
Responsibility.
When tasks are decomposed intentionally, each party does what it does best.
As Aiden Bai pointed out earlier in the panel, systems that treat AI as a collaborator, not a replacement, tend to scale better.
Users feel in control. They understand where the system helps and where it defers.
Explainability Drives Trust
Matthew Rastovac emphasized that explainability isn’t a compliance requirement. It’s a usability requirement.
When users can see:
How outputs were generated.
What assumptions were made.
Where uncertainty exists.
They’re far more likely to trust the system, even when it makes mistakes.
Decomposed systems surface reasoning naturally, because each step has a purpose.
End-to-end systems hide reasoning, because there’s nowhere to expose it without breaking the illusion of autonomy.
Adoption Follows Legibility
Another theme that emerged was adoption speed.
Systems that relied on full autonomy:
Required more onboarding.
Triggered more skepticism.
Produced more hesitation.
Systems that offered assistance in steps:
Felt safer.
Were easier to learn.
Integrated more naturally into existing workflows.
As Josh Payne noted earlier in the panel, adoption isn’t about how powerful a system is.
It’s about how easily users can see themselves using it successfully.
Human + AI Beats AI Alone
Across anecdotes and domains, the conclusion was consistent:
“Human + AI systems outperform AI-only systems when boundaries are explicit.”
Explicit boundaries:
Clarify responsibility.
Reduce surprise.
Enable graceful failure.
Preserve user agency.
Autonomy can be added later, once trust, understanding, and structure exist.
The Practical Takeaway
The question isn’t whether AI can do something end-to-end.
It’s whether it should.
Teams that default to decomposition:
Ship faster.
Build trust earlier.
Adapt more easily.
Avoid catastrophic failure.
In AI products, autonomy is not the starting point. It’s the reward for getting everything else right.
10. The Hardest Problems Are Still Hard
The panel closed on a sobering, but ultimately empowering truth: some problems are still genuinely hard.
Not hard because teams lack talent. Not hard because models aren’t improving.
Hard because the problems themselves sit at the edge of what today’s systems can reliably handle.
And acknowledging that reality turned out to be a strength, not a weakness.
Value and Difficulty Are Still Correlated
Several speakers noted a pattern that can be uncomfortable in an era of rapid progress.
The most valuable problems tend to be:
Deeply contextual.
Poorly structured.
Full of ambiguity.
Dependent on human judgment.
Embedded in messy real-world systems.
These are exactly the problems where AI looks promising, and where it most often breaks down in production.
As Linus Lee framed it earlier, intelligence alone doesn’t solve these problems.
They require alignment between models, interfaces, workflows, and human expectations, and that alignment is still hard to achieve.
Model Capability Has Limits — And That’s Okay
The panel was notably clear-eyed about current model limitations.
Despite dramatic improvements:
Reasoning degrades under uncertainty.
Long-horizon tasks remain fragile.
Edge cases dominate real usage.
Confidence often exceeds correctness.
Pretending these limits don’t exist leads teams to ship systems that fail silently, and damage trust in the process.
As Matthew Rastovac noted from an enterprise lens, realism about limitations is often what unlocks adoption.
Users are far more forgiving of systems that clearly communicate what they can’t do than systems that promise everything and fail unpredictably.
Exploration Is Not the Same as Production
A crucial distinction emerged between exploration and production.
Exploration:
Tolerates failure.
Values learning.
Embraces uncertainty.
Rewards ambition.
Production:
Demands reliability.
Requires accountability.
Exposes weaknesses.
Punishes overreach.
The panel emphasized that confusing these two modes is one of the most common causes of AI product failure.
Exploration is valuable. But production requires restraint.
Honest Feasibility Checks Save Time
Several speakers described projects that only succeeded once teams stopped asking “How do we make this work?” and started asking “Should this exist right now?”
That shift unlocked better decisions:
Reframing the problem.
Narrowing scope.
Delaying automation.
Changing the abstraction.
Temporarily walking away.
This wasn’t failure. It was judgment.
As Aiden Bai put it earlier in the session, knowing when not to ship is just as important as knowing how to ship quickly.
Walking Away Is Sometimes the Fastest Path Forward
One of the most counterintuitive takeaways of the panel was that walking away can be a form of progress.
Teams that succeed long-term:
Revisit problems as models evolve.
Reattempt challenges with better tools.
Recognize when timing is wrong.
Preserve optionality instead of forcing solutions.
Walking away doesn’t mean abandoning ambition.
It means sequencing it correctly.
The Mature View of AI Product Development
By the end of the discussion, a clear philosophy had emerged.
The best teams:
Push hard where leverage exists.
Design carefully where risk is high.
Accept limits without resignation.
Combine optimism with discipline.
They don’t mistake possibility for readiness.
The Final Takeaway
AI is expanding what’s possible, rapidly.
But not everything possible today is viable today.
The teams that win are not the ones who chase the hardest problems blindly.
They’re the ones who:
Understand the limits of current models.
Respect domain complexity.
Choose the right problems at the right time.
In an industry driven by acceleration, the panel offered a grounding reminder:
Progress comes not just from pushing forward, but from knowing when to pause, reframe, and return stronger later.