What is the biggest mindset shift for AI product development?

The locus of control moves from build time to runtime. In traditional software you remove uncertainty before shipping, then the product is done. In AI products the uncertainty lives in production permanently, so you ship a control loop around an unreliable component and manage its behavior continuously rather than finishing it once.

Why do most AI product initiatives fail?

Independent studies converge on failure rates above 80% for AI projects and higher still for generative AI pilots, and the causes they cite are mostly organizational, not technical. Teams apply a deterministic playbook (fixed specs, pass-or-fail QA, ship and move on) to a probabilistic system that needs evals, context engineering, failure-tolerant design, and continuous ownership.

What is eval-driven development for product managers?

You stop writing a spec that defines exact behavior and instead define what a good output looks like as a measurable rubric, then score AI outputs against real test cases. The rubric becomes the PRD for AI behavior and evals become the sensor that tells you whether anything is actually ready to ship.

Should leaders wait for better models before building?

Don't stall, but don't build what the next model release will hand you free either. The durable investment is the scaffolding around the model: data, context, guardrails, evals, and UX. Build that now, own it continuously, and let model upgrades raise the ceiling underneath work you already own.

AI Product Management: 7 Mental Models for Leaders

The hard part of moving to AI-led product development is not learning new tools. It is that the assumptions your craft was built on quietly stop holding.

Here is the shift underneath all the others. In traditional software, your job was to remove uncertainty before you shipped: spec it, test it, ship it, done. The uncertainty lived at build time and you killed it there. AI inverts this. The uncertainty does not disappear when you ship. It moves into production and stays there. The locus of control moves from build time to runtime, which means you are no longer shipping a finished artifact. You are shipping a control loop around a component that will never be fully reliable. Evals are the sensors in that loop, guardrails and fallbacks are the actuators, and your rubric is the setpoint. Every mental model below is a consequence of that one move.

The stakes are real. Independent studies converge on a grim picture: RAND's report cites estimates that more than 80% of AI projects fail, about twice the rate of non-AI IT projects, and MIT's Project NANDA found that roughly 95% of generative AI pilots delivered no measurable profit impact. Treat the exact figures with some caution; both have drawn methodology criticism, and "failure" is defined differently across them. What is hard to argue with is the direction and the diagnosis: the failures are mostly organizational, not a problem with the models. Teams keep running a deterministic playbook against a probabilistic system. Here is the playbook that replaces it.

1. Output is a distribution, not a value

Deterministic software gives you a contract: input X always produces output Y. You can assert equality in a test and trust it forever.

A language model breaks that contract. The same prompt produces different answers across runs, and small wording changes swing quality. You are no longer reasoning about a value. You are reasoning about a distribution of outputs, most of which should be good and some of which will not be.

This is the root, and the other six follow from it. Once behavior is statistical, "does it work?" stops being yes or no. The real questions are how often it fails, how badly, and what happens when it does.

Do this Monday: run your key prompt ten times against the same input and look at the spread. That variance, not any single output, is the thing you are actually shipping.
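One rough way to run that exercise, sketched in Python. The `call_model` function and its canned answers are hypothetical stand-ins that simulate a non-deterministic model; in practice you would swap in your own client call and keep the counting logic.

```python
import random
from collections import Counter

# Hypothetical canned outputs standing in for real model responses.
OUTPUTS = [
    "refund approved",
    "refund approved, pending review",
    "refund denied",
]

def call_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for a real model call (assumption: replace with your client)."""
    # Simulate non-determinism: one common answer plus occasional variants.
    return rng.choices(OUTPUTS, weights=[0.7, 0.2, 0.1])[0]

def sample_spread(prompt: str, runs: int = 10, seed: int = 0) -> Counter:
    """Send the same prompt `runs` times and count each distinct output."""
    rng = random.Random(seed)
    return Counter(call_model(prompt, rng) for _ in range(runs))

if __name__ == "__main__":
    spread = sample_spread("Can this customer get a refund?")
    for output, count in spread.most_common():
        print(f"{count:>2}/10  {output}")
```

The table of counts is the artifact worth looking at: how many distinct answers appear, and whether any of them would be unacceptable if a user saw it.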

2. Your spec becomes a rubric

In traditional development the PRD defines behavior and QA checks the build against it. With AI the spec cannot fully define the behavior, because the behavior is probabilistic. So the artifact changes. Instead of a list of requirements you write a rubric: what does a good output look like, on which dimensions (accuracy, tone, format, safety), and how do you score it?

This is eval-driven development, and the rubric is, in effect, a PRD for AI behavior. Evals do for AI features what unit tests did for code. They turn "I think this is better" into "we tested 200 cases and accuracy rose without regressing tone." In control-loop terms, evals are your only sensor. Ship without them and you are flying blind.

Kevin Weil, OpenAI's chief product officer, has said plainly that writing evals is becoming a core skill for product managers. If your team still ships AI features on vibe checks, that is the first habit to kill.

Do this Monday: write the rubric and assemble 30 to 50 real test cases before you write another prompt.
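A minimal sketch of what a rubric plus test cases can look like in code, assuming each dimension can be reduced to a pass-or-fail check. The names `Criterion`, `EvalCase`, and `score`, and the toy checks themselves, are illustrative assumptions; real rubrics often need human or model graders for dimensions like tone.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str, str], bool]  # (output, expected) -> passed?

@dataclass(frozen=True)
class EvalCase:
    prompt_input: str
    expected: str  # the fact a good answer must contain
    output: str    # the recorded model output for this input

# Toy checks for illustration only; these are assumptions, not a standard.
RUBRIC = [
    Criterion("accuracy", lambda out, exp: exp.lower() in out.lower()),
    Criterion("format", lambda out, exp: len(out) <= 200),
    Criterion("tone", lambda out, exp: "!!" not in out),
]

CASES = [
    EvalCase("Order #1 status?", "shipped", "Your order has shipped."),
    EvalCase("Order #2 status?", "delayed", "Your order is on its way!!"),
]

def score(cases: list[EvalCase], rubric: list[Criterion]) -> dict[str, float]:
    """Return the pass rate for each criterion across all cases."""
    if not cases:
        raise ValueError("need at least one test case")
    return {
        c.name: sum(c.check(case.output, case.expected) for case in cases) / len(cases)
        for c in rubric
    }

if __name__ == "__main__":
    for name, rate in score(CASES, RUBRIC).items():
        print(f"{name}: {rate:.0%}")
```

Scoring per dimension rather than with one blended number is a deliberate choice here: it lets you see when a prompt change lifts accuracy while quietly regressing tone.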

3. The model is the smallest part of the product

This is the one leaders get wrong most often, and it is the most expensive. The instinct is to treat the model as the product and assume a better model fixes everything.

The data disagrees. MIT's GenAI Divide report, the Project NANDA study cited above, traced the failures not to model quality but to a learning and integration gap. Generic tools work for individuals because they are flexible, but they stall inside companies because they do not adapt to the workflow. The product is the loop, not the component: retrieval, context, guardrails, fallbacks, the data you feed the model, and the interface that sets user expectations. That scaffolding is where defensibility lives, because anyone can call the same API you can.

A blunt test: if a better base model would make your feature pointless, you built a demo, not a product. Durable products get better when the model improves; they do not get erased by it.

4. You discover capabilities, you don't specify them

In traditional planning you decide what to build, then build it. With AI you cannot reliably know what the model can and cannot do for your specific case until you try. The capability envelope is empirical.

So the order of operations inverts. Spike before you plan. Build a throwaway prototype against real inputs to find the frontier of what works, then commit the roadmap. A feature that looks trivial may be unreliable, and one that looks ambitious may already work out of the box. Planning a quarter of AI work without probing that boundary is how teams commit to features the model cannot deliver and miss the ones it could.

5. Every call carries a price

Traditional software has near-zero marginal cost: once it is built, the millionth user costs almost nothing. That single assumption underwrites most SaaS economics.

AI breaks it. Every inference has a real cost in tokens and a real cost in latency. A feature that calls a large model on every keystroke has a completely different margin profile from one that caches and batches. This is not a detail for engineering to handle later. It is a product and pricing decision, because it determines what you can offer, at what price, to whom. Leaders who model AI products on zero-marginal-cost intuition get surprised by their own gross margins.

Do this Monday: estimate cost per action and latency per action before you commit to an interaction pattern, not after.
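The estimate is back-of-envelope arithmetic, sketched below. Every number in it is a placeholder assumption (token counts, per-million-token prices, seconds per call), not any vendor's actual pricing, and the latency figure assumes the calls within one action run sequentially.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionProfile:
    calls_per_action: int        # model calls needed for one user action
    input_tokens_per_call: int
    output_tokens_per_call: int
    seconds_per_call: float

def cost_per_action(p: ActionProfile,
                    usd_per_million_input: float,
                    usd_per_million_output: float) -> float:
    """Token cost in USD for one user action."""
    input_cost = p.calls_per_action * p.input_tokens_per_call * usd_per_million_input / 1e6
    output_cost = p.calls_per_action * p.output_tokens_per_call * usd_per_million_output / 1e6
    return input_cost + output_cost

def latency_per_action(p: ActionProfile) -> float:
    """Seconds of model time per action, assuming sequential calls."""
    return p.calls_per_action * p.seconds_per_call

if __name__ == "__main__":
    # Placeholder prices and profiles, chosen only to show the comparison.
    chained = ActionProfile(4, 2000, 300, 1.5)
    single = ActionProfile(1, 2000, 300, 1.5)
    for label, profile in [("chained", chained), ("single", single)]:
        cost = cost_per_action(profile, 3.0, 15.0)
        print(f"{label}: ${cost:.4f} per action, {latency_per_action(profile):.1f}s")
```

Multiply cost per action by expected actions per user per month and you have the number to set against your price point before the interaction pattern is locked in.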

6. Design for being wrong

Because output is a distribution, the model will be confidently wrong some percentage of the time, and you cannot prompt that to zero. The mature response is to build the product so a wrong answer fails safely. The failure path is not an edge case here. It is a feature, and it gets heavy traffic.

In practice: surface confidence and uncertainty instead of presenting every answer as fact, keep a human in the loop where the cost of error is high, make actions reversible, and prefer suggesting over auto-executing for anything consequential. This is the override in your control loop. Traditional design optimizes the happy path; AI design has to treat the unhappy path as first class.
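Those rules can be written down as an explicit routing policy. The sketch below is one possible shape, not a standard: the `Proposal` fields, the thresholds, and the three routes are assumptions, and it presumes you have a confidence score that is reasonably calibrated, which raw model outputs often are not.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    SUGGEST = "suggest"            # show to the user, who decides
    HUMAN_REVIEW = "human_review"  # hold until a person approves

@dataclass(frozen=True)
class Proposal:
    action: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    reversible: bool
    high_stakes: bool

def route(p: Proposal,
          auto_threshold: float = 0.9,
          suggest_threshold: float = 0.6) -> Route:
    """Pick the safest path for a model-proposed action."""
    # High cost of error or low confidence: keep a human in the loop.
    if p.high_stakes or p.confidence < suggest_threshold:
        return Route.HUMAN_REVIEW
    # Only reversible, high-confidence actions run without asking.
    if p.reversible and p.confidence >= auto_threshold:
        return Route.AUTO_EXECUTE
    # Everything else is offered as a suggestion.
    return Route.SUGGEST

if __name__ == "__main__":
    print(route(Proposal("archive email", 0.95, True, False)))
    print(route(Proposal("send wire transfer", 0.99, False, True)))
```

The point of encoding the policy is that the unhappy path gets reviewed, tested, and versioned like any other feature instead of living in scattered conditionals.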

7. Shipping is the start, not the end

In traditional software, launch is roughly the finish line. With AI, launch is where the real work begins, for two reasons that pull in opposite directions.

The floor rises under you. The model frontier moves every few months, and a capability that needs careful engineering today may ship as a default tomorrow. That is good news if you built above the waterline, meaning you invested in durable scaffolding (data, context, evals, UX) rather than in features tightly coupled to today's model quirks.

The floor can also rot under you. A model upgrade, a prompt change, or quiet data drift can degrade a live feature without anyone touching the code, and you will not notice unless your evals are running in production. This is the leadership-level shift the failure studies keep pointing at. AI features are owned continuously, not shipped once. Budget for the loop, not the launch, and stop running AI as a one-time IT project. That single resourcing mistake is plausibly behind a large share of the roughly 80% failure rate cited above.
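Catching that rot can start as a simple comparison between the pass rates you recorded at launch and the ones your evals report on recent production traffic. The sketch below assumes you already collect both as per-criterion pass rates; the function name, the example numbers, and the 5-point tolerance are illustrative assumptions.

```python
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.05) -> dict[str, float]:
    """Return each criterion whose pass rate fell more than `tolerance`
    below its baseline, mapped to the size of the drop."""
    regressions = {}
    for name, base_rate in baseline.items():
        # A criterion missing from the current run counts as a pass rate of 0.
        drop = base_rate - current.get(name, 0.0)
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions

if __name__ == "__main__":
    # Placeholder numbers: launch-time pass rates vs. last week's production sample.
    launch = {"accuracy": 0.92, "tone": 0.95}
    last_week = {"accuracy": 0.84, "tone": 0.94}
    for name, drop in detect_regressions(launch, last_week).items():
        print(f"ALERT: {name} dropped {drop:.0%} below baseline")
```

Run on a schedule and wired to an alert, a check like this is what turns "owned continuously" from a slogan into a budget line.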

The takeaway

Pull the seven together and they are one idea: you have stopped building artifacts and started running control loops around unreliable components. Specs become rubrics, QA becomes evals, the model becomes a part rather than the whole, and cost, failure, and drift become design inputs instead of afterthoughts.

If you do one thing this week, take a single AI feature you are shipping and answer two questions about it. What is the rubric that defines a good output, and what happens when the output is wrong? If you cannot answer both, you do not yet have a product. You have a demo with a control loop missing.
