How to measure AI ROI without fooling yourself

Why most AI ROI math falls apart
The standard enterprise AI report is a license count, a login chart, and a quote from an enthusiastic early adopter. None of that is ROI. Seats are a cost line; logins are curiosity; enthusiasm is week-two behavior. The numbers that matter — hours returned, cost per unit of work, cycle time, error rates, revenue per rep — rarely appear, because nobody wrote them down before the rollout.
Missing baselines are the most charitable explanation for McKinsey's finding that only 39% of organizations report any EBIT impact from AI, with most of those under 5% of EBIT. Some of that gap is value that doesn't exist yet. A meaningful share is value that exists but was never instrumented, so nobody can defend it in a budget review.
The classic failure is the unfalsifiable claim: "we saved 10,000 hours." Saved relative to what baseline? Measured how? Did anyone redeploy those hours into something the business can see? Without answers, the number evaporates the first time a CFO pushes on it.
The three layers: capability, usage, outcomes
Capability — can your people actually use AI for their role? Measured with skills checks and task-based tests, not training attendance. Capability is the leading indicator of everything downstream.
Usage — do they use it, weekly, inside real workflows? The honest signal is retention after week four, when novelty wears off. Usage that survives a month is habit; usage that doesn't was a demo.
Outcomes — does a baselined business number move? Hours redeployed, cost per ticket, days of cycle time, error rates, revenue per rep. Outcomes only count when attributable to the workflow that changed.
Each layer predicts the next, which means you can manage them in order. (This is the same Capability–Usage–Momentum lens behind our free AI proficiency assessment — twelve questions, instant score, no account.)
A baseline-first method, in five steps
The discipline is unglamorous and entirely doable:
- Pick the workflow and write the baseline first — hours, cost, error rate, cycle time, dated and signed.
- Instrument usage where the work happens, not in a survey three months later.
- Attribute honestly: stagger the rollout across teams, or keep a comparison group, so the delta means something.
- Convert hours carefully — time saved only counts when it becomes redeployed capacity, avoided cost, or measurable throughput.
- Report a range, not a point estimate, and recompute quarterly. Precision you can't defend is worse than a defensible interval.
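The attribution and range steps above can be sketched in a few lines of Python. This is one possible approach, not the only one: it assumes a comparison group exists, uses a simple difference-in-differences on group means, and reports a leave-one-out sensitivity range (the estimate recomputed with each single observation dropped) rather than a formal confidence interval. The function names and the cycle-time figures are hypothetical.

```python
from statistics import mean

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated team minus change in the comparison team."""
    treated_delta = mean(treated_after) - mean(treated_before)
    control_delta = mean(control_after) - mean(control_before)
    return treated_delta - control_delta

def leave_one_out_range(treated_before, treated_after, control_before, control_after):
    """Min and max of the estimate when any single observation is dropped.

    A sensitivity range, not a statistical confidence interval.
    """
    groups = [list(treated_before), list(treated_after),
              list(control_before), list(control_after)]
    estimates = []
    for g_idx, group in enumerate(groups):
        if len(group) < 2:
            continue  # cannot drop the only observation in a group
        for i in range(len(group)):
            trimmed = list(groups)
            trimmed[g_idx] = group[:i] + group[i + 1:]
            estimates.append(diff_in_diff(*trimmed))
    if not estimates:
        point = diff_in_diff(*groups)
        return point, point
    return min(estimates), max(estimates)

# Hypothetical cycle times in days, sampled before and after the rollout.
treated_before = [10.0, 12.0, 11.0, 11.0]
treated_after = [8.0, 9.0, 7.0, 8.0]
control_before = [10.0, 11.0, 12.0, 11.0]
control_after = [10.0, 11.0, 10.0, 9.0]

point = diff_in_diff(treated_before, treated_after, control_before, control_after)
low, high = leave_one_out_range(treated_before, treated_after,
                                control_before, control_after)
print(point)      # -2.0 (days of cycle time attributable to the change)
print(low, high)  # the range to report alongside the point estimate
```

The comparison group matters here: the treated team improved by three days, but one of those days also showed up in the team that changed nothing, so only two are attributable.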
What good looks like at 3, 6, and 12 months
Three months: capability scores up, weekly usage holding after the novelty dip, and one workflow with a baseline number visibly moving.
Six months: two or three workflows with attributable gains, one scaled from a team to a function — and at least one pilot killed on schedule, because a kill list is what measurement discipline looks like from the outside.
Twelve months: function-level cost or throughput changes visible in ordinary operating reviews, and AI line items owned by the functions that benefit, not parked in an innovation budget. That's the point where AI ROI stops being a special report and becomes ordinary management.
Frequently asked questions
What's a realistic timeframe to see AI ROI?
Workflow-level impact in about 90 days if you baselined first. Function-level impact in six to twelve months. Enterprise EBIT impact is slower — which is consistent with McKinsey finding most reported impact still under 5% of EBIT.
Is "hours saved" a real ROI metric?
Only after conversion. Hours count when they become redeployed capacity, avoided hiring or vendor cost, or measurable throughput. Unconverted hours-saved claims should be discounted heavily — they rarely survive a CFO's second question.
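To make the conversion rule concrete, here is a small Python sketch. The category names, the `HoursClaim` record, and the default weight of zero for unconverted hours are assumptions chosen for illustration; how heavily to discount unconverted hours is a judgment call for your own finance team.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the three conversions named above.
CONVERTED = {"redeployed_capacity", "avoided_cost", "throughput"}

@dataclass
class HoursClaim:
    workflow: str
    hours: float
    conversion: Optional[str]  # one of CONVERTED, or None if unconverted

def defensible_hours(claims, unconverted_weight=0.0):
    """Sum hours, counting unconverted claims at a discounted weight."""
    total = 0.0
    for claim in claims:
        weight = 1.0 if claim.conversion in CONVERTED else unconverted_weight
        total += claim.hours * weight
    return total

# Hypothetical claims; workflows and figures are made up.
claims = [
    HoursClaim("support triage", 400.0, "redeployed_capacity"),
    HoursClaim("contract review", 250.0, "avoided_cost"),
    HoursClaim("meeting summaries", 1000.0, None),  # nobody can say where the time went
]
print(defensible_hours(claims))  # 650.0
```

The headline claim of 1,650 hours shrinks to 650 once the unconverted line is set aside, which is the number more likely to hold up in a budget review.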
What metrics belong on an AI adoption dashboard?
One per layer: a capability score from skills checks, weekly active usage inside target workflows (with week-four retention), and the baselined business number each workflow is supposed to move.
Why do so few companies report EBIT impact from AI?
Two stacked reasons: most pilots never change behavior (MIT found that 95% of pilots show no P&L impact), and much of the value that does exist was never baselined or instrumented, so it can't be credibly claimed.