The produce forecasting pilot died on a Tuesday. The model was fine.
What broke was upstream, in a place no slide in the rollout deck had ever looked. The receiving step ran three different ways depending on who opened the dock. The handoff to the floor lead happened verbally, sometimes at 5:42 in the morning, sometimes at 6:15, sometimes not at all. The workflow the model was sitting on top of was not one workflow. It was three, held together by one veteran who knew which version applied on which morning, and who happened to be off the week the pilot went live.
Her name does not matter, because she is a composite, but her function was specific and it is worth seeing clearly. She had run the back dock for eleven years. She knew that the Tuesday truck came heavy on berries and light on stone fruit, that the new hire on the pallet jack would over-receive lettuce if nobody caught him, and that the system's suggested order had been wrong about bananas every Monday since a reset two springs ago, so you simply overrode it without thinking. None of that was written down. It did not need to be, because she was there, and being there was the system. The forecast the chain deployed was, in a sense, trying to stand in for eleven years of one person reading the truck and the weather and the new kid's hands at five in the morning. It was a good forecast. It was not close to enough.
On paper, the chain had deployed a generative demand forecast. In practice, the floor was running the same hand-tallied par sheet it had used for fifteen years, and the forecast output sat in a tab no associate opened after the first morning. The pilot did not fail because the math was wrong. The math was the easy part. It failed because the work the math was supposed to improve had never been made visible to anyone, including the people who built the tool. The veteran's three versions were not in any document. They were in her head, and the week she was out, the operation found out what it had actually been depending on.
The morning it went live, the regional team was on a call to watch the numbers, and the numbers looked good. What the call could not see was the floor. The closing lead from the night before had left the dock staged the way he always did, which was not the way the day crew ran it. The associate assigned to pull the forecast had been moved to cover a call-out in dairy. And the forecast itself, accurate to a decimal, described an order for a routine the crew was not running that morning. By seven the floor had fallen back to the par sheet, because the par sheet worked and the new thing did not fit the morning they were actually having. Nobody decided this. It is just what a floor does under pressure. It reaches for the thing that works.
Nobody called it a failure that Tuesday. There was no incident, no outage, no error message. The forecast kept generating. The dashboard kept showing green. It is just that the floor stopped using it, quietly, the way a floor always does with a tool that does not fit the work. By the end of the quarter the pilot was described in the steering committee as inconclusive, which is the word an organization uses when it cannot say what went wrong because nothing technically did. The model performed to spec. The deployment was clean. And the thing changed nothing, because the thing was never the model.
You can watch the death in the usage logs, if anyone thinks to pull them, which mostly no one does. In the version of this I know best, someone finally pulled them a month in. The forecast had been opened forty times the first morning, eight times the second, twice the third, and after that the line just flattens against the floor, a tool generating output into a room nobody was in. The floor had a name for the tab by then, the kind of name a crew gives a thing that wastes their time, and they had stopped clicking it before the first week was out. The numbers the steering committee watched, sales per labor hour, shrink, service level, never moved, which was read upstairs as stability and on the floor as proof the new thing had never mattered. Both readings were right. That is the quiet horror of it. The operation was healthy and the pilot was dead, and the two facts sat next to each other for a quarter without anyone connecting them.
I have sat in some version of that steering committee more times than I want to count. The slide says inconclusive, or it says promising with caveats, or it says we are still gathering learnings, and everyone nods, because the alternative is to say out loud that the organization spent real money pointing a capable tool at work it had never bothered to understand. That is a harder sentence to put on a slide. So the pilot is shelved, the vendor is thanked, the budget moves to the next promising capability, and the floor goes on running the par sheet it never stopped running. A year later someone asks why the AI investment has not moved the profit-and-loss, and no one in the room can answer, because the answer happened on a Tuesday, on a dock none of them had ever stood on.
I am writing this as an operator who has watched that pattern repeat, in the operation I run and in publicly documented retail-AI deployments across the industry. The vendor demo is rarely the problem. What fails is the work underneath the model: the unwritten routines, the verbal handoffs, the three-versions-of-the-same-process that the org chart conceals and that no one is paid to surface. I have come to believe this is the single most expensive misunderstanding in retail technology right now, and it is hiding in plain sight, because it does not look like a technology problem and so the technology people do not own it, and it does not look like an operations problem and so the operations people do not own it either.
I have spent most of my working life on the operations side of that gap, running high-volume retail, carrying a profit-and-loss number, hitting a payroll with no slack in it. I have also spent the last few years on the other side of it, building the systems, learning what the models can and cannot do, watching the demos with an operator's suspicion and a builder's curiosity at the same time. That double vantage is the reason for this book. The people who study the work mostly do not build the tools, and the people who build the tools mostly do not run the floor, and the space between them is the same space the produce pilot fell into. I am not writing as a vendor with something to sell you or as an academic with a model to defend. I am writing as someone who has had to make this work on a real floor, with real people, on a real Tuesday, and has watched it fail for reasons that had nothing to do with the thing everyone was looking at.
The timing is what makes this urgent rather than merely interesting. The tools are here now, capable and cheap and multiplying, and the floor is already reaching for them, often without anyone's permission and usually without anyone's plan. That is close to the worst possible combination: real capability meeting unmapped work, with no one assigned to join the two. Every quarter an organization waits to name this function is a quarter of pilots dying quietly on docks, of budgets spent on the twenty percent of the problem that is the model, of veterans carrying translation work that leaves the building the day they retire. The window to do this deliberately, before the failures harden into a settled belief that retail AI just does not work, is open now and will not stay open. The operations that figure out the other eighty percent, the work underneath the model, in the next few years will compound a lead that the ones still buying tools cannot catch.
The pattern is not local
It would be comforting to treat the dead pilot as a one-off, a local failure of one chain or one team. It is not. It is the dominant pattern, and the research is blunt about it.
MIT NANDA's State of AI in Business 2025 finds that ninety-five percent of enterprise generative-AI pilots produce no measurable profit-and-loss impact, and attributes the shortfall to a learning gap in both tools and organizations rather than to model quality. Sit with that number. Nineteen of every twenty pilots, including the ones that demo beautifully, produce nothing the business can measure. The conventional reading is that the technology is not ready. The data says the technology is mostly working and the organizations are not absorbing it.
BCG's enterprise survey reports that seventy-four percent of companies struggle to achieve and scale AI value, and locates seventy percent of the implementation difficulty in people and process, twenty percent in technology, and only ten percent in the algorithm itself. McKinsey's global survey shows that while eighty-eight percent of organizations use AI in at least one function, only thirty-nine percent report any earnings impact, and the firms moving the needle are the ones redesigning workflows rather than layering models on top of the work that already exists. From the top of the house, IBM's CEO study finds that sixty-four percent of chief executives say succeeding with generative AI depends more on people's adoption than on the technology. And from the worker's side, Microsoft and LinkedIn's Work Trend Index reports that most knowledge workers already use AI at work, most bring their own tools to do it, and most leaders concede their company has no plan for implementing it.
Read those five findings together and a single shape emerges. The models work well enough. The adoption is already happening, often without permission. And the value is not landing, because the work the AI is supposed to improve has never been mapped, and no one is accountable for mapping it. The bottleneck is not the model. The bottleneck is organizational.
A word on these numbers before they carry too much weight. The ninety-five percent figure, and the seventy percent that locates the difficulty in people and process, come from practitioner research, not peer-reviewed studies, and they should be read as strong signals rather than settled law. The reason to trust the direction they point is that the peer-reviewed work agrees. The productivity J-curve literature shows that gains from a general-purpose technology lag the complementary work of redesigning how the organization operates, which is the same claim in academic form. The headline stats are the alarm. The J-curve is the mechanism. The argument rests on both, not on either alone.
You have probably lived some version of this. The tool that demoed beautifully and then quietly went unused. The dashboard that stayed green while everyone on the floor knew the thing was not working. The pilot that was never killed and never scaled, just left to fade, because killing it would have required explaining what went wrong and no one could. If you have, then you already know in your gut what the data says out loud. The model is not the bottleneck. Something upstream of it is, and the rest of this book is an argument about what that something is and what to do about it.
Change the technology and the shape holds. A regional grocer I will describe as a composite put computer vision on its shelves, cameras and a model trained to flag out-of-stocks in real time, the kind of deployment that photographs beautifully in a board deck. The model was genuinely good. It caught gaps a walking associate would miss, and it caught them fast. On the demo floor, with a manager standing there primed to respond, it looked like the future.
Then it shipped to three hundred stores, and the alerts landed nowhere. There was no decision about who received them, how they ranked against the forty other things a closing crew is already doing, or what an associate was supposed to drop in order to chase a flag from a camera. So the alerts piled into a queue nobody owned, and the floor did what floors do with a firehose of unprioritized tasks from a system that does not understand their morning. They ignored it. Worse, the few associates who tried to chase every alert fell behind on the work that actually moved the store, because the tool had quietly added a second job on top of the first and called it efficiency. Within two months the alerts were muted in most stores, not by a decision anyone made on a slide, but by three hundred crews independently reaching the same conclusion, which is that a tool that does not fit the work is just noise with a budget. The vision model was never the problem. The problem was that no one had done the eighty percent, the work of deciding how an alert becomes an action on a real floor without breaking the floor, and no one was named to own that translation. Same Tuesday. Different camera.
Two pilots, two technologies, one shape. A forecast that was accurate and unused. A vision model that was sharp and ignored. In both, the capability was real and the value never landed, for the same reason, which is that the work between the model and the floor, the deciding and the sequencing and the fitting of a new thing into a real morning, was nobody's job. That layer, the translation between what the AI can do and what the floor actually does, is where retail AI lives or dies, and in most operations it is empty. The chapters ahead are about filling it on purpose, with a named role and a real method, instead of leaving it to whichever veteran happens to care.
The conventional explanation, and what it leaves out
When a pilot dies, the post-mortem usually names three or four culprits. The model was not mature enough. The data was dirty. Change management was under-resourced. And sometimes a fourth, that regulatory or compliance friction got in the way. Each of these is partly correct. Models do fail. Data is often a mess. Change management is chronically underfunded. Compliance is real.
But after years of watching this up close in a high-volume grocery environment, triangulated against the literature, I am convinced these explanations together do not account for the whole loss. There is a residual the conventional story cannot reach, and it is the residual this book is about.
The residual is this. The operator role responsible for translating AI capability into frontline workflow does not exist as a named, credentialed, measured competency. The work of that role is real and it is happening, badly, smeared across store managers, ops directors, regional vice presidents, and corporate AI leads, none of whom is evaluated on translation outcomes. Everybody touches a piece of it. Nobody owns it. The work is unowned, and unowned work does not get done well, because no one is accountable for whether it gets done at all.
You will meet that residual again in the next chapter, where I take the conventional explanations apart one at a time and show why each is real and none is sufficient. For now it is enough to name it, and to sit with how strange it is. The function that decides whether a retail AI investment lives or dies is performed, when it is performed at all, by whoever happens to care that day, and it appears on no job description in the building.
That is why the produce pilot died. Not because the model was weak, but because the translation between the model and the 6 a.m. floor was nobody's job. The veteran with the three versions in her head was doing that translation informally, for free, invisibly, and when she was out, the translation simply stopped, and the pilot had nothing to stand on.
The thesis
So here is the argument this book makes, stated plainly and then earned across the chapters that follow.
Retail AI pilots fail at scale not because the models are weak but because the operator role responsible for translating AI into frontline workflow has not been named, the competencies it requires have not been defined, and the scoreboard used to evaluate its performance is borrowed from a pre-AI retail era. Naming the role and rebuilding the scoreboard are the prerequisites for the J-curve to bend.
That last phrase comes from the research on general-purpose technologies. Brynjolfsson, Rock, and Syverson documented what they call the productivity J-curve, the pattern where a transformative technology actually dips measured productivity in its early years before the complementary investments, the new processes and new skills and new ways of organizing, compound into gains. The J-curve names the dip. It does not tell you how to climb out of it. The role and the scoreboard are how retail climbs out of it. They are the complementary investment the J-curve says you need and the conventional rollout never makes.
It is worth being concrete about what that dip looks like on a floor, because the abstraction hides the hard part. When an operation does the real work, the mapping and the codifying and the validating, the legacy numbers often get worse before they get better. You pull a veteran off the line for two weeks to shadow a workflow, and that is two weeks of labor the profit-and-loss can see and a payoff it cannot. You codify a routine, and the first week of running it the new way is slower than the old way, because the old way lived in muscle memory and the new way does not yet. A leader watching only the legacy scoreboard sees cost go up and output dip and concludes the initiative is failing, at exactly the moment it is starting to work. The J-curve is the shape of an investment that costs before it compounds, and the reason most retail AI never makes it out of the dip is that nobody built the instrument that could tell a healthy dip from a real failure. That instrument is in this book too.
Here is why operations skip that work even when they know better. The eighty percent is expensive in the one currency a retail operation cannot print, which is frontline time. Shadowing a workflow, codifying a routine, validating it against the floor, all of it spends hours that are already committed to running the store. The twenty percent, the model, is bought with money, and money is easier to find than a veteran's two weeks. So the organization spends where spending is easy, on the tool, and starves the place the value actually comes from, the work, and then is surprised when the tool it could afford fails for lack of the work it could not. The inversion this book argues for is, in part, a budgeting argument. Put your scarcest resource, attention on the floor, where the problem actually is.
The model was the twenty percent. The work underneath was the eighty percent. And almost every retail AI pilot I have watched up close has spent its energy backwards, perfecting the twenty percent it could buy from a vendor and ignoring the eighty percent that only an operator can do.
There is a name for the person who should have kept that pilot alive, and an instrument that would have shown it dying in time to act. This book gives you both. The role is the Workflow-First AI Leader, the operator accountable for translating AI into frontline work. The instrument is a scoreboard built to see AI landing or stalling while you can still do something about it, not a quarter later in the profit-and-loss. The chapters build to both. For now, hold the one thing that matters. The pilot did not die for lack of a model. It died for lack of an owner.
What this book gives you
This book is built on a framework, Workflow-First AI™, and its load-bearing principle fits in two sentences. The work is 80% of the problem. The model is the easy part. The whole of it reduces to one instruction: start with the work, not the model. The chapters that follow turn that instruction into something you can run. The book makes four moves.
First, it names the role. The Workflow-First AI Leader™ is the operator who diagnoses frontline workflow before any model is pointed at it, codifies the work into a form AI can act on, translates corporate AI strategy into floor-level execution, gates every deployment on responsible-use criteria, and owns frontline adoption as the primary measure of the function. Chapter 9 defines it in full, in terms a hiring manager, an academic, and a working operator can each act on.
Second, it derives the competencies the role requires and anchors each to the research and to a signal you can measure, so the role is not a slogan but a job with a development path.
Third, it proposes The Operator's Scoreboard™, a four-quadrant measurement system that augments rather than replaces the legacy P&L scoreboard, on the theory, after Kaplan and Norton, that financial numbers lag the operating decisions that produce them. Chapter 7 builds it. The legacy scoreboard could not see the produce pilot dying. The Operator's Scoreboard can.
Fourth, it closes with a ninety-day diagnostic a general manager or operations leader can start on Monday, so the framework's value is operational, not theoretical. Chapter 11 lays it out.
The rest of this chapter you already have. It is the dead pilot, and the question it leaves on the table, which is the question every chapter from here answers in a different way. What role on your org chart was responsible for keeping that pilot alive, and how would you have known, before the quarter closed, that it was already dead?
Chapter 2 takes up the failure pattern directly, walks through the conventional explanations one at a time, and shows why each is real and none is sufficient. Then it names what the whole industry, and most of the research, has been stepping around.
The takeaway
The move. See that the pilot died for lack of an owner, not a model.
The signal to watch. A tool that demos well and quietly goes unused.
Do this. Pick one stalled AI tool and ask who owns whether it lands.
### Sources note
Distinguished by type so the claims can be cited precisely.
- MIT NANDA. (2025). The State of AI in Business 2025. (Practitioner research.) Ninety-five percent of enterprise generative-AI pilots produce no measurable P&L impact; gap attributed to a learning gap in tools and organizations, not model quality.
- BCG. (2024). AI Adoption in 2024. (Practitioner research.) Seventy-four percent of companies struggle to achieve and scale AI value; seventy percent of the difficulty is people and process, twenty percent technology, ten percent the algorithm.
- McKinsey. (2025). The State of AI: Global Survey. (Practitioner research.) Eighty-eight percent use AI in at least one function; thirty-nine percent report EBIT impact; movers redesign workflows.
- IBM Institute for Business Value. (2024). CEO Study. (Corporate primary research.) Sixty-four percent of CEOs say success depends more on adoption than on the technology.
- Microsoft & LinkedIn. (2024). Work Trend Index. (Corporate primary research.) Most knowledge workers use AI at work and bring their own; most leaders report no implementation plan.
- Brynjolfsson, E., Rock, D., & Syverson, C. (2021). The Productivity J-Curve. American Economic Journal: Macroeconomics. (Peer-reviewed.) Transformative technologies dip measured productivity before complementary intangible investments compound.
Composite note: the produce forecasting pilot, the veteran with three versions of the receiving step, and the shelf-vision out-of-stock deployment are composite illustrations drawn from publicly documented retail-AI deployment patterns and frontline observation. They do not depict any single operator or organization.