By Michael Martin, Managing Director, Data & AI
What You’ll Learn:
- Why the model evaluation process is consuming attention that should be directed at a more consequential question – and the structural fix
- Why most enterprise environments aren’t ready to support AI at production scale
- The three gaps that determine whether AI delivers measurable returns or stalls in pilot
- What the 5% of organizations producing measurable AI returns did differently – and what most organizations optimizing for pilots are skipping
- Five decisions that separate organizations accumulating outcomes from those accumulating pilots
The question isn’t whether AI can deliver business value. It already is for 5% of the organizations that have deployed it. The remaining 95% are running the same models and getting fundamentally different results. That gap isn’t a model problem. It’s a system problem. This paper makes the case for where enterprise AI outcomes are actually determined – and what it takes to close the gap.
The Gap Between Expectation and Return
Most CIOs are not pessimistic about AI. They are frustrated. That distinction matters. Pessimism implies the technology cannot deliver. Frustration is the experience of watching technology that clearly can deliver fall short of what was promised – in the pilot, in the board deck, in the vendor narrative. That frustration is widespread, and it is measurable.
MIT finds that 95% of enterprise AI initiatives produce no measurable P&L impact. BCG reports that only 5% of companies have achieved value at scale. McKinsey finds that while adoption is widespread, only one in ten organizations has moved beyond pilot-stage outcomes. The signal across these data points is consistent: access to AI capability is no longer the differentiator. Converting that capability into business performance is.
The underperformance itself is not dramatic. It is quiet.
The model works. The pilot produces strong results. The system functions. But when measured against the business case that justified the investment, the returns fall short. Cost reduction projections narrow and productivity gains flatten in real workflows. Outputs require enough human validation that the net efficiency gain is a fraction of the forecast. Over time, expectations recalibrate downward – not because the system failed, but because it never fully delivered.
This is what underperformance looks like in practice. It is not the absence of capability; it is the failure to operationalize it at scale.
Optimizing the Wrong Layer
The model evaluation process has become highly sophisticated. Organizations benchmark accuracy, latency, cost, context window, and vendor risk. These are reasonable considerations. The problem is not that organizations are choosing the wrong model. It is that the evaluation process consumes the attention that should be directed at a more consequential question: whether the enterprise system is capable of supporting any model at production scale.
That question is harder to answer, and therefore easier to defer. A model can be evaluated in a controlled environment in an afternoon. The readiness of an enterprise’s data foundation, integration architecture, and governance model cannot. As a result, organizations optimize what is measurable and postpone what is structural. They deploy into environments that were never designed for continuous, machine-driven interaction and encounter the gap.
The pilot environment conceals the problem entirely. In a controlled pilot, the model runs on curated data within a constrained scope, managed by a small team. The results are strong because the conditions are designed to be strong. That is the right way to evaluate a model, but it creates a misleading signal about production readiness. The moment AI is introduced into real operations, it interacts with fragmented data, legacy APIs, and security models built for human-paced access. The model performs as expected. The system does not.
McKinsey finds that organizations reporting significant financial returns are twice as likely to have inverted the sequence. They redesigned workflows, data structures, and integration pathways before selecting a model, not after. The model decision did not drive the outcome. The system it was deployed into did.
The Three Gaps that Actually Determine Outcomes
Gap 1: The Data Foundation is Not Ready
Enterprise AI is only as reliable as the data it operates on. In most organizations, that foundation is not yet sufficient. Data is fragmented across systems, inconsistently defined, and often insufficiently governed for operational use. Qlik finds 81% of data leaders struggle with data quality, driving low confidence in AI readiness. That lack of trust becomes a limiting factor when AI systems depend on that data not just for reporting, but for real-time reasoning and action.
In a reporting context, poor data quality creates delayed or partial insight. In an AI context, it creates immediate and compounding error. An agent operating on unmastered or inconsistent data does not produce occasional inaccuracies. It produces them continuously, across every interaction it touches.
This is why data readiness is not a prerequisite for experimentation – it is a prerequisite for production.
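To make the data gap concrete, the following is a minimal sketch of the kind of consistency check that could run before an agent is allowed to act on shared records. The system names (`crm`, `erp`), the record IDs, and the `find_conflicts` helper are all hypothetical; a real data-mastering process would be considerably more involved.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def find_conflicts(
    system_a: Dict[str, Record],
    system_b: Dict[str, Record],
    fields: Tuple[str, ...],
) -> Dict[str, List[str]]:
    """Return {record_id: [fields that disagree]} for IDs present in both systems."""
    conflicts: Dict[str, List[str]] = {}
    for record_id in sorted(system_a.keys() & system_b.keys()):
        differing = [
            field
            for field in fields
            if system_a[record_id].get(field) != system_b[record_id].get(field)
        ]
        if differing:
            conflicts[record_id] = differing
    return conflicts


# Hypothetical customer records held in two systems.
crm = {
    "C-100": {"name": "Acme Corp", "country": "US"},
    "C-200": {"name": "Globex", "country": "DE"},
}
erp = {
    "C-100": {"name": "ACME Corporation", "country": "US"},
    "C-200": {"name": "Globex", "country": "DE"},
    "C-300": {"name": "Initech", "country": "US"},
}

conflicts = find_conflicts(crm, erp, ("name", "country"))
print(conflicts)
```

In a reporting workflow, a mismatch like the one flagged for `C-100` surfaces as a footnote. An agent reading both systems would encounter it on every interaction that touches that record.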
Gap 2: The Integration Infrastructure Was Built for a Different Speed
Most large enterprises carry 10 to 15 years of API and integration infrastructure designed for human-paced interaction. Roughly 50% of them are entirely blind to machine-to-machine traffic and cannot monitor their AI agents.
AI operates differently. It interacts continuously, across systems, at machine speed. That shift exposes limitations that were previously invisible. APIs that were sufficient for periodic human use begin to show latency, inconsistency, or instability under continuous load. Legacy interfaces that were never modernized become points of risk when exposed to autonomous systems.
The issue is not that these systems are broken. It is that they were built for a different operating model. AI does not break systems at the core; it fractures them at the seams – at the points where systems connect, where data moves, and where control is weakest. Organizations that have not invested in modernizing their integration layer discover that their ability to scale AI is constrained not by the model, but by the infrastructure it depends on.
Gap 3: Governance Was Designed for Review, Not Runtime
Traditional governance answers two questions: who has access, and how data is protected. Those are necessary, but they are insufficient when AI systems begin to act. The governance question shifts from access to behavior: what the system is allowed to do, under what conditions, and when it must stop. Most organizations have not made that shift, and the gap becomes visible under scale.
Consider an AI-enabled process interacting with procurement systems. Under normal conditions, transactions fall within predictable ranges. The system identifies a legitimate need and initiates an order – but instead of operating within expected thresholds, it submits a request for 5,000 units. The request is technically valid. The system executes: inventory is allocated, downstream workflows trigger, and the impact propagates before intervention occurs.
This is not a model failure. It is the system operating without defined boundaries. Governance in an AI context must operate at runtime – establishing thresholds for what can be executed autonomously, what triggers escalation, and what requires intervention. Without those controls, organizations are not scaling automation; they are scaling unbounded execution.
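As an illustration only, a runtime threshold for the procurement scenario might look like the sketch below. The `OrderPolicy` limits (500 and 2,000 units) and all names are assumptions chosen for the example; actual thresholds would come from the business owners of the process.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"    # agent may proceed autonomously
    ESCALATE = "escalate"  # route to a human approver
    BLOCK = "block"        # refuse and raise an alert


@dataclass(frozen=True)
class OrderPolicy:
    auto_limit: int  # largest order an agent may place without review
    hard_limit: int  # orders above this are refused outright


def evaluate_order(units: int, policy: OrderPolicy) -> Decision:
    """Decide at runtime how an agent-initiated order should be handled."""
    if units <= 0 or units > policy.hard_limit:
        return Decision.BLOCK
    if units > policy.auto_limit:
        return Decision.ESCALATE
    return Decision.EXECUTE


# Hypothetical limits for a single product category.
policy = OrderPolicy(auto_limit=500, hard_limit=2000)

for requested in (120, 900, 5000):
    print(requested, evaluate_order(requested, policy).value)
```

The point of the sketch is the placement of the check: it runs before the order executes, so the 5,000-unit request is stopped before inventory is allocated rather than discovered afterward.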
Architectural Discipline: What the Best Organizations Understood Early
Two decisions Amazon made early in its evolution illustrate what most organizations are missing.
The first was the API mandate. Every team was required to expose its data and functionality through service interfaces. No direct data reads. No shortcuts. No exceptions. At the time, this looked like unnecessary overhead – slowing teams down to enforce architectural purity that did not appear tied to immediate business value.
In practice, it did the opposite. That discipline forced the organization to treat every internal capability as something that could be reliably accessed, reused, and scaled. What began as an internal constraint became the foundation for AWS. The architecture that enabled internal coordination became the product that reshaped the market.
The lesson is not that APIs matter. It is that architectural discipline compounds.
Most enterprises today are at a similar inflection point with AI. Building a governed, standardized, and observable integration layer feels like overhead when the immediate pressure is to deliver AI use cases quickly. But the organizations that make that investment early are not slowing down – they are building the system that allows them to scale without rework.
The second decision was embedding security directly into product teams rather than treating it as a downstream checkpoint. The insight was simple: retrofitting control after deployment is always more expensive than designing it in from the start. The cost is not just remediation, but exposure, delay, and lost trust.
In the context of AI, that principle becomes more urgent. AI systems do not just process data – they act on it. They interact with APIs, trigger workflows, and execute decisions at scale and speed traditional systems never did. When control is applied after the fact, the system has already moved. The cost is no longer theoretical.
Organizations that treat governance and security as design inputs are not adding friction. They are creating the conditions required for safe autonomy. Those that defer it are not moving faster. They are accumulating risk that will surface later, often at scale.
MCP and the Shift to a Scalable Integration Model
Most organizations are still approaching AI integration as a series of individual connections – each new agent or use case requiring its own pathway into enterprise systems. That model does not scale.
BCG’s analysis reinforces the same pattern: integration effort rises quadratically as agents proliferate. Each connection introduces its own security considerations, governance logic, and failure modes. Over time, the environment becomes more complex, not more capable.
Amazon’s API mandate is the right mental model for understanding why Model Context Protocol (MCP) is the most consequential infrastructure development in enterprise AI since the large language model itself. Rather than treating each AI interaction as a separate integration problem, MCP introduces a centralized interaction layer. Systems connect once. Agents consume those capabilities through a shared interface. Integration effort then grows linearly rather than quadratically, and integration shifts from a repeated implementation task to an architectural decision made once and reused.
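The scaling difference can be sketched with simple arithmetic. The model below is an assumption made for illustration: every agent needs access to every system, point-to-point integration requires one connection per agent-system pair, and a shared layer requires one connection per agent plus one per system.

```python
def point_to_point(agents: int, systems: int) -> int:
    """Connections needed when each agent integrates with each system directly."""
    return agents * systems


def shared_layer(agents: int, systems: int) -> int:
    """Connections needed when agents and systems each connect once to a shared layer."""
    return agents + systems


# Compare both models as agents and systems grow together.
for n in (2, 5, 10, 20):
    print(f"{n} agents, {n} systems: "
          f"point-to-point={point_to_point(n, n)}, shared={shared_layer(n, n)}")
```

Under these assumptions, 20 agents and 20 systems need 400 direct integrations but only 40 connections to a shared layer, and each direct integration carries its own security and governance logic.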
This is not just an efficiency gain. It is a control gain. Centralization allows governance to be applied consistently. It makes system behavior observable. It reduces duplication not only of integration effort, but of risk. Without a model like this, organizations do not just accumulate integrations – they accumulate inconsistency, with each new use case introducing variation in how systems are accessed and controlled.
With a centralized layer, those decisions are made once and enforced consistently. MCP adoption is accelerating not because organizations need another tool, but because they need a way to prevent integration complexity from becoming the limiting factor in AI scale.
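The sketch below illustrates the general idea of a centralized layer where access rules and logging are defined once. It is not the MCP specification or any MCP SDK; `CapabilityRegistry`, the capability name, and the agent names are hypothetical and exist only to show the pattern.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, List, Tuple


class CapabilityRegistry:
    """A toy shared layer: systems register once, agents call through one path."""

    def __init__(self) -> None:
        self._capabilities: Dict[str, Tuple[Callable[..., Any], FrozenSet[str]]] = {}
        self.audit_log: List[Tuple[str, str, bool]] = []

    def register(
        self, name: str, handler: Callable[..., Any], allowed_agents: Iterable[str]
    ) -> None:
        """Expose a system capability and declare which agents may use it."""
        self._capabilities[name] = (handler, frozenset(allowed_agents))

    def call(self, agent: str, capability: str, /, **kwargs: Any) -> Any:
        """Apply the access rule, record the attempt, then invoke the handler."""
        handler, allowed = self._capabilities[capability]
        permitted = agent in allowed
        self.audit_log.append((agent, capability, permitted))
        if not permitted:
            raise PermissionError(f"{agent} may not call {capability}")
        return handler(**kwargs)


registry = CapabilityRegistry()

# A hypothetical inventory system registers its lookup capability once.
registry.register(
    "inventory.lookup",
    lambda sku: {"sku": sku, "on_hand": 42},
    allowed_agents={"procurement-agent"},
)

result = registry.call("procurement-agent", "inventory.lookup", sku="A-1")
print(result)
print(registry.audit_log)
```

Every call, permitted or not, passes through the same check and lands in the same log, which is what makes behavior observable and governance consistent across use cases.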
Organizations that already built disciplined integration practices over the past decade are positioned to adopt this model quickly. Those that did not are not facing a tooling gap. They are facing a sequencing problem, one that no model selection decision can resolve.
What the 5% Did Differently
The organizations producing measurable AI returns did not get there by solving for intelligence first. They solved for execution first. Across the market, the pattern is consistent. While adoption is widespread, BCG reports only 5% of organizations have translated that adoption into measurable value. The difference is not access to models. It is how those organizations approach the problem.
They recognized early that the model would not be the limiting factor. The constraint would be the system the model operated within. That recognition changed how they invested. They did not delay AI initiatives while waiting for perfect infrastructure. But they also did not deploy into environments that were not prepared to support them. Instead, they sequenced their investments around the conditions required for AI to operate reliably at scale.
That sequencing is what separates early momentum from durable outcomes. It comes down to five decisions:
- Audit the API portfolio before expanding agent footprint. Understand which systems agents will interact with, how those systems are exposed, and where legacy interfaces introduce risk. APIs that were ‘good enough’ for human-paced interaction are not sufficient for continuous, machine-driven access.
- Master data before feeding it to an agent. Treat master data as an operational dependency, not a reporting asset. Deploying AI into inconsistent or untrusted data environments does not accelerate outcomes – it accelerates error. Ensure that the data your systems depend on is governed, synchronized, and usable in real time.
- Treat MCP as the architectural decision it is. Approach integration as a reusable layer that supports multiple use cases, not as a series of one-off implementations. MCP reduces duplication, centralizes control, and creates consistency in how systems are accessed and governed.
- Define governance thresholds before deployment, not after the first incident. Establish how systems are allowed to behave before agents are introduced: what can be executed autonomously, what requires escalation, and where human intervention remains necessary. Treat governance as part of system design, not as a response to failure.
- Be appropriately skeptical of experience claims in this space. The AI integration layer did not exist in its current form 18 months ago. The more meaningful signal is depth in the disciplines that actually determine outcomes – API management, data mastering, governance, security architecture – and the ability to apply them in an AI context with real production experience.
None of these decisions accelerate a pilot. They are not visible in early demonstrations. But they determine what can be scaled, trusted, and extended. This is where most organizations diverge. They optimize for early progress – getting a use case live, proving capability, demonstrating momentum.
The organizations in the 5% optimize for durability. They invest in the conditions that allow AI to operate reliably, not just impressively. Over time, the difference becomes structural. Organizations that focus on capability accumulate pilots. Organizations that focus on system readiness accumulate outcomes.
The Verdict: Most Organizations are Solving the Wrong Problem
The model is improving rapidly. It will continue to become more capable, more accessible, and more cost-efficient. The system that makes the model usable in an enterprise context is not improving at the same pace – and that is where the gap is forming. Most organizations are still focused on selecting and optimizing the model, because that is the most visible part of the system. It is also the easiest to compare, the fastest to evaluate, and the most straightforward to explain. But it is not where enterprise outcomes are determined.
Those outcomes are determined by the data the model depends on, the systems it interacts with, and the boundaries within which it operates. This is why the results are so uneven. Organizations deploying similar models are seeing fundamentally different outcomes, not because the models behave differently, but because the environments they are deployed into do. One environment supports reliable interaction, consistent data, and controlled execution. The other does not.
AI does not create that difference. It exposes it. The organizations pulling ahead are not the ones that selected the best model. They are the ones that built a system capable of making any model work. They recognized that intelligence without infrastructure does not scale, and that automation without control does not create value. Most organizations are still optimizing the wrong layer. Until that changes, the pattern will remain the same: strong pilots, uneven production performance, and returns that never fully match the expectation that justified the investment.
The model is not the constraint. The system is.
Most organizations we speak with already know something isn’t working. They’re just not certain where the failure is actually occurring. If your AI program is producing noise rather than decisions, it’s almost always a system readiness problem, not a model problem. We can help you identify which one it is in a single conversation. Request a consultation.
