1. The Decision Isn't Binary — and the Framing That Makes It Binary Is the Problem
The first serious question an enterprise AI program runs into, after the initial enthusiasm settles, is where the models should actually run. The conversation tends to polarise quickly. The security team argues for on-premise: keep the data inside, eliminate the question of cross-border transfer. The product team argues for external: the frontier models do things the internal ones can't, and falling behind is more dangerous than the controlled risk. The platform team argues for whatever they can stand up fastest. And in most organisations, the conversation goes around this triangle for months before someone notices that the three positions describe different workflows, and that the right answer is probably some of each.
Underneath that triangle sits a single question: should we use external LLMs or run our own? The framing assumes one answer applies uniformly across the company. It doesn't.
A typical enterprise has dozens of workflows where AI would be useful. Some of them — drafting marketing copy, summarising public documents, generating internal training content — have no meaningful sensitivity constraint. The data is fine to send anywhere. The most capable models, regardless of where they're hosted, are the right choice. There's no operational case for running internal infrastructure to handle these.
Other workflows in the same enterprise — processing customer service tickets, analysing operational logs, drafting clinical notes, reviewing loan applications — sit under constraints that range from contractual data location commitments to sector-specific data posture to internal governance about what gets sent where. The same model that's fine for marketing copy is not fine for these. The choice for these workflows is real.
A third category — classified defence operations, lawful intercept data, certain healthcare scenarios — sits under constraints that simply do not allow any external endpoint, regardless of safeguards. The choice for these workflows is also real, but it's a different choice.
Treating these three categories as if they're one decision produces bad architecture. You either over-protect the easy workflows (running them on internal infrastructure that costs operational effort without delivering proportionate value) or under-protect the hard ones (sending them to external endpoints because that's the company's "AI strategy"). The realistic answer is to make different decisions for different workflow categories, and to operate the infrastructure that lets those decisions coexist.
2. External LLMs — What They're Actually Good At, and What Limits Them
The case for external LLMs is straightforward and worth being precise about. The frontier models — GPT-5-class systems, Claude Opus-class, Gemini Ultra-class — are operating at a capability level that no enterprise is going to match with internal infrastructure. They have context windows that handle entire document portfolios. They reason over complex structures with quality that smaller models can't approach. They get better every few months without the enterprise paying for retraining cycles. For workflows where capability matters, this is not a marginal advantage. It's a different category of system.
The limits are also worth being honest about. External LLMs are accessed through APIs operated by their vendors, with the model running on the vendor's infrastructure, in the vendor's data centres, under the vendor's operational control. The enterprise sends data to those endpoints and receives responses back. The vendor's contractual commitments (data processing agreements, regional endpoints, deletion policies) describe what the vendor will and won't do with the data. They don't describe what the law of the vendor's host country might allow other parties to do with it.
For workflows where the data isn't sensitive, this is a non-issue. For workflows where the data is sensitive but the data location constraint is satisfiable through architectural choices (regional endpoints, transformation before transmission, customer-controlled mapping), this is solvable — and the broader pillar on this approach covers how.
For workflows where the data is sensitive and the constraint is that no version of the data, however transformed, can be sent to an external endpoint, external LLMs are not an option. Some workflows genuinely fall into this category. The mistake is assuming all sensitive workflows do.
3. On-Premise LLMs — The Strengths and the Brittle Parts
The case for running models on internal infrastructure is also straightforward at first glance. The data never leaves. The contractual questions about cross-border transfer simply don't arise. For workflows under absolute data location constraints — defence, certain regulated healthcare categories, certain financial transaction segments — this is the only option, and it's worth taking seriously.
The strengths are real:
- Data location is solved by definition. The model runs where the data is; the question of transfer doesn't engage. For workflows under sector-specific commitments that bind the enterprise to keep operational data in defined boundaries, this maps cleanly onto the constraint.
- Operational control is total. The model can be tuned, fine-tuned, evaluated, monitored, and rolled back under the enterprise's own change-management processes. There's no vendor on the other side making model updates the enterprise didn't approve.
- Latency can be made predictable. A model running on the enterprise's own network avoids the round-trip to an external endpoint, which matters for some real-time workflows.
The costs are also real, and tend to be underestimated by teams that haven't run production AI infrastructure before:
- Model capability lags. The open-source models an enterprise can practically host — Llama, Mistral, Qwen, and their successors — are good models. They are not, at any given moment, as capable as the frontier external models on the dimensions enterprises usually care about: reasoning over complex documents, handling unfamiliar formats, long-context analysis. The gap is typically twelve to eighteen months and may narrow over time, but it doesn't close.
- The operational cost is large and continuous. Running a serious model in production means GPU infrastructure, model serving infrastructure (vLLM, TGI, or similar), evaluation pipelines, monitoring, and the team that knows how to keep all of it running. The cost is dominated not by the GPUs but by the team. A small AI infrastructure team for enterprise serving is five to ten engineers; a serious one is double that. The model itself is the cheap part.
- The brittleness shows up in updates. Every few months, the frontier external models leap forward in ways that change what business teams expect. The internal model, however well-tuned, doesn't leap. The gap between "what AI can do" in the public conversation and "what our internal AI can do" in the enterprise grows, and the pressure to do something about it grows with it.
For workflows where the constraint is absolute — and where the workflow is bounded enough that a capable smaller model is sufficient — on-premise is the right answer. For workflows where the constraint is real but not absolute, and where the AI capability matters, on-premise alone produces a system that works but underperforms.
4. Hybrid — Not a Compromise, but the Architecturally Honest Answer
The third option — running a hybrid topology where some workflows go to external endpoints and some run on internal infrastructure — gets dismissed too quickly in enterprise conversations. The dismissal usually takes one of two forms.
The first is operational: "running both is more complex than running one, so we should pick one." This is true but misses the point. Running both isn't more complex if the workflows that need each are different workflows. The complexity of a hybrid topology is in the routing layer that decides which model gets which workflow. That routing layer is not optional: even a pure-external or pure-on-premise enterprise has one, though a trivial one. Once the routing layer exists, supporting two backends is incremental complexity, not categorical complexity.
The second is governance: "we should have one policy for AI." This is also true and also misses the point. The single policy isn't "all AI runs externally" or "all AI runs on-premise." The single policy is "workflows of class X run externally with these safeguards; workflows of class Y run on-premise; the routing is enforced and audited." That's a coherent governance posture, and it's the one most large enterprises end up with whether they planned for it or not.
What makes hybrid the architecturally honest answer is that it matches the actual structure of enterprise AI workloads. Some workflows benefit massively from frontier capability and tolerate transformation-based safeguards. Some workflows have constraints that rule out external endpoints regardless of safeguards. Forcing all workflows into either bucket produces a system that's wrong for some of them. Hybrid lets each workflow get the deployment it needs.
The hard part of hybrid isn't running two backends. It's making the routing decision precise enough that workflows go where they should, with the policy enforced rather than ignored, and the audit trail clear enough that the team can defend the decisions later. That's a design problem, not an infrastructure problem.
5. A Decision Framework for Workflows
A useful way to make the decision concrete is to evaluate each workflow against four questions:
- What is the sensitivity of the data the workflow operates on? Not the maximum sensitivity of any data anywhere in the enterprise — the sensitivity of the specific data this workflow needs. Marketing copy and customer records are different workflows even if the same business unit owns both.
- What is the capability requirement of the AI task? Some tasks — classification of incoming tickets into ten categories, simple entity extraction, format conversion — work well on smaller models. Some tasks — reasoning over a hundred-page contract, synthesising root causes from heterogeneous logs, drafting clinical narratives — need frontier capability.
- What is the constraint structure? Is the constraint absolute (no version of the data can leave) or conditional (the data can leave if transformed appropriately)? Is it driven by external commitments (customer agreements, sector commitments) or internal governance (the company's own data posture)?
- What is the workflow's tolerance for capability degradation? Some workflows will deliver value with a model that's twelve months behind the frontier. Some workflows are themselves the differentiating capability the business is trying to build, and the model gap is the gap.
The answers cluster into rough categories:
| Workflow category | Examples | Deployment fit |
|---|---|---|
| Low-sensitivity, high-capability requirement, no constraint | Marketing copy, public document summaries, internal training content | External LLM, no transformation |
| Sensitive but transformable, high-capability requirement, conditional constraint | Customer service tickets, operational logs, contract review, clinical notes | External LLM + transformation layer (Path A) |
| Sensitive, capability-tolerant, absolute constraint | Defence workflows, certain classified categories, lawful intercept | On-premise model (Path B) |
| Sensitive, high-capability requirement, absolute constraint | The hardest category — the gap between what's needed and what's allowed | Narrow the task, wait for on-premise capability, or accept the gap |
Most enterprises find that their workflow inventory spans all four categories. The deployment architecture has to handle all four, which means hybrid by default.
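The table above can be read as a small decision function. The following is a minimal sketch, not a definitive classifier: the `Workflow` fields, the `Constraint` enum, and the `deployment_fit` function are hypothetical names introduced for illustration. It folds the fourth question (tolerance for capability degradation) into a single `needs_frontier` flag, and it assumes that combinations the table doesn't list (for example, a conditional constraint with a low capability requirement) default to Path A.

```python
from dataclasses import dataclass
from enum import Enum


class Constraint(Enum):
    NONE = "none"
    CONDITIONAL = "conditional"  # data may leave if transformed appropriately
    ABSOLUTE = "absolute"        # no version of the data may leave


@dataclass(frozen=True)
class Workflow:
    name: str
    sensitive: bool
    needs_frontier: bool  # assumption: capability need and tolerance folded into one flag
    constraint: Constraint


def deployment_fit(wf: Workflow) -> str:
    """Map a workflow onto the 'Deployment fit' column of the table."""
    if wf.constraint is Constraint.ABSOLUTE:
        if wf.needs_frontier:
            # The hardest category: what is needed exceeds what is allowed.
            return "narrow the task, wait, or accept the gap"
        return "on-premise model (Path B)"
    if wf.sensitive or wf.constraint is Constraint.CONDITIONAL:
        return "external LLM + transformation layer (Path A)"
    return "external LLM, no transformation"


inventory = [
    Workflow("marketing-copy", False, True, Constraint.NONE),
    Workflow("customer-tickets", True, True, Constraint.CONDITIONAL),
    Workflow("lawful-intercept", True, False, Constraint.ABSOLUTE),
]
for wf in inventory:
    print(f"{wf.name}: {deployment_fit(wf)}")
```

Running the function over a real workflow inventory is mostly useful as a forcing device: every workflow has to be given an explicit sensitivity, capability, and constraint value before it gets a deployment path.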
6. What Hybrid Looks Like in Practice
A hybrid deployment isn't "we have one model here and one model there." It is an integration architecture where the routing decision — which backend handles which request — is policy-driven, auditable, and consistent.
The components that make this work:
6.1 A Unified Integration Layer
Workflows don't call the external endpoint or the on-premise model directly. They call an integration layer that handles the routing. This means workflows are insulated from changes in the backend selection. If a workflow needs to move from external to on-premise (because of a new constraint, a vendor change, or a policy update), the workflow doesn't change — the routing rule does.
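One way to picture this insulation is a thin layer that owns the mapping from workflow tag to backend. This is a sketch under assumptions: `IntegrationLayer`, `complete`, `reroute`, and the two stub backends are hypothetical names, and the stubs stand in for a real external API client and a real on-premise serving endpoint.

```python
from typing import Callable, Dict

# A backend is anything that takes a prompt and returns a completion.
Backend = Callable[[str], str]


def external_stub(prompt: str) -> str:
    """Placeholder for a call to an external LLM endpoint."""
    return f"[external] {prompt}"


def on_prem_stub(prompt: str) -> str:
    """Placeholder for a call to an on-premise model server."""
    return f"[on-prem] {prompt}"


class IntegrationLayer:
    def __init__(self, backends: Dict[str, Backend], routes: Dict[str, str]) -> None:
        self._backends = backends
        self._routes = dict(routes)  # workflow tag -> backend name

    def complete(self, workflow_tag: str, prompt: str) -> str:
        """The only call a workflow makes; it never names a backend."""
        backend_name = self._routes[workflow_tag]
        return self._backends[backend_name](prompt)

    def reroute(self, workflow_tag: str, backend_name: str) -> None:
        """Move a workflow to another backend without touching workflow code."""
        if backend_name not in self._backends:
            raise ValueError(f"unknown backend: {backend_name}")
        self._routes[workflow_tag] = backend_name


layer = IntegrationLayer(
    backends={"external": external_stub, "on_prem": on_prem_stub},
    routes={"ticket-triage": "external"},
)
print(layer.complete("ticket-triage", "Classify this ticket"))
layer.reroute("ticket-triage", "on_prem")
print(layer.complete("ticket-triage", "Classify this ticket"))
```

The workflow's call to `complete` is identical before and after the reroute, which is the property the section describes.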
6.2 Policy-Driven Routing
The decision of where a workflow runs is encoded in policy, not in the workflow's own code. "Workflows tagged as customer-service in the EU region go to the external endpoint with the encapsulation layer. Workflows tagged as classified go to the on-premise model. Workflows tagged as marketing-content go to the external endpoint without transformation." The policy is versioned and auditable.
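The three example rules quoted above could be expressed as data rather than code. The sketch below is illustrative only: the `Rule` and `Policy` types, the version label `"v1"`, and the backend names are assumptions. It also assumes a deny-by-default stance, where a request that matches no rule is rejected rather than sent to a fallback backend.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    tag: str
    region: Optional[str]  # None matches any region
    backend: str           # "external" or "on_prem"
    transform: bool        # whether the encapsulation layer is applied


@dataclass(frozen=True)
class Policy:
    version: str
    rules: Tuple[Rule, ...]

    def decide(self, tag: str, region: str) -> Rule:
        """Return the first matching rule; refuse requests no rule covers."""
        for rule in self.rules:
            if rule.tag == tag and rule.region in (None, region):
                return rule
        raise LookupError(
            f"policy {self.version}: no rule for tag={tag!r}, region={region!r}"
        )


POLICY = Policy(
    version="v1",  # hypothetical version label, recorded with every decision
    rules=(
        Rule("customer-service", "EU", "external", True),
        Rule("classified", None, "on_prem", False),
        Rule("marketing-content", None, "external", False),
    ),
)

decision = POLICY.decide("customer-service", "EU")
print(POLICY.version, decision.backend, decision.transform)
```

Because the policy object is immutable and carries its own version, each routing decision can be traced back to the exact rule set that produced it.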
6.3 A Transformation Layer for the External Path
When workflows are routed to the external endpoint, sensitive elements are transformed before transmission and reconstructed on response. The same transformation infrastructure works regardless of which external endpoint is the routing target — the abstraction is over the backend, not over a specific vendor.
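As a rough illustration of transform-then-reconstruct, the sketch below replaces email addresses with placeholders before the prompt leaves, keeps the mapping locally, and restores the originals in the response. It is a deliberately narrow example, not a description of any product's implementation: the function names are hypothetical, the regular expression is a simplification that covers only common email shapes, and a real transformation layer would handle many more entity types.

```python
import re
from typing import Callable, Dict, Tuple

# Simplified pattern for illustration; not a complete email grammar.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenise(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a placeholder; return text and mapping."""
    mapping: Dict[str, str] = {}  # placeholder -> original value (stays local)
    seen: Dict[str, str] = {}     # original value -> placeholder

    def swap(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"
            seen[value] = placeholder
            mapping[placeholder] = value
        return seen[value]

    return EMAIL.sub(swap, text), mapping


def reconstruct(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into the backend's response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


def call_external(prompt: str, backend: Callable[[str], str]) -> str:
    """Vendor-agnostic wrapper: any callable backend sees only the safe prompt."""
    safe_prompt, mapping = tokenise(prompt)
    return reconstruct(backend(safe_prompt), mapping)


def echo_backend(prompt: str) -> str:
    """Stand-in for an external endpoint; echoes what it received."""
    return f"Summary: {prompt}"


print(call_external("Reply to alice@example.com and bob@example.org", echo_backend))
```

Because `call_external` accepts any callable, swapping the external vendor changes the `backend` argument and nothing else, which mirrors the point that the abstraction is over the backend rather than a specific vendor.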
6.4 Shared Governance
The audit logs, the policy management, the access controls cover both paths uniformly. The team doesn't have two governance frameworks, one for external and one for on-premise. They have one, with the path of each request recorded.
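A single audit record shape for both paths might look like the following. This is a sketch under assumptions: the field names, the `"A"`/`"B"` path labels, and the in-memory `AuditLog` are hypothetical, and a production log would be written to durable, append-only storage.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    workflow_tag: str
    policy_version: str
    path: str          # "A" = external + transformation, "B" = on-premise
    transformed: bool


class AuditLog:
    """One log for both paths; the path is a field, not a separate system."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, rec: AuditRecord) -> None:
        if rec.path not in ("A", "B"):
            raise ValueError(f"unknown path: {rec.path}")
        self._lines.append(json.dumps(asdict(rec), sort_keys=True))

    def by_path(self, path: str) -> List[dict]:
        """Return all recorded requests that took the given path."""
        return [r for r in map(json.loads, self._lines) if r["path"] == path]


log = AuditLog()
log.record(AuditRecord("r-1", "customer-service", "v1", "A", True))
log.record(AuditRecord("r-2", "classified", "v1", "B", False))
print(len(log.by_path("A")), len(log.by_path("B")))
```

Keeping the path as a field on a shared record means a reviewer can ask one question of one log, such as which requests under a given policy version went external, instead of reconciling two separate frameworks.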
When these four components are in place, the question of "where should this workflow run" becomes a policy decision rather than an architecture commitment. The workflow can move between paths as needs change without rewriting integrations.
7. What Goes Wrong When One Path Is Treated as the Entire Answer
Two failure modes show up consistently in enterprises that commit to a single deployment path.
- The all-external organisation. A company that decides external LLMs are the entire AI strategy will run into workflows the policy won't allow, and one of two things happens. Either those workflows don't get AI (and the company falls behind on the work where AI matters most), or the workflows get AI through unofficial channels — employees pasting sensitive data into consumer chatbots, business units procuring AI tools outside the central process, vendors integrated without the central security review. Shadow AI is the predictable consequence of an AI strategy that doesn't account for workflows that can't fit it.
- The all-on-premise organisation. A company that decides external LLMs are unacceptable across the board will run into the capability gap. Internal models will be good enough for some workflows and not for others. The workflows where they're not good enough will either underperform (and the company's competitors will pull ahead on those tasks), or business units will route around the policy through the same shadow channels. The discipline of an all-on-premise posture is harder to maintain than it looks, especially as the frontier external models continue to advance.
Both failure modes share a structure: a policy that doesn't account for the heterogeneity of enterprise workflows ends up bypassed in the workflows that don't fit, and the bypass is harder to govern than the explicit path would have been.
8. The Architecture This Leads To
For most regulated enterprises, the deployment that emerges from working through these questions has a few consistent properties.
There is a backend for external LLMs, accessed through a transformation layer that handles the cases where data sensitivity requires it. There is an on-premise model — usually a smaller, well-chosen open-source model — that handles workflows where external endpoints aren't viable. There is a routing layer that decides which workflow goes where based on policy. There is a unified governance and audit framework covering both paths. And there is the recognition that this isn't a finished state — workflows move between paths as constraints change, new external models become available, new on-premise capabilities mature, and the business's own posture evolves.
LLM Capsule supports both paths as part of a single architecture: Path A routes workflows to an approved external LLM through the encapsulation layer that handles tokenisation, structure preservation, and reconstruction inside the enterprise environment; Path B routes workflows to an on-premise local model when external endpoints aren't an option. The same encapsulation layer, the same policy framework, and the same audit trail cover both. The deployment decision becomes a policy choice per workflow, not an architecture commitment for the company.
Key Takeaways
- The deployment question isn't binary: different workflows have different sensitivity, capability requirements, and constraints, so a single-path policy is wrong for some of them
- External LLMs deliver frontier capability through contractual (not architectural) protection — fine for low-sensitivity workflows, solvable with a transformation layer for conditional constraints, off-limits for absolute constraints
- On-premise solves data location by definition but lags the frontier by 12–18 months; the dominant cost is the team (5–10 engineers minimum), not the GPUs
- Hybrid is the architecturally honest answer — it matches the actual structure of enterprise workloads instead of forcing them into one bucket
- Four questions sort workflows into four categories: sensitivity, capability requirement, constraint structure (absolute vs conditional), tolerance for capability degradation
- Working hybrid needs four components: unified integration layer, policy-driven routing, transformation layer for the external path, shared governance
- All-external organisations produce shadow AI; all-on-premise organisations produce capability-gap workarounds — both failure modes are bypassed policies
- Path A (external + encapsulation) and Path B (on-premise local) under one policy and audit framework is the deployment most regulated enterprises end up with, whether planned or not
Frequently Asked Questions
Why isn't the deployment decision just external or on-premise?
Because a typical enterprise has dozens of workflows with different sensitivity, capability requirements, and constraints. Marketing copy and customer service tickets are different workflows even if the same business unit owns both. Treating these as a single decision either over-protects the easy workflows (running them on internal infrastructure that costs operational effort without delivering proportionate value) or under-protects the hard ones (sending them to external endpoints because that's the company's "AI strategy"). The realistic answer is different decisions for different workflow categories.
What are external LLMs actually good at, and what are their limits?
Frontier external models (GPT-5-class, Claude Opus-class, Gemini Ultra-class) operate at a capability level no enterprise will match internally — long context windows, complex document reasoning, regular improvement without retraining cycles. For workflows where capability matters, this is a different category of system. The limit is that the model runs on the vendor's infrastructure under the vendor's contractual commitments, not architectural guarantees. For workflows where data sensitivity rules out any external endpoint regardless of safeguards, external LLMs are not an option.
When is on-premise the right answer?
When the constraint is absolute (defence, certain classified categories, workflows where no version of the data, however transformed, can leave) and the workflow is bounded enough that a capable smaller model suffices. The strengths are real — data location is solved by definition, operational control is total, latency can be made predictable. The costs are also real and tend to be underestimated: capability lag of twelve to eighteen months versus frontier models, continuous operational cost dominated by the team (five to ten engineers minimum), and brittleness when the frontier leaps forward and the internal model doesn't.
Isn't running a hybrid topology more complex than picking one path?
Only superficially. Running both isn't more complex if the workflows that need each are different workflows, and they are. The complexity of hybrid lives in the routing layer that decides which model gets which workflow. That routing layer isn't optional: even pure-external or pure-on-premise enterprises have one, though a trivial one. Once the routing layer exists, supporting two backends is incremental complexity, not categorical complexity. The hard part is making the routing decision precise enough that workflows go where they should, with the policy enforced and the audit trail clear.
How do you decide which workflow goes where?
Evaluate each workflow against four questions. Data sensitivity (of the specific data this workflow needs, not the maximum anywhere in the enterprise). Capability requirement (does the task need frontier reasoning, or will a smaller model suffice). Constraint structure (absolute or conditional, externally driven or internal governance). Tolerance for capability degradation. The answers cluster into rough categories: low-sensitivity goes external, transformable-with-high-capability goes external with transformation, absolute-constraint-but-bounded goes on-premise, absolute-with-high-capability is the hardest — sometimes not fully solvable today.
What goes wrong with an all-external or all-on-premise approach?
All-external organisations run into workflows the policy won't allow, and either those workflows don't get AI (the company falls behind on the work where AI matters most) or they get AI through unofficial channels — shadow AI is the predictable consequence. All-on-premise organisations run into the capability gap; internal models will be good enough for some workflows and not for others, and business units route around the policy through the same shadow channels. Both failure modes share a structure: a policy that doesn't account for workflow heterogeneity ends up bypassed.
What does a working hybrid architecture actually look like?
Four components. A unified integration layer (workflows call this layer, not the backends directly, so they're insulated from routing changes). Policy-driven routing (the decision of where a workflow runs is encoded in policy, versioned and auditable, not hard-coded in the workflow). A transformation layer for the external path (sensitive elements transformed before transmission and reconstructed on response, abstracted over which external vendor is the target). Shared governance (one audit log, one policy framework, one access control model covering both paths uniformly). When all four are in place, where a workflow runs becomes a policy decision rather than an architecture commitment.