1. Why Teams Choose to Self-Host in the First Place
The decision to run an LLM on the enterprise's own infrastructure is reasonable. There are workflows where it's the right answer, environments where it's the only answer, and strategic reasons to invest in internal AI capability even when external options exist. None of this is the question.
The question is whether the team making the decision understands what it is actually committing to. In many enterprise AI programs, the choice to run internally rests on a cost model that turns out to be wrong in predictable ways. The model itself (the actual weights, the open-source release the team is going to deploy) is the cheapest part of the deployment. The rest of the cost structure is where the surprises live, and most of them arrive months later, when the commitment is too far along to reverse.
This article walks through the cost layers of running a serious LLM internally, what each one actually involves, and where teams consistently under-budget. It's not an argument against self-hosting. It's an attempt to make the decision an informed one.
Before getting into the costs, it's worth being precise about the reasons that lead enterprises to self-host. The reasons are real and the architecture should match them.
- Data location constraints are absolute. Some workflows cannot send data to any external endpoint, regardless of transformation or safeguards: defence operations, certain regulated healthcare categories, lawful intercept handling, classified financial workflows. For these, self-hosting isn't a choice; it's the only architecture that works. The cost analysis is then about which internal deployment to run, not whether to deploy internally.
- Sector-specific commitments restrict the deployment topology. A telecom operator under sector-specific data location requirements may need parts of its operational AI to run inside the operator's own network even when transformation-based approaches could theoretically address the constraint. The contract or the data posture is binding regardless of architecture.
- Strategic capability investment. Some enterprises view AI as a long-term capability they need to own, not rent. The reasoning is that AI is going to be central to the business, the vendor landscape is uncertain, and the company prefers to build internal expertise. This is a defensible position that doesn't depend on data location constraints at all.
- Cost projection at scale. For workflows with very high request volumes, the cumulative per-token cost of external API calls eventually exceeds the fixed cost of running infrastructure. For most enterprises this point is further out than they think, but the projection is sometimes the explicit reason for self-hosting.
- Operational predictability. External LLM vendors change models, prices, terms, and availability without the enterprise's approval. For workflows where this volatility is unacceptable, controlling the model lifecycle internally has real value.
Each of these is a legitimate basis for self-hosting. The mistake isn't in the reasoning — it's in the budget assumed once the decision is made.
2. The Model — The Cheap Part
The first surprise for teams new to this is that the model itself costs very little.
The leading open-source models (Llama, Mistral, Qwen, and their successors) are generally released under licences that allow enterprise use without per-token fees, although the exact terms vary by model and release and are worth checking. Downloading the weights costs nothing. Quantising them to run on the available hardware carries no licence fee, only some engineering time. There are typically no licensing negotiations, no usage-based billing, and no per-seat costs for the model as such.
This is the visible part of the cost analysis, and it's where the optimistic projections come from. "The model is free; running it ourselves must be cheaper than the API." The calculation looks decisive when only the model cost is in it.
What the calculation is missing is everything else.
3. Infrastructure — The Layer Most Teams Budget For
The next layer of cost is infrastructure, and this is the layer most teams do budget for, often realistically.
Running a serious open-source model in production means GPUs. The exact specification depends on the model size, the quantisation level, the throughput required, and the latency targets. A small deployment running a quantised mid-size model for low-throughput internal workflows might fit on a few enterprise GPUs at modest cost. A serious deployment running a larger model for high-throughput production workflows needs a dedicated GPU cluster with the cooling, power, and networking that implies.
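To make the dependence on model size and quantisation level concrete, here is a rough sizing sketch. It uses two standard back-of-envelope estimates: weight memory is the parameter count times the bits per parameter, and the KV cache holds one key and one value vector per layer, per KV head, per token. The `ModelShape` values are illustrative assumptions in the shape of a 70B-class model with grouped-query attention, not figures from this article, and the estimate ignores activation memory and serving-stack overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative dimensions; take the real values from the model's config."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # key/value heads (fewer than query heads under grouped-query attention)
    head_dim: int     # dimension of each attention head


def weight_memory_gb(shape: ModelShape, bits_per_param: int) -> float:
    """Memory for the weights alone at a given quantisation level."""
    return shape.n_params * bits_per_param / 8 / 1e9


def kv_cache_gb(shape: ModelShape, context_tokens: int,
                concurrent_requests: int, bytes_per_value: int = 2) -> float:
    """KV cache size: a key and a value vector per layer, per KV head, per token."""
    per_token_bytes = (2 * shape.n_layers * shape.n_kv_heads
                       * shape.head_dim * bytes_per_value)
    return per_token_bytes * context_tokens * concurrent_requests / 1e9


# Assumed shape and load, for illustration only.
shape = ModelShape(n_params=70e9, n_layers=80, n_kv_heads=8, head_dim=128)
cache = kv_cache_gb(shape, context_tokens=8192, concurrent_requests=16)

for bits in (16, 8, 4):
    weights = weight_memory_gb(shape, bits)
    print(f"{bits:>2}-bit weights: {weights:6.1f} GB, "
          f"KV cache: {cache:5.1f} GB, total: {weights + cache:6.1f} GB")
```

Under these assumptions, quantisation shrinks the weights but leaves the KV cache untouched, which is why throughput and context-length targets can end up driving the GPU count as much as model size does.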
The GPU cost can be calculated. A team that's done the work knows what they need and what it costs, whether that's hardware capex or cloud reserved instances or some combination. The capex side gets reasonable budget attention.
What's often less well-budgeted at this layer is the supporting infrastructure that makes the GPUs useful: the model serving stack (vLLM, TGI, or equivalent), the request routing and load balancing in front of the serving layer, the storage for model weights and caches, the network capacity to handle the request volume, the monitoring infrastructure that watches all of it. These aren't expensive individually but they add up, and they require deliberate planning. The serving stack in particular has matured significantly in recent years; it's no longer a research artefact, but it's still infrastructure that the team has to deploy and operate.
Beyond the initial deployment, the infrastructure has to scale with usage. A workflow that grows from a hundred requests per day to ten thousand requests per day needs more capacity, roughly in proportion once the deployment has outgrown its minimum footprint. This is the routine cost of running production systems, but it's worth factoring into the multi-year projection rather than assuming the day-one budget covers the long term.
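A capacity projection along these lines can be sketched in a few lines. Everything numeric here is an assumption for illustration: the tokens per request, the peak-to-average ratio, the per-replica throughput, and the target utilisation would all come from the team's own measurements.

```python
import math


def replicas_needed(requests_per_day: int,
                    tokens_per_request: int,
                    peak_to_average: float,
                    tokens_per_second_per_replica: float,
                    target_utilisation: float = 0.6) -> int:
    """Serving replicas needed to absorb peak load with some headroom."""
    average_tps = requests_per_day * tokens_per_request / 86_400  # seconds per day
    peak_tps = average_tps * peak_to_average
    usable_tps = tokens_per_second_per_replica * target_utilisation
    # At least one replica is needed even when the load is negligible.
    return max(1, math.ceil(peak_tps / usable_tps))


# Hypothetical growth path for one workflow.
for volume in (100, 10_000, 1_000_000):
    n = replicas_needed(volume, tokens_per_request=1_500, peak_to_average=4.0,
                        tokens_per_second_per_replica=1_000.0)
    print(f"{volume:>9,} requests/day -> {n} replica(s)")
```

The useful part of a sketch like this is less the absolute number than the way it forces the growth assumption into the multi-year budget explicitly.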
4. People — The Layer Most Teams Underestimate
The layer where the projections most consistently break down is people. Running a production LLM is not a thing the existing infrastructure team does on the side. It requires a dedicated team with specific skills, and those skills are expensive in the current market.
A minimal team for serious internal LLM deployment includes:
- ML platform engineers who understand model serving, quantisation, throughput optimisation, and the operational characteristics of the inference stack. Two to four engineers, depending on scale.
- ML operations engineers who handle the monitoring, alerting, capacity planning, and on-call rotation for the inference infrastructure. Two to three engineers, often overlapping with the platform engineers in smaller teams.
- Evaluation engineers who maintain the golden datasets, the evaluation pipelines, the quality monitoring, and the regression testing for model updates. One to two engineers, with significant input from the workflow teams that use the model.
- Workflow integration engineers who build and maintain the layer between the model API and the actual workflows in the business. The number scales with the number of workflows; typically two to three engineers per major workflow domain.
- Technical leadership with enough ML and infrastructure experience to make the architectural decisions and review the work. One senior engineer at minimum, often a small leadership group for larger deployments.
For an enterprise running internal LLMs for a few business units, this is six to ten people once the overlap between roles is taken into account. For a larger deployment serving multiple business units across a complex enterprise, it's fifteen to twenty.
The cost of these teams, in any major market, is substantial. ML engineers with the right skills earn well above the general engineering market. The team is small in headcount but expensive in burn rate. Over a three-year horizon, the people cost typically exceeds the infrastructure cost by a meaningful margin — for many deployments, by a factor of two or three.
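The comparison is simple arithmetic once both lines are written down. The sketch below uses placeholder figures: the headcount sits inside the six-to-ten range above, while the loaded cost per engineer and the annual infrastructure spend are assumptions that every enterprise would replace with its own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInputs:
    headcount: int
    loaded_cost_per_engineer: float  # annual: salary plus overheads (assumed)
    infrastructure_per_year: float   # GPUs, serving stack, monitoring (assumed)
    years: int = 3


def people_cost(inputs: CostInputs) -> float:
    return inputs.headcount * inputs.loaded_cost_per_engineer * inputs.years


def infrastructure_cost(inputs: CostInputs) -> float:
    return inputs.infrastructure_per_year * inputs.years


def people_to_infrastructure_ratio(inputs: CostInputs) -> float:
    return people_cost(inputs) / infrastructure_cost(inputs)


# Placeholder inputs, not benchmarks.
inputs = CostInputs(headcount=8,
                    loaded_cost_per_engineer=250_000,
                    infrastructure_per_year=800_000)

print(f"People, {inputs.years} years:         {people_cost(inputs):>12,.0f}")
print(f"Infrastructure, {inputs.years} years: {infrastructure_cost(inputs):>12,.0f}")
print(f"Ratio: {people_to_infrastructure_ratio(inputs):.2f}x")
```

A projection that contains only the second line is the one that tends to look decisive in favour of self-hosting.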
The people cost is the number most likely to be missing when the team proposing the project doesn't own the headcount budget. The infrastructure budget gets sponsored, but the team still has to be hired, and once it is, the rest of the program has to fund it.
5. Time — The Cost That Compounds
A category that's harder to put a number on but easy to underestimate: time. Specifically, the gap between deciding to self-host and having a useful workflow in production.
The decision is fast. Procuring the hardware or the cloud commitment takes weeks. The initial deployment of the serving stack and a first model takes more weeks. Getting the first workflow integrated and useful takes months. Achieving the operational maturity to run multiple workflows reliably takes longer still.
For most enterprises, the realistic timeline from "we'll build our own" to "we're running production AI on it" is somewhere between nine and eighteen months. Some teams move faster; many move slower. During this window, the enterprise either has no AI in the workflows the project is meant to serve, or it has AI through external endpoints that the project was meant to replace.
This matters for two reasons. The first is the opportunity cost: the workflows that would have been improved by AI aren't improved during the build period, and the business value of that delay is real. The second is the strategic risk: during the same window, the external LLM landscape continues to advance, and the internal model that finally goes into production may already be two iterations behind what the external alternative offers.
A common pattern: the internal model arrives at production quality just in time for the team to discover that the workflow it was supposed to support has been re-architected around external endpoints in the interim. The project succeeds technically and fails strategically.
6. Maintenance — The Cost That Doesn't End
Once the system is running, the costs don't stop. They shift to a different category.
- Model updates. Every few months, new open-source models get released with meaningfully better capabilities. The team has to evaluate each release, decide whether to upgrade, plan the migration, run the parallel evaluation, and execute the cutover. For a single model, this is a recurring engineering project — not enormous, but ongoing. For multiple models supporting different workflows, it's a significant portion of the team's time.
- Evaluation and quality monitoring. The model that worked well last quarter may not work as well on this quarter's input distribution. Workflows evolve, documents change, business contexts shift. The evaluation infrastructure has to catch quality drift before users complain. This is sustained engineering work, not something the team can set up once and forget.
- Infrastructure updates. The serving stack gets updates; the underlying GPU drivers change; security patches arrive; the cloud platforms deprecate things. None of this is dramatic, but it requires the team to stay current and apply changes without breaking the production workflows.
- Workflow evolution. As the business changes, the workflows the model supports change. New prompts have to be developed, evaluated, and rolled out. Old prompts have to be deprecated. Workflow integrations have to be updated. The team that built the system is the team that maintains the workflows it serves, which is more work than the initial build implied.
The maintenance cost is the cost that's hardest to project because it doesn't appear in the initial budget. It shows up over the multi-year horizon as the team's sustained burn against a growing portfolio of workflows. Enterprises that succeed with self-hosted AI tend to be the ones that planned for the maintenance investment from the start; enterprises that struggle are usually the ones that treated the initial deployment as the cost.
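Part of that sustained work is the regression check that runs on every model update. A minimal sketch of such a gate follows, assuming a golden dataset of prompt and expected-answer pairs and exact-match scoring. Both are simplifications: real evaluation pipelines usually need task-specific scoring, and the stub models, the example data, and the `max_drop` tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    prompt: str
    expected: str


def accuracy(model: Callable[[str], str], golden: Sequence[GoldenExample]) -> float:
    """Share of golden examples the model answers exactly (case-insensitive)."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(model(ex.prompt).strip().lower() == ex.expected.strip().lower()
               for ex in golden)
    return hits / len(golden)


def passes_regression_gate(baseline: Callable[[str], str],
                           candidate: Callable[[str], str],
                           golden: Sequence[GoldenExample],
                           max_drop: float = 0.02) -> bool:
    """Allow the cutover only if the candidate is within max_drop of the baseline."""
    return accuracy(candidate, golden) >= accuracy(baseline, golden) - max_drop


# Hypothetical golden set and stub models standing in for real inference calls.
GOLDEN = [
    GoldenExample("Classify: invoice overdue 30 days", "collections"),
    GoldenExample("Classify: password reset request", "it_support"),
    GoldenExample("Classify: contract renewal query", "legal"),
    GoldenExample("Classify: new starter laptop", "it_support"),
]


def make_stub_model(answers: dict[str, str]) -> Callable[[str], str]:
    return lambda prompt: answers.get(prompt, "")


baseline = make_stub_model({ex.prompt: ex.expected for ex in GOLDEN})
regressed_answers = {ex.prompt: ex.expected for ex in GOLDEN}
regressed_answers[GOLDEN[2].prompt] = "it_support"  # one wrong answer
candidate = make_stub_model(regressed_answers)

print("baseline accuracy: ", accuracy(baseline, GOLDEN))
print("candidate accuracy:", accuracy(candidate, GOLDEN))
print("cutover allowed:   ", passes_regression_gate(baseline, candidate, GOLDEN))
```

Running the same gate on a schedule against fresh production samples, rather than only at upgrade time, is one way to catch drift in the input distribution as well as regressions in the model.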
7. Where Self-Hosting Is the Right Answer
Given all this, the cases where self-hosting is clearly the right answer are specific.
- When data location constraints are absolute. Workflows that cannot send data to any external endpoint must run internally. The cost is the cost of meeting the constraint, and the alternative is not having AI in the workflow at all.
- When the workflow volume justifies the fixed cost. Very high request volumes — millions of requests per day at sustained scale — amortise the infrastructure and people costs across enough work to come out ahead of per-token billing. The break-even point is further out than most enterprises think, but for some workflows it's clearly inside the planning horizon.
- When the strategic case is explicit and funded. Building internal AI capability as a long-term investment is defensible. The decision should be made with full visibility into the multi-year cost, not as a side effect of "we'll save on API fees."
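For the volume case, the break-even point can be estimated with a short calculation. The sketch below treats the self-hosted fixed cost (infrastructure plus team) as constant, which understates it, because capacity also has to grow with volume. All prices and costs are hypothetical placeholders.

```python
import math


def break_even_requests_per_month(fixed_monthly_cost: float,
                                  api_cost_per_request: float,
                                  self_hosted_marginal_cost_per_request: float) -> float:
    """Monthly volume above which self-hosting costs less than per-token billing."""
    saving_per_request = api_cost_per_request - self_hosted_marginal_cost_per_request
    if saving_per_request <= 0:
        return math.inf  # self-hosting never breaks even on cost alone
    return fixed_monthly_cost / saving_per_request


# Hypothetical inputs: 1,500 tokens per request at an assumed blended API
# price of 5.00 per million tokens gives 0.0075 per request.
monthly = break_even_requests_per_month(
    fixed_monthly_cost=240_000,                    # infrastructure plus team, assumed
    api_cost_per_request=0.0075,
    self_hosted_marginal_cost_per_request=0.0015,  # power and incremental capacity, assumed
)
print(f"Break-even: {monthly:,.0f} requests/month (~{monthly / 30:,.0f} per day)")
```

With these placeholder figures the break-even lands at roughly forty million requests per month, which is well over a million per day. Different inputs move the point considerably, which is the reason to run the calculation with real numbers before committing.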
8. Where Self-Hosting Is the Wrong Answer
The cases where self-hosting is the wrong answer are also specific.
- When the data location constraint can be addressed by transformation. For workflows where the data can stay in the EU region through architectural means — encapsulation, tokenisation, customer-controlled mapping — the external endpoint with appropriate safeguards is a faster, cheaper, more capable solution. The cost analysis usually doesn't favour self-hosting for these workflows.
- When the workflow needs frontier capability. An internal model typically lags the frontier external model on the dimensions that matter most for complex reasoning, long context handling, and unfamiliar document types. Some workflows can live with the gap; some can't. Self-hosting the workflows that can't tolerate it produces a system that works but underperforms.
- When the cost projection didn't include the full team. A self-hosting program funded only for infrastructure ends up either limping along understaffed or quietly absorbing the budget that was meant for other work. Either way, the actual cost catches up with the optimistic projection.
9. The Pattern Most Enterprises End Up With
Across enterprises that have worked through this, the deployment that emerges usually isn't pure self-hosting. It's a hybrid where the self-hosted infrastructure handles the workflows that genuinely require it — the absolute-constraint cases, the strategic-investment cases, the volume cases where the math works out — and external endpoints with transformation handle the rest.
The hybrid pattern isn't a compromise. It's the architecture that matches the cost profile of each workflow category to its actual constraint. Workflows that need self-hosting get self-hosting. Workflows that don't need it get external endpoints with safeguards. The team operates one routing layer, one governance framework, and two backends, at a substantially lower total cost than running one backend that has to handle everything.
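The routing decision behind a hybrid deployment can be sketched as a small policy function. This is an illustration of the idea rather than a description of any particular product: the `Workflow` fields, the backend names, and the volume threshold are all hypothetical, and a real routing layer would also handle authentication, logging, and fallback.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SELF_HOSTED = "self_hosted"
    EXTERNAL_WITH_TRANSFORMATION = "external_with_transformation"


@dataclass(frozen=True)
class Workflow:
    name: str
    data_must_stay_internal: bool        # absolute data location constraint
    strategic_internal_capability: bool  # explicitly funded strategic case
    requests_per_day: int


def route(workflow: Workflow, volume_break_even_per_day: int) -> Backend:
    """Send a workflow to the self-hosted backend only when it requires it."""
    if workflow.data_must_stay_internal:
        return Backend.SELF_HOSTED
    if workflow.strategic_internal_capability:
        return Backend.SELF_HOSTED
    if workflow.requests_per_day >= volume_break_even_per_day:
        return Backend.SELF_HOSTED
    return Backend.EXTERNAL_WITH_TRANSFORMATION


# Hypothetical workflow portfolio and threshold.
portfolio = [
    Workflow("lawful-intercept-triage", True, False, 2_000),
    Workflow("contract-summaries", False, False, 5_000),
    Workflow("support-ticket-tagging", False, False, 3_000_000),
]
for wf in portfolio:
    print(f"{wf.name:<26} -> {route(wf, volume_break_even_per_day=1_000_000).value}")
```

Keeping the policy in one function like this is what lets a single governance framework cover both backends: the rule for each workflow category is written down once and can be reviewed.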
For the broader argument about why hybrid is the architecturally honest answer rather than a political compromise, see the pillar on where to run enterprise AI. For the routing layer that makes hybrid actually work, see the article on routing AI workflows between cloud and local models. For the workflows at the strictest end of the self-hosting case — where no external endpoint is acceptable — see the article on when AI must run without network access.
Key Takeaways
- The model itself is free — the surprises live in every other layer of the cost structure
- Infrastructure (GPU + serving stack + monitoring) is real but typically budgeted reasonably
- People are the layer that most consistently breaks the projection — 6–10 engineers minimum, 15–20 for larger deployments, exceeding infrastructure cost by 2–3× over three years
- Time compounds as opportunity cost — 9–18 months from decision to production AI, during which the workflows wait and the external frontier advances
- Maintenance doesn't end — model updates, quality drift, infrastructure patches, workflow evolution accumulate as sustained team burn
- Right when: constraints are absolute · volume justifies the fixed cost · the strategic case is explicit and funded
- Wrong when: transformation could address the constraint · the workflow needs frontier capability · the projection didn't include the full team
- The pattern most enterprises end up with is hybrid — self-hosted for the workflows that need it, external endpoints with transformation for the rest, under one routing layer and one governance framework
Frequently Asked Questions
Isn't self-hosted AI cheaper than paying per-token for external APIs?
Only if it still comes out ahead once the calculation includes everything, not just the model. The model itself is free: Llama, Mistral, Qwen, and their successors are released under enterprise-friendly licences with no per-token fees. Where the calculation usually breaks is everything else: GPU infrastructure (budgeted reasonably), the team to operate it (underbudgeted), the time between deciding and being in production (compounds as opportunity cost), and ongoing maintenance (doesn't end). Over a three-year horizon, the people cost typically exceeds the infrastructure cost by a factor of two or three. The break-even point versus API billing is further out than most enterprises think.
How big does the team need to be to run a serious internal LLM?
For a few business units, six to ten people. For a larger deployment serving multiple business units across a complex enterprise, fifteen to twenty. The composition is ML platform engineers (2–4), ML operations engineers (2–3), evaluation engineers (1–2), workflow integration engineers (2–3 per major workflow domain), and technical leadership (at minimum one senior engineer). These are small headcounts but expensive burn rates — ML engineers with the right skills earn well above the general engineering market. The team cost is what most projections under-budget.
When is self-hosting actually the right answer?
Three specific cases. First, when data location constraints are absolute — workflows that cannot send data to any external endpoint regardless of safeguards must run internally; the cost is the cost of meeting the constraint. Second, when workflow volume justifies the fixed cost — sustained millions of requests per day amortise infrastructure and team costs across enough work to come out ahead of per-token billing. Third, when the strategic case is explicit and funded — building internal AI capability as a long-term investment, with full visibility into the multi-year cost rather than as a side effect of "we'll save on API fees."
When is self-hosting the wrong answer?
Three specific cases. When the data location constraint can be addressed by transformation — for workflows where data can stay in the EU region through encapsulation, tokenisation, and customer-controlled mapping, the external endpoint with appropriate safeguards is faster, cheaper, and more capable. When the workflow needs frontier capability — internal models lag the frontier on complex reasoning, long context, and unfamiliar document types; self-hosting workflows that can't tolerate this gap produces systems that work but underperform. When the cost projection didn't include the full team — programs funded only for infrastructure end up understaffed or quietly absorbing budgets meant for other work.
What's the realistic timeline from deciding to self-host to running production AI?
Nine to eighteen months for most enterprises. Procurement takes weeks, initial serving-stack deployment takes more weeks, getting the first workflow integrated and useful takes months, and operational maturity to run multiple workflows reliably takes longer still. During this window the enterprise has no AI in the workflows the project is meant to serve, or it has AI through the external endpoints the project was meant to replace. The opportunity cost is real, and during the same window the external LLM landscape continues to advance — the internal model that finally goes into production may already be behind.
Does the cost stop once the system is running?
No — it shifts categories. Model updates (new open-source releases every few months requiring evaluation, migration, parallel evaluation, cutover). Quality monitoring (workflows evolve, input distributions shift, evaluation infrastructure has to catch drift). Infrastructure updates (serving stack updates, GPU driver changes, security patches, cloud deprecations). Workflow evolution (new prompts to develop, old prompts to deprecate, integrations to update). This is the cost hardest to project because it doesn't appear in the initial budget — it shows up over multi-year horizons as the team's sustained burn against a growing workflow portfolio.
What pattern do most enterprises end up with?
Hybrid — not as a political compromise but as the architecture that matches each workflow category's cost profile to its actual constraint. Self-hosted infrastructure handles workflows that genuinely require it (absolute-constraint cases, strategic-investment cases, volume cases where the math works out). External endpoints with transformation handle the rest. The team operates one routing layer, one governance framework, and two backends — at substantially lower total cost than running one backend that has to handle everything.