July 1, 2026
TechRevati

Private LLM inference for regulated EU teams: self-host, EU-region API, or US-hosted?

A decision-first, tradeoffs-honest guide for engineering and platform leads: when to self-host private LLM inference, when an EU-region API is enough, and when US-hosted is fine — scored by data sensitivity, the AI Act, DORA and GDPR, latency, and real GPU cost — plus the reference stack and the one design decision that keeps every option open.

ai
data-sovereignty
compliance
dora

Self-host, EU-region API, or US-hosted — which private LLM inference model fits a regulated EU team?

There are three real ways to run an LLM behind a regulated EU workload, and they sit on a spectrum of control versus operational burden. Self-hosted inference — open weights running on your own hardware or a single-tenant VPC, on-prem or air-gapped — gives you maximum custody and maximum ops cost. An EU-region managed API — a hosted model with contractual EU data residency — gives you most of the residency benefit with none of the GPU operations. A US-hosted API gives you the broadest model choice and the lowest ops burden, but turns every call into a cross-border transfer you have to justify. This is the framework we use at TechRevati to choose deliberately per workload rather than defaulting an entire platform to one answer.

This is engineering and practitioner guidance grounded in production delivery of retrieval and inference systems. It is not legal advice and it is not a compliance guarantee. Accountability for your GDPR transfers, your AI Act obligations, and your DORA register stays with your organisation and your own counsel. A vendor supplies facts about where and how a system runs; your firm carries the obligation. Verify everything here against the official texts and your DPO.

One clarification up front. The deployment axis in this post (where and how the model is hosted) is separate from the provider axis (whose weights you run). We covered the provider question, EU-origin models like Mistral versus US frontier models, in our post on the EU-sovereign AI stack. Here we hold the model aside and decide hosting: the same model can often be reached as a US API, an EU-region API, or self-hosted, and those three have very different compliance and cost profiles.

The four factors that actually decide it

Score the workload against these before you pick — per workload, not per company. A document-drafting assistant and a fraud-triage model do not get the same answer.

1. Data sensitivity: what actually enters the prompt?

Start with the data, because it caps your options. An LLM workload moves data in three places, each needing a residency answer: the prompt (the user query plus retrieved context — often the most sensitive payload), the embeddings (vector representations that are not reliably anonymised — they can be partially inverted toward source text, so treat them as derived personal data), and the logs and telemetry a provider retains, sometimes for a window you don't control.

Special-category or regulated data — health, financial records, biometric, public-sector, anything under legal privilege: this is where self-hosting or a strictly-scoped EU path earns its keep. We have run this end to end — an AI Nexus ITSM deployment on-prem with zero egress, and a fully-local Ollama-plus-ServiceNow assistant where Microsoft Defender data never left the customer perimeter — precisely because the data could not leave.
Confidential business data, pseudonymised or low-volume personal data: an EU-region API with residency and a data-processing agreement is usually proportionate.
Public, synthetic, or non-personal data: US-hosted is fine. Don't over-engineer sovereignty for content that carries no obligation — that's cost with no return.

2. Regulation: AI Act, DORA, GDPR

Three regimes bite differently, and none is satisfied by hosting choice alone.

GDPR international transfers. Sending personal data to a US service is a restricted transfer. Post-Schrems II it rests on a lawful mechanism: Standard Contractual Clauses plus a transfer impact assessment, or reliance on the EU–US Data Privacy Framework where the provider is certified. The DPF's long-term durability is itself contested, so a transfer strategy that rests on it alone carries risk. A US-hosted path is not unlawful; it is conditional: you must perform, document, and maintain that assessment. An EU-resident path removes the cross-border transfer itself, so there is no transfer assessment to defend, provided logs, telemetry, and support access stay in-region as well. For many regulated buyers, "no transfer to assess" beats a few points of model quality.
EU AI Act. The Act regulates use and documentation, not where the weights sit. A US model used carefully can be compliant; an EU model used carelessly won't be. But self-hostable EU infrastructure makes the data-governance, logging, and record-keeping duties for high-risk systems far easier — you hold the logs directly. It reduces friction; it does not grant compliance.
DORA. If you supply ICT to an EU financial entity, your inference endpoint and vector store are ICT and land in the Register of Information — and a single-vendor, single-region LLM dependency is the textbook concentration risk DORA asks you to surface. Self-hosting or an abstracted, substitutable inference boundary is how you answer the "actual processing region," "subcontractors," and "exit plan" rows. We went field-by-field on this in the DORA build checklist.

3. Latency

Physics is a real input. An EU-hosted model serving EU users avoids the transatlantic round-trip, and co-locating the model with your vector store in one EU region is usually the fastest path for EU traffic. A US API adds cross-Atlantic hops to every call — tolerable for async or batch work, noticeable in interactive chat or an agent loop making many sequential calls. Self-hosting near your users can be the lowest-latency option of all, provided you've sized capacity so requests aren't queuing behind a GPU shortage. An under-provisioned self-host is slower than any hosted API, not faster.
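The difference between one chat turn and an agent loop is simple multiplication, and a few lines make it concrete. This is a minimal sketch: the 80 ms extra round-trip and the 12-call loop are hypothetical placeholders, not measurements, and the function counts network overhead only (no queueing, no generation time).

```python
def added_latency_ms(sequential_calls: int, extra_rtt_ms: float) -> float:
    """Extra wall-clock time caused by network round-trips alone.

    Only sequential calls add up; parallel calls overlap and are not modelled.
    """
    if sequential_calls < 0 or extra_rtt_ms < 0:
        raise ValueError("inputs must be non-negative")
    return sequential_calls * extra_rtt_ms


# Hypothetical figure: extra round-trip of a transatlantic path versus an
# in-region one. Measure your own value before relying on it.
EXTRA_RTT_MS = 80.0

single_chat_turn = added_latency_ms(1, EXTRA_RTT_MS)
agent_loop = added_latency_ms(12, EXTRA_RTT_MS)  # 12 sequential model calls

print(f"one call:   +{single_chat_turn:.0f} ms")
print(f"agent loop: +{agent_loop:.0f} ms")
```

Replace both constants with traced values from your own workload; the number of sequential calls per user action usually matters more than the per-call figure.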

4. Cost

Be honest about the shape of the bill. Hosted APIs (US or EU-region) price per token with zero ops burden and win decisively at low or spiky volume — you pay only for what you use, and there's no idle GPU. Self-hosting trades per-token fees for GPU capital or reserved-instance cost, plus the real operational load: serving, autoscaling, patching, capacity planning, on-call, and the MLOps headcount to run it. It wins economically only at sustained high utilisation, and only if you actually have the team. A GPU idling overnight is pure loss; an API you simply don't call is free. Model your real duty cycle first — most teams overestimate utilisation and underestimate the ops line.
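One way to model the duty cycle is to compute the monthly token volume at which a fixed self-hosting bill equals the per-token API bill, then ask what utilisation that volume implies. The sketch below assumes a flat blended per-token price and a single fixed monthly cost; every number in it (price, GPU cost, ops cost, throughput) is a hypothetical placeholder to be replaced with your own quotes and benchmarks.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # simplifying assumption: 30-day month


def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly tokens at which the API bill equals the fixed self-host bill."""
    if api_price_per_million <= 0:
        raise ValueError("API price must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000


def required_utilisation(tokens_per_month: float, peak_tokens_per_second: float) -> float:
    """Fraction of the month the deployment must run at peak throughput."""
    if peak_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return tokens_per_month / (peak_tokens_per_second * SECONDS_PER_MONTH)


# Hypothetical inputs, not vendor prices.
GPU_MONTHLY = 3000.0            # reserved GPU capacity
OPS_MONTHLY = 5000.0            # share of on-call and MLOps time
API_PRICE_PER_MILLION = 2.0     # blended input/output price
PEAK_TOKENS_PER_SECOND = 2000.0 # benchmarked throughput of the deployment

tokens = break_even_tokens(GPU_MONTHLY + OPS_MONTHLY, API_PRICE_PER_MILLION)
utilisation = required_utilisation(tokens, PEAK_TOKENS_PER_SECOND)

print(f"break-even volume: {tokens:,.0f} tokens/month")
print(f"required utilisation: {utilisation:.0%}")
```

If the required utilisation comes out above what your traffic can sustain around the clock, the hosted API is the cheaper option on these assumptions, whatever the per-token comparison looks like at peak.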

The decision table

If the workload is…	Lean toward	Why
Special-category / regulated data, or a contractual residency/air-gap requirement	Self-hosted (on-prem / single-tenant EU / air-gapped)	Direct custody of prompts, embeddings, logs; no transfer to defend; clean DORA exit story
Confidential or low-volume personal data, no ops team to spare	EU-region managed API	Residency + DPA without GPU ops; proportionate to the risk
Public / synthetic / non-personal data	US-hosted API	Broadest model choice, lowest ops burden, no restricted transfer of personal data triggered
High sustained inference volume with in-house MLOps	Self-hosted EU	Per-token economics flip in your favour at scale
Low or spiky volume, no MLOps capacity	Managed API (EU-region if any personal data)	Pay-per-use, no idle GPU, no on-call
Interactive / agentic, EU users, latency-sensitive	EU-region or self-hosted in-region	Avoids transatlantic round-trips per call
Mixed platform (most real estates)	Hybrid, routed by data class	Sovereign default for regulated paths, hosted API for the rest
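The table can be recorded as a small function so that each workload's choice is written down rather than implied. This is a sketch under stated assumptions: the `Workload` fields, the returned labels, and above all the order in which the rows are checked are illustrative choices of ours, since the table lists leanings and does not define a precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-workload facts; field names are illustrative."""
    regulated_or_residency_bound: bool = False  # special-category data or contractual air-gap
    personal_or_confidential: bool = False
    high_sustained_volume: bool = False
    has_mlops_team: bool = False
    latency_sensitive_eu_users: bool = False


def lean_toward(w: Workload) -> str:
    """Return the hosting option the table leans toward.

    Assumed precedence: data constraints first, then economics, then latency.
    """
    if w.regulated_or_residency_bound:
        return "self-hosted"
    if w.high_sustained_volume and w.has_mlops_team:
        return "self-hosted"
    if w.personal_or_confidential or w.latency_sensitive_eu_users:
        return "eu-region-api"
    return "us-hosted-api"


fraud_triage = Workload(regulated_or_residency_bound=True)
drafting_assistant = Workload(personal_or_confidential=True)
public_faq_bot = Workload()

for name, w in [("fraud triage", fraud_triage),
                ("drafting assistant", drafting_assistant),
                ("public FAQ bot", public_faq_bot)]:
    print(f"{name}: {lean_toward(w)}")
```

A mixed platform is then a list of workloads with different answers, which is the hybrid row of the table.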

What the self-hosted stack actually is

If you land on self-hosting, the reference architecture is five layers you own end to end.

Serving runtime. vLLM is our default for production, multi-user inference (PagedAttention plus continuous batching, OpenAI-compatible API); TGI is a fair alternative; Ollama is right for developer workstations, low-volume internal tools, and air-gapped edge boxes, not high-concurrency load.
Open weights. Models you can pull, pin, and run inside your boundary. Mistral is our default (capable, self-hostable, EU-origin), with Llama and Qwen also viable. Licences differ (Apache-2.0 for the open Mistral models, a community licence for Llama), so check the terms before you ship. Size the model to the task, not the leaderboard.
Vector store. Qdrant, self-hostable and EU-resident, co-located with inference and using payload filtering to enforce tenant boundaries at query time.
Gateway. A gateway speaking the OpenAI-compatible API is the highest-leverage piece: it gives you substitutability (a DORA concentration-risk mitigation), sensitivity-based routing in code, one place for authz and rate limits, and a single cost chokepoint.
Observability. Wired twice: for ops (latency, throughput, GPU utilisation, queue depth) and for audit (structured logs of inputs, model and version, retrieval sources, outputs, with defensible retention). Because you host it, those logs are EU-resident by construction.
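The audit half of observability comes down to one structured record per inference call. The sketch below shows one possible shape, using only the Python standard library; the `AuditRecord` type, its field names, and the example values are hypothetical, and a real deployment would add retention and access controls that are out of scope here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    """One inference call, captured for audit. Field names are illustrative."""
    request_id: str
    timestamp: str            # UTC, ISO 8601
    model: str
    model_version: str        # pinned version or weights digest
    retrieval_sources: List[str]
    prompt: str
    output: str


def make_audit_record(request_id: str, model: str, model_version: str,
                      prompt: str, output: str,
                      retrieval_sources: List[str]) -> AuditRecord:
    return AuditRecord(
        request_id=request_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        model_version=model_version,
        retrieval_sources=list(retrieval_sources),
        prompt=prompt,
        output=output,
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise as one JSON object per line, with stable key order."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)


record = make_audit_record(
    request_id="req-0001",                 # hypothetical values throughout
    model="example-open-weights-model",
    model_version="2026-06-pinned",
    prompt="Summarise ticket 123",
    output="Ticket 123 concerns ...",
    retrieval_sources=["kb/article-42"],
)
print(to_log_line(record))
```

Because the record holds raw prompts and outputs, it is as sensitive as the workload itself; some teams may prefer to store digests or redacted text in the log and keep the raw payload in a separate, access-controlled store.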

The same stack lands three ways: on-prem GPU (maximum control, you own the hardware and the ops), EU-sovereign cloud (your VPC in an EU region — verify logging and telemetry are regionalised too, not just inference), and air-gapped (no egress; models pulled once through a controlled channel, updates via a reviewed artefact pipeline). We have run the fully-local end of this in production, so the air-gapped topology is a delivered pattern, not a slide.

Our recommendation: route by data class, don't pick one answer

Almost no real platform is monolithic. The pragmatic architecture is a hybrid routed by data classification: default regulated and data-sensitive paths to a self-hostable, EU-resident deployment; send confidential-but-lower-risk traffic to an EU-region managed API; let genuinely non-sensitive or public workloads use whatever hosted model is best and cheapest. Make the routing explicit in code and legible in your data map — sensitivity-based routing that emerges by accident is exactly what fails an audit.

The single decision that makes all of this reversible is to abstract the inference boundary from day one. If your application talks to a thin internal gateway rather than hard-wiring one vendor's SDK, then "self-host, EU-region, or US-hosted" becomes a routing and configuration choice per workload. Encode the rule that a self-hosted primary fails over to a US API only when the data class permits it — don't rely on an operator remembering it at 3 a.m. Add structured logging and pinned model provenance, and your DORA substitutability and exit story becomes a deployment you already run, not a diagram you hope holds. Budget, too, for a golden-set evaluation you can re-run on any candidate model — self-hosting means you own the upgrade, and without a regression suite every new set of weights is a gamble you discover in production.
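The failover rule above can live in the gateway as data rather than in a runbook. This is a minimal sketch, not a production gateway: the backend names, the `ALLOWED_BACKENDS` table, and the stub backends are all hypothetical, and real backends would be HTTP clients for OpenAI-compatible endpoints.

```python
from enum import Enum
from typing import Callable, Dict, List


class DataClass(Enum):
    REGULATED = "regulated"
    CONFIDENTIAL = "confidential"
    PUBLIC = "public"


# Ordered failover chain per data class. A backend that is not listed for a
# class is never tried for it, whatever the outage. (Illustrative policy.)
ALLOWED_BACKENDS: Dict[DataClass, List[str]] = {
    DataClass.REGULATED: ["self-hosted-eu"],
    DataClass.CONFIDENTIAL: ["self-hosted-eu", "eu-region-api"],
    DataClass.PUBLIC: ["self-hosted-eu", "us-api"],
}

Backend = Callable[[str], str]


class NoPermittedBackend(RuntimeError):
    """Every backend permitted for this data class failed."""


def complete(prompt: str, data_class: DataClass, backends: Dict[str, Backend]) -> str:
    errors = []
    for name in ALLOWED_BACKENDS[data_class]:
        try:
            return backends[name](prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise NoPermittedBackend(f"{data_class.value}: " + "; ".join(errors))


def _self_hosted_down(prompt: str) -> str:
    raise ConnectionError("self-hosted endpoint unreachable")


# Stub backends simulating an outage of the self-hosted primary.
backends: Dict[str, Backend] = {
    "self-hosted-eu": _self_hosted_down,
    "eu-region-api": lambda p: f"eu-region-api: {p}",
    "us-api": lambda p: f"us-api: {p}",
}

print(complete("hello", DataClass.PUBLIC, backends))        # served by the US API
print(complete("hello", DataClass.CONFIDENTIAL, backends))  # stays in the EU region
try:
    complete("hello", DataClass.REGULATED, backends)
except NoPermittedBackend as exc:
    print(f"refused: {exc}")                                 # fails closed
```

With the primary down, regulated traffic fails rather than leaving the boundary; an error a caller can retry is easier to defend than a silent cross-border call.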

The honest limits

Self-hosting is not a free win. You take on GPU cost, capacity planning, an upgrade treadmill, and an ops burden a hosted API simply doesn't have. At the very top end, US frontier models still tend to lead on the hardest reasoning, agentic tool use, and long-context tasks. The gap has narrowed sharply, and for retrieval-grounded answering, extraction, and classification a self-hostable, EU-resident model is more than adequate (there, retrieval quality is the bottleneck, not model size), but the frontier is real. US-hosted inference remains a valid, lawful choice for a large share of workloads. The point of this framework is not to push everything on-prem; it's to make the choice deliberate and recorded rather than defaulted. Sovereignty you don't need is just cost.

Where TechRevati fits

If you're mapping these choices against a live or planned system, our security overview covers how we handle isolation, logging, and the data boundaries this framework depends on — including on-prem and zero-egress deployments. The compliance overview maps each AI Act, DORA, and GDPR obligation to evidence we deliver rather than a promise. And the Sovereign RAG Pilot is a scoped, single-tenant way to stand up this exact stack — a self-hostable serving runtime over open weights, Qdrant for retrieval, EU-resident — against your own residency, latency, and concentration-risk numbers, so the tradeoffs in this post become measurements instead of estimates, with the audit trail produced as a by-product. Reach us at hello@techrevati.com.