Cloud vs Iron: Where to Run Your SOTA Models in 2026

Should you rent GPUs or buy your own? A practical engineer's guide to choosing between cloud and on-premise infrastructure for running state-of-the-art open-weight LLMs in 2026 — with pricing tables, inference stack benchmarks, and a decision matrix.

Nacio-Félix Laubressac

Jun 01, 2026

11 min

cloud vs iron where to run sota models 2026

Your team just shipped an AI feature. Users love it. Traffic is climbing. The OpenAI bill is eating your runway. Someone asks the question that every AI team eventually faces: “Should we run this ourselves?”

This is the guide that answers it --- not with theory, but with real numbers, real hardware, and a decision framework you can use today.

TL;DR

Cloud wins for most teams. If your GPU utilization sits below 70% (and it probably does), renting is cheaper and simpler.
Specialist GPU clouds are 40—85% cheaper than AWS/GCP/Azure. Use CoreWeave, Lambda Labs, RunPod, or Spheron instead of hyperscalers unless you need specific compliance features.
Start with vLLM. It has the widest model support, no compilation step, and competitive throughput. Move to TensorRT-LLM only when your model is stable and you need maximum performance.
H100 SXM5 is the practical default GPU in 2026. The B200 is faster but supply-constrained.
On-premise breaks even at 80%+ sustained utilization over a 3-year horizon --- or in under 4 months for high-utilization workloads.

The Scenario

Imagine a ten-person AI startup in mid-2026. They have been running Qwen3.5-35B behind a chatbot API, paying per token to a model-as-a-service provider. Their monthly inference bill just crossed $8,000. The CFO wants to cut costs. The CTO wants to own the infrastructure. The engineering team wants to stop worrying about rate limits and data leaving their VPC.

They need to decide: rent GPUs from a cloud provider, or buy their own hardware?

This is the decision we will walk through. Along the way, we will cover which GPU to pick, which inference engine to use, which cloud provider offers the best deal, and which open-weight models are actually worth self-hosting right now.

The Utilization Rule

Before we compare prices or hardware, there is one number that matters more than anything else: your GPU utilization rate.

The research is clear:

Utilization	Winner	Why
Under 70%	Cloud	You pay only for what you use. Idle hardware is wasted money.
70—80%	Toss-up	Depends on time horizon, compliance needs, and team expertise.
80%+ sustained	On-premise	Hardware pays for itself. Up to 18x cost advantage per million tokens over MaaS APIs over 5 years.

Most production teams operate at 40—65% utilization due to traffic variability. This means cloud wins for the majority of use cases. But if you are running batch inference jobs 24/7 or serving a high-traffic API with predictable load, owning hardware becomes compelling.

The break-even math is surprisingly aggressive for high-utilization workloads: on-premise infrastructure can break even in under 4 months compared to hyperscaler pricing, according to Lenovo’s 2026 TCO whitepaper.

Sources: Spheron — On-Premise vs Cloud, April 2026 | Lenovo Press — TCO Whitepaper, 2026

Which GPU in 2026?

The GPU market in 2026 is dominated by NVIDIA’s Hopper and Blackwell architectures. Here is the landscape:

GPU	VRAM	Bandwidth	Key Advantage	Status
NVIDIA B200	192 GB HBM3e	8 TB/s	~4x H100 inference throughput	Constrained supply, waitlists
NVIDIA H200	141 GB HBM3e	4.8 TB/s	H100 upgrade, strong availability	4—8 week lead time
NVIDIA H100 SXM5	80 GB HBM3	3.35 TB/s	Production workhorse, best availability	Standard choice in 2026
NVIDIA A100	80 GB HBM2e	2 TB/s	Solid mid-range, falling prices	Under $1/hr spot on some clouds
RTX 4090 (consumer)	24 GB GDDR6X	1 TB/s	Best FLOPS/$ for small models	Local/hobbyist use only

Practical recommendations:

Default choice: H100 SXM5. Best availability, proven in production, supported by every inference engine.
Future-proofing: B200, if you can get one. It delivers roughly 4x the inference throughput of an H100, but supply remains constrained through mid-2026.
Budget option: A100 spot instances. Performance is solid and prices are falling below $1/hr on some providers.
Local dev: RTX 4090. Incredible FLOPS-per-dollar for models under 24 GB, but not suitable for production serving.

The buy vs rent math for one H100: Buying an H100 costs approximately $30,000. Renting one at $5/hr means break-even at roughly 6,000 hours --- about 3.4 years at 5 hours per day --- before accounting for power, cooling, and facilities. This is why utilization matters so much.

Sources: Inworld AI — GPU Cloud Comparison, April 2026 | WhiteFiber — GPU Guide

Cloud Providers: Who Has the Best GPU Pricing?

This is where the market gets interesting. Specialist GPU cloud providers are 40—85% cheaper than the hyperscalers for equivalent GPU workloads. AWS, GCP, and Azure add egress fees, storage premiums, and networking charges that inflate the real cost significantly.

Provider	GPUs Available	Best For	Notes
CoreWeave	H100, B200, A100	Large training, enterprise	Kubernetes-native, HIPAA/SOC2/ISO27001
Lambda Labs	H100, B200	Researchers, flexible inference	Transparent pricing, strong deep learning images
RunPod	H100, A100, RTX	Cost-conscious teams	Spot pricing, wide GPU range
Spheron	H100 SXM5	Cheapest spot pricing	H100 spot from $1.03/hr (May 2026)
AWS (p5)	H100	Regulated orgs, existing cloud footprint	$4.10—6.88/hr/GPU on-demand
GCP (A3 High)	H100	GKE/Vertex AI integration	$11—16/hr/GPU --- most expensive
Azure	H100, B200	Enterprise governance, BYOK	Quantum-2 InfiniBand, 3.2 Tbps

Pricing Snapshot (May—June 2026)

GPU	Provider	On-Demand (/hr)	Spot (/hr)	Est. Cost/M Tokens
H100 SXM5	Spheron	~$2.90	$1.03	$0.47
H100 SXM5	AWS (p5)	$4.10—6.88	Varies	---
B200	Various	~$4—5	$2.12	$0.42
A100	Various	~$1.50—2.00	Under $1.00	$0.57

The takeaway: If you are paying AWS or GCP prices for GPU inference, you are overpaying. Switch to a specialist provider and your costs drop by half or more. Use hyperscalers only when you have specific compliance requirements (HIPAA, FedRAMP) or deep integration with their ecosystem (GKE, SageMaker).

Sources: Spheron — GPU Pricing, May 2026 | Inworld AI — GPU Cloud Comparison, April 2026

The Inference Stack: vLLM vs SGLang vs TensorRT-LLM

Choosing the right GPU is only half the battle. The software stack that serves your model determines your actual throughput, latency, and operational complexity. In 2026, the inference engine landscape has consolidated around three serious contenders.

The 2026 Recommendation Hierarchy

Start with vLLM --- widest model support, no compilation step, competitive throughput, best documentation. This is the safe default for any team.
Move to TensorRT-LLM --- when your model is stable for months and you need maximum throughput. Expect a 20—30% throughput gain over vLLM, but a 28-minute compilation step that must be rerun whenever the model changes.
Use SGLang --- when individual request latency matters most (real-time chat, structured output generation, shared-prefix workloads). It excels at minimizing time-to-first-token.
Avoid TGI (HuggingFace) --- officially in maintenance mode. HuggingFace themselves recommend migrating to vLLM or SGLang.

Benchmark Comparison

Engine	Throughput	Latency (TTFT)	Model Support	Compilation	Best For
vLLM	4/5	3/5	5/5	None	General production, wide model coverage
SGLang	3/5	5/5	4/5	None	Real-time chat, structured generation
TensorRT-LLM	5/5	4/5	3/5	~28 min	Maximum throughput, stable models
llama.cpp	2/5	3/5	5/5	None	CPU, local, edge, consumer GPU
Ollama	2/5	3/5	4/5	None	Local development only --- not for production
TGI	3/5	3/5	4/5	None	Maintenance mode --- migrate away

Benchmark note: On H100 SXM5, TensorRT-LLM delivers the best throughput and latency but requires compilation. On the Blackwell B200, TensorRT-LLM reaches 1,000 tokens per second per user on Llama 4 Maverick --- a staggering number that makes it the clear choice for high-volume production workloads on next-gen hardware.

Sources: Spheron — Inference Benchmarks, March 2026 | The AI Engineer — Inference Showdown, March 2026

Models Worth Self-Hosting in 2026

Not every model justifies the infrastructure investment. These are the open-weight models that make self-hosting worthwhile in 2026:

Model	Params (Total / Active)	Context Window	Why Self-Host?
Qwen3-Coder-30B-A3B	30B / 3B MoE	256K	Best coding MoE for 32—64 GB hardware
Qwen3.5-35B-A3B	35B / 3B MoE	262K	Best general-purpose MoE at this size
Qwen3.5-122B-A10B	122B / 10B MoE	128K	Near-frontier performance, needs 64—96 GB
Nemotron-30B (Omni)	30B / 3B MoE	1M	Best multimodal + speed; Mamba2 hybrid
DeepSeek V4 Pro	Large MoE	---	High Intelligence Index score
Kimi K2.6	MoE	---	Top open-weight Intelligence Index (54)

Why these models specifically? They are Mixture-of-Experts architectures, which means they activate only a fraction of their total parameters during inference. A 35B MoE model with 3B active parameters runs nearly as fast as a dense 3B model while delivering quality closer to a 30B+ dense model. This makes them incredibly efficient to self-host --- you get frontier-tier quality at a fraction of the compute cost.

The Nemotron-30B Omni deserves special mention: its Mamba2 hybrid architecture and 1M context window make it uniquely suited for multimodal workloads (text, image, audio) on a single GPU.

The Decision Matrix

Here is the consolidated decision framework based on your specific scenario:

Scenario	Recommendation
Early-stage / variable load	Cloud spot (RunPod, Spheron) + vLLM
Production, growing traffic	CoreWeave or Lambda H100/B200 + vLLM
Max performance, stable model	TensorRT-LLM or NVIDIA NIM container
High utilization + data residency	Own H100/B200 servers + vLLM
Multimodal / omni needs	Nemotron-30B Omni on H100 cloud
Apple Silicon / on-device	MLX + Qwen3-Coder-30B (no cloud needed)
Regulated (HIPAA, ISO27001)	CoreWeave or Azure + BYOK

Back to Our Startup

Let us return to the scenario we opened with. That ten-person startup with the $8,000/month inference bill?

Here is what they would do:

Switch from MaaS to self-hosted inference. Rent an H100 SXM5 on Spheron at $1.03/hr spot (or $2.90 on-demand). Run Qwen3.5-35B-A3B through vLLM. Their per-token cost drops from the API’s premium pricing to roughly $0.47 per million tokens.
Use a specialist GPU cloud, not a hyperscaler. Spheron, RunPod, or Lambda Labs will save them 40—85% compared to AWS or GCP for the same GPU.
Start with vLLM, no optimization needed yet. At their traffic level, vLLM’s throughput on a single H100 is more than sufficient. They can migrate to TensorRT-LLM later if throughput becomes a bottleneck.
Monitor utilization. If they consistently hit 80%+ utilization, they should start planning an on-premise purchase. If they sit at 40—60%, staying on cloud spot instances is the right call.

The result: their monthly inference bill drops from $8,000 to roughly $800—1,500, depending on traffic patterns. That is a 5—10x cost reduction with better latency, full data control, and no rate limits.

Why This Stack?

A few words on the reasoning behind these recommendations, in case you are weighing alternatives:

Why specialist clouds over hyperscalers? AWS, GCP, and Azure charge premium prices for GPU instances because they bundle networking, storage, compliance, and ecosystem lock-in into the price. If you do not need those extras, you are subsidizing features you do not use. Specialist providers give you the same GPU at a fraction of the cost.

Why vLLM as the default? It works with virtually every model architecture out of the box. No compilation step means you can swap models in minutes. The documentation is excellent, the community is active, and the performance is competitive. You only need TensorRT-LLM’s 20—30% throughput gain when you are pushing the limits of your hardware --- and when you are, the 28-minute compilation step is worth it.

Why MoE models? Mixture-of-Experts architectures give you the quality of a large model at the inference cost of a small one. Qwen3.5-35B-A3B activates only 3 billion of its 35 billion parameters per token. This means you get near-frontier performance on hardware that would struggle with a dense 35B model.

Why not just use Ollama for everything? Ollama is fantastic for local development and experimentation. It is not designed for production serving at scale. It lacks the batching, scheduling, and throughput optimizations that vLLM, SGLang, and TensorRT-LLM provide. Use Ollama on your laptop; use vLLM in production.

Sources

#	Source	Date	URL
1	Spheron: LLM On-Premise vs Cloud	April 2026	https://www.spheron.network/blog/llm-inference-on-premise-vs-cloud/
2	Spheron: GPU Cloud Pricing 2026	May 2026	https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/
3	Inworld: Best GPU Cloud for AI Inference	April 2026	https://inworld.ai/resources/best-gpu-cloud-ai-inference
4	Lenovo Press: On-Premise vs Cloud TCO 2026	2026	https://lenovopress.lenovo.com/lp2368-on-premise-vs-cloud-generative-ai-total-cost-of-ownership-2026-edition
5	Spheron: vLLM vs TensorRT vs SGLang	March 2026	https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/
6	The AI Engineer: Inference Engine Showdown	March 2026	https://theaiengineer.substack.com/p/vllm-vs-ollama-vs-sglang-vs-tensorrt
7	WhiteFiber: Best GPUs for LLM Inference	2025/2026	https://www.whitefiber.com/compare/best-gpus-for-llm-inference-in-2025
8	BenchLM: Nemotron vs Qwen3.5-27B	2026	https://benchlm.ai/compare/nemotron-3-nano-omni-30b-a3b-vs-qwen3-5-27b
9	NVIDIA Nemotron 3 Nano Omni Review	April 2026	https://www.buildfastwithai.com/blogs/nvidia-nemotron-3-nano-omni-2026
10	Awesome Agents: Qwen3.5-35B vs Nemotron	Feb 2026	https://awesomeagents.ai/tools/qwen-3-5-35b-a3b-vs-nemotron-3-nano/

The GPU market moves fast. The inference stack evolves faster. But the fundamental tradeoff --- rent vs own, cloud vs metal, flexibility vs control --- is the same decision every AI team faces as they scale. Pick the right tool for your utilization, your timeline, and your constraints. Everything else follows.