Cloud vs Iron: Where to Run Your SOTA Models in 2026
Your team just shipped an AI feature. Users love it. Traffic is climbing. The OpenAI bill is eating your runway. Someone asks the question that every AI team eventually faces: “Should we run this ourselves?”
This is the guide that answers it, not with theory, but with real numbers, real hardware, and a decision framework you can use today.
TL;DR
- Cloud wins for most teams. If your GPU utilization sits below 70% (and it probably does), renting is cheaper and simpler.
- Specialist GPU clouds are 40-85% cheaper than AWS/GCP/Azure. Use CoreWeave, Lambda Labs, RunPod, or Spheron instead of hyperscalers unless you need specific compliance features.
- Start with vLLM. It has the widest model support, no compilation step, and competitive throughput. Move to TensorRT-LLM only when your model is stable and you need maximum performance.
- H100 SXM5 is the practical default GPU in 2026. The B200 is faster but supply-constrained.
- On-premise pays off at 80%+ sustained utilization over a 3-year horizon. Measured against hyperscaler pricing, high-utilization workloads can break even in under 4 months.
The Scenario
Imagine a ten-person AI startup in mid-2026. They have been running Qwen3.5-35B behind a chatbot API, paying per token to a model-as-a-service provider. Their monthly inference bill just crossed $8,000. The CFO wants to cut costs. The CTO wants to own the infrastructure. The engineering team wants to stop worrying about rate limits and data leaving their VPC.
They need to decide: rent GPUs from a cloud provider, or buy their own hardware?
This is the decision we will walk through. Along the way, we will cover which GPU to pick, which inference engine to use, which cloud provider offers the best deal, and which open-weight models are actually worth self-hosting right now.
The Utilization Rule
Before we compare prices or hardware, there is one number that matters more than anything else: your GPU utilization rate.
The research is clear:
| Utilization | Winner | Why |
|---|---|---|
| Under 70% | Cloud | You pay only for what you use. Idle hardware is wasted money. |
| 70-80% | Toss-up | Depends on time horizon, compliance needs, and team expertise. |
| 80%+ sustained | On-premise | Hardware pays for itself. Up to 18x cost advantage per million tokens over MaaS APIs over 5 years. |
Most production teams operate at 40-65% utilization due to traffic variability. This means cloud wins for the majority of use cases. But if you are running batch inference jobs 24/7 or serving a high-traffic API with predictable load, owning hardware becomes compelling.
The break-even math is surprisingly aggressive for high-utilization workloads: on-premise infrastructure can break even in under 4 months compared to hyperscaler pricing, according to Lenovo’s 2026 TCO whitepaper.
Sources: Spheron — On-Premise vs Cloud, April 2026 | Lenovo Press — TCO Whitepaper, 2026
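The utilization rule above can be sketched in a few lines of Python. This is only an illustration: the thresholds come from the table, while the function names and the amortization formula are assumptions made for this example, and the cost helper deliberately ignores power, cooling, and staffing.

```python
def recommend_deployment(utilization: float) -> str:
    """Map sustained GPU utilization (0.0 to 1.0) to the rule-of-thumb table."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    if utilization < 0.70:
        return "cloud"
    if utilization < 0.80:
        return "toss-up"
    return "on-premise"


def owned_cost_per_busy_hour(
    hardware_cost: float, lifetime_years: float, utilization: float
) -> float:
    """Hardware cost spread over the hours the GPU is actually doing work.

    Assumption: straight-line amortization, hardware cost only.
    """
    if utilization <= 0:
        raise ValueError("utilization must be positive")
    total_hours = lifetime_years * 365 * 24
    return hardware_cost / (total_hours * utilization)


# Hypothetical inputs: a $30,000 GPU amortized over 3 years.
for u in (0.4, 0.65, 0.75, 0.9):
    cost = owned_cost_per_busy_hour(30_000, 3, u)
    print(f"{u:.0%} utilization -> {recommend_deployment(u)}, ${cost:.2f} per busy hour")
```

The second function makes the "idle hardware is wasted money" point concrete: the lower the utilization, the more each useful hour of an owned GPU costs.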
Which GPU in 2026?
The GPU market in 2026 is dominated by NVIDIA’s Hopper and Blackwell architectures. Here is the landscape:
| GPU | VRAM | Bandwidth | Key Advantage | Status |
|---|---|---|---|---|
| NVIDIA B200 | 192 GB HBM3e | 8 TB/s | ~4x H100 inference throughput | Constrained supply, waitlists |
| NVIDIA H200 | 141 GB HBM3e | 4.8 TB/s | H100 upgrade, strong availability | 4-8 week lead time |
| NVIDIA H100 SXM5 | 80 GB HBM3 | 3.35 TB/s | Production workhorse, best availability | Standard choice in 2026 |
| NVIDIA A100 | 80 GB HBM2e | 2 TB/s | Solid mid-range, falling prices | Under $1/hr spot on some clouds |
| RTX 4090 (consumer) | 24 GB GDDR6X | 1 TB/s | Best FLOPS/$ for small models | Local/hobbyist use only |
Practical recommendations:
- Default choice: H100 SXM5. Best availability, proven in production, supported by every inference engine.
- Future-proofing: B200, if you can get one. It delivers roughly 4x the inference throughput of an H100, but supply remains constrained through mid-2026.
- Budget option: A100 spot instances. Performance is solid and prices are falling below $1/hr on some providers.
- Local dev: RTX 4090. Incredible FLOPS-per-dollar for models under 24 GB, but not suitable for production serving.
The buy vs rent math for one H100: Buying an H100 costs approximately $30,000. Renting one at $5/hr means break-even at roughly 6,000 hours. That is about 3.3 years at 5 hours per day, or a little over 8 months running around the clock, before accounting for power, cooling, and facilities. This is why utilization matters so much.
Sources: Inworld AI — GPU Cloud Comparison, April 2026 | WhiteFiber — GPU Guide
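The buy vs rent arithmetic for a single H100 is easy to rerun with your own numbers. The sketch below uses the purchase price and rental rate quoted above; the function names are made up for this example, and like the text it leaves out power, cooling, and facilities.

```python
def break_even_hours(purchase_price: float, rental_rate_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the GPU outright."""
    return purchase_price / rental_rate_per_hour


def break_even_years(
    purchase_price: float, rental_rate_per_hour: float, hours_per_day: float
) -> float:
    """Calendar years to reach break-even at a given daily usage."""
    hours = break_even_hours(purchase_price, rental_rate_per_hour)
    return hours / hours_per_day / 365


price, rate = 30_000, 5.0  # figures from the text above
print(f"break-even after {break_even_hours(price, rate):,.0f} rented hours")
for daily_hours in (5, 12, 24):
    years = break_even_years(price, rate, daily_hours)
    print(f"{daily_hours:>2} h/day -> {years:.2f} years")
```

Varying `hours_per_day` shows how quickly the payback period shrinks as the card gets busier.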
Cloud Providers: Who Has the Best GPU Pricing?
This is where the market gets interesting. Specialist GPU cloud providers are 40-85% cheaper than the hyperscalers for equivalent GPU workloads. AWS, GCP, and Azure add egress fees, storage premiums, and networking charges that inflate the real cost significantly.
| Provider | GPUs Available | Best For | Notes |
|---|---|---|---|
| CoreWeave | H100, B200, A100 | Large training, enterprise | Kubernetes-native, HIPAA/SOC2/ISO27001 |
| Lambda Labs | H100, B200 | Researchers, flexible inference | Transparent pricing, strong deep learning images |
| RunPod | H100, A100, RTX | Cost-conscious teams | Spot pricing, wide GPU range |
| Spheron | H100 SXM5 | Cheapest spot pricing | H100 spot from $1.03/hr (May 2026) |
| AWS (p5) | H100 | Regulated orgs, existing cloud footprint | $4.10-6.88/hr/GPU on-demand |
| GCP (A3 High) | H100 | GKE/Vertex AI integration | $11-16/hr/GPU; most expensive |
| Azure | H100, B200 | Enterprise governance, BYOK | Quantum-2 InfiniBand, 3.2 Tbps |
Pricing Snapshot (May-June 2026)
| GPU | Provider | On-Demand (/hr) | Spot (/hr) | Est. Cost/M Tokens |
|---|---|---|---|---|
| H100 SXM5 | Spheron | ~$2.90 | $1.03 | $0.47 |
| H100 SXM5 | AWS (p5) | $4.10-6.88 | Varies | --- |
| B200 | Various | ~$4-5 | $2.12 | $0.42 |
| A100 | Various | ~$1.50-2.00 | Under $1.00 | $0.57 |
The takeaway: If you are paying AWS or GCP prices for GPU inference, you are overpaying. Switch to a specialist provider and your costs drop by half or more. Use hyperscalers only when you have specific compliance requirements (HIPAA, FedRAMP) or deep integration with their ecosystem (GKE, SageMaker).
Sources: Spheron — GPU Pricing, May 2026 | Inworld AI — GPU Cloud Comparison, April 2026
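An hourly GPU price turns into a cost per million tokens once you know sustained throughput. The article does not state the throughput behind the estimates in the pricing table, so the sketch below treats it as an input you measure yourself; the function names and the 1,000 tokens/s figure are hypothetical.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


def implied_tokens_per_second(gpu_price_per_hour: float, cost_per_million: float) -> float:
    """Inverse view: throughput needed to hit a target cost per million tokens."""
    return gpu_price_per_hour / cost_per_million * 1_000_000 / 3600


# Hypothetical throughput; replace with a number measured on your own workload.
measured_tps = 1_000
for label, hourly in (("spot", 1.03), ("on-demand", 2.90)):
    cost = cost_per_million_tokens(hourly, measured_tps)
    print(f"H100 {label} at ${hourly}/hr -> ${cost:.2f} per million tokens")
```

The inverse helper is useful for sanity-checking a quoted per-token price: it tells you what throughput a provider's estimate quietly assumes.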
The Inference Stack: vLLM vs SGLang vs TensorRT-LLM
Choosing the right GPU is only half the battle. The software stack that serves your model determines your actual throughput, latency, and operational complexity. In 2026, the inference engine landscape has consolidated around three serious contenders.
The 2026 Recommendation Hierarchy
- Start with vLLM: widest model support, no compilation step, competitive throughput, best documentation. This is the safe default for any team.
- Move to TensorRT-LLM when your model is stable for months and you need maximum throughput. Expect a 20-30% throughput gain over vLLM, but a 28-minute compilation step that must be rerun whenever the model changes.
- Use SGLang when individual request latency matters most (real-time chat, structured output generation, shared-prefix workloads). It excels at minimizing time-to-first-token.
- Avoid TGI (Hugging Face): it is officially in maintenance mode. Hugging Face themselves recommend migrating to vLLM or SGLang.
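The hierarchy above can be written down as a small selection function. This is a sketch, not an official decision procedure: the `Workload` fields and the order in which the checks run are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    model_stable_for_months: bool  # no model swaps expected soon
    needs_max_throughput: bool     # hardware is the bottleneck
    latency_critical: bool         # time-to-first-token dominates


def pick_engine(w: Workload) -> str:
    """Return the engine the hierarchy points to for this workload."""
    # Assumed ordering: recompilation only pays off for a stable model.
    if w.model_stable_for_months and w.needs_max_throughput:
        return "TensorRT-LLM"
    if w.latency_critical:
        return "SGLang"
    return "vLLM"  # the safe default


print(pick_engine(Workload(False, False, False)))
print(pick_engine(Workload(True, True, False)))
print(pick_engine(Workload(False, True, True)))
```

Note that a workload wanting maximum throughput on a model that still changes often falls through to SGLang or vLLM here, because the compilation step would have to be rerun on every change.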
Benchmark Comparison
| Engine | Throughput | Latency (TTFT) | Model Support | Compilation | Best For |
|---|---|---|---|---|---|
| vLLM | 4/5 | 3/5 | 5/5 | None | General production, wide model coverage |
| SGLang | 3/5 | 5/5 | 4/5 | None | Real-time chat, structured generation |
| TensorRT-LLM | 5/5 | 4/5 | 3/5 | ~28 min | Maximum throughput, stable models |
| llama.cpp | 2/5 | 3/5 | 5/5 | None | CPU, local, edge, consumer GPU |
| Ollama | 2/5 | 3/5 | 4/5 | None | Local development only; not for production |
| TGI | 3/5 | 3/5 | 4/5 | None | Maintenance mode; migrate away |
Benchmark note: On H100 SXM5, TensorRT-LLM delivers the best throughput and competitive latency (SGLang leads on time-to-first-token in the table above) but requires compilation. On the Blackwell B200, TensorRT-LLM reaches 1,000 tokens per second per user on Llama 4 Maverick, which makes it the clear choice for high-volume production workloads on next-gen hardware.
Sources: Spheron — Inference Benchmarks, March 2026 | The AI Engineer — Inference Showdown, March 2026
Models Worth Self-Hosting in 2026
Not every model justifies the infrastructure investment. These are the open-weight models that make self-hosting worthwhile in 2026:
| Model | Params (Total / Active) | Context Window | Why Self-Host? |
|---|---|---|---|
| Qwen3-Coder-30B-A3B | 30B / 3B MoE | 256K | Best coding MoE for 32-64 GB hardware |
| Qwen3.5-35B-A3B | 35B / 3B MoE | 262K | Best general-purpose MoE at this size |
| Qwen3.5-122B-A10B | 122B / 10B MoE | 128K | Near-frontier performance, needs 64-96 GB |
| Nemotron-30B (Omni) | 30B / 3B MoE | 1M | Best multimodal + speed; Mamba2 hybrid |
| DeepSeek V4 Pro | Large MoE | --- | High Intelligence Index score |
| Kimi K2.6 | MoE | --- | Top open-weight Intelligence Index (54) |
Why these models specifically? They are Mixture-of-Experts architectures, which means they activate only a fraction of their total parameters during inference. A 35B MoE model with 3B active parameters does roughly the per-token compute of a dense 3B model while delivering quality closer to a 30B+ dense model. The full set of weights still has to fit in GPU memory, so size your VRAM for the total parameter count, not the active one. This makes them efficient to self-host: you get frontier-tier quality at a fraction of the compute cost.
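A back-of-the-envelope sketch helps separate the two quantities that matter for an MoE model: memory scales with total parameters, compute with active parameters. The figures below are approximations. The sketch assumes roughly 2 FLOPs per active parameter per generated token (a common rule of thumb), counts 1 GB as 10^9 bytes, and ignores KV cache, activations, and runtime overhead; the function names are made up for this example.

```python
def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return total_params_billion * bytes_per_param


def forward_flops_per_token(active_params_billion: float) -> float:
    """Rule-of-thumb forward-pass FLOPs per token: about 2 per active parameter."""
    return 2 * active_params_billion * 1e9


total_b, active_b = 35, 3  # parameter counts from the table above
for label, nbytes in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
    print(f"{label} weights: ~{weight_memory_gb(total_b, nbytes):.1f} GB")

moe = forward_flops_per_token(active_b)
dense = forward_flops_per_token(total_b)
print(f"per-token compute vs dense {total_b}B: {moe / dense:.1%}")
```

The takeaway is that quantization decides whether the model fits on a given card, while the active parameter count decides how fast it runs once it does.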
The Nemotron-30B Omni deserves special mention: its Mamba2 hybrid architecture and 1M context window make it uniquely suited for multimodal workloads (text, image, audio) on a single GPU.
The Decision Matrix
Here is the consolidated decision framework based on your specific scenario:
| Scenario | Recommendation |
|---|---|
| Early-stage / variable load | Cloud spot (RunPod, Spheron) + vLLM |
| Production, growing traffic | CoreWeave or Lambda H100/B200 + vLLM |
| Max performance, stable model | TensorRT-LLM or NVIDIA NIM container |
| High utilization + data residency | Own H100/B200 servers + vLLM |
| Multimodal / omni needs | Nemotron-30B Omni on H100 cloud |
| Apple Silicon / on-device | MLX + Qwen3-Coder-30B (no cloud needed) |
| Regulated (HIPAA, ISO27001) | CoreWeave or Azure + BYOK |
Back to Our Startup
Let us return to the scenario we opened with. That ten-person startup with the $8,000/month inference bill?
Here is what they would do:
- Switch from MaaS to self-hosted inference. Rent an H100 SXM5 on Spheron at $1.03/hr spot (or $2.90 on-demand). Run Qwen3.5-35B-A3B through vLLM. Their per-token cost drops from the API’s premium pricing to roughly $0.47 per million tokens.
- Use a specialist GPU cloud, not a hyperscaler. Spheron, RunPod, or Lambda Labs will save them 40-85% compared to AWS or GCP for the same GPU.
- Start with vLLM, no optimization needed yet. At their traffic level, vLLM’s throughput on a single H100 is more than sufficient. They can migrate to TensorRT-LLM later if throughput becomes a bottleneck.
- Monitor utilization. If they consistently hit 80%+ utilization, they should start planning an on-premise purchase. If they sit at 40-60%, staying on cloud spot instances is the right call.
The result: their monthly inference bill drops from $8,000 to roughly $800-1,500, depending on traffic patterns. That is a 5-10x cost reduction with better latency, full data control, and no rate limits.
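A rough way to sanity-check a monthly figure like this is to blend the spot and on-demand rates quoted earlier. The sketch below assumes a single GPU that stays up all month (730 hours), which the scenario does not state explicitly, and it ignores storage, egress, and engineering time; `spot_fraction` is a hypothetical knob for how much of the month runs on spot capacity.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def monthly_gpu_cost(
    spot_rate: float,
    on_demand_rate: float,
    spot_fraction: float,
    gpus: int = 1,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Monthly bill for always-on GPUs split between spot and on-demand hours."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0.0 and 1.0")
    blended_rate = spot_fraction * spot_rate + (1 - spot_fraction) * on_demand_rate
    return blended_rate * hours * gpus


# Rates from the pricing snapshot: $1.03/hr spot, about $2.90/hr on-demand.
for fraction in (1.0, 0.75, 0.5):
    bill = monthly_gpu_cost(1.03, 2.90, fraction)
    print(f"{fraction:.0%} spot -> ${bill:,.0f} per month")
```

On these assumptions an all-spot month lands near the low end of the range quoted above, and the bill climbs as more on-demand hours are mixed in.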
Why This Stack?
A few words on the reasoning behind these recommendations, in case you are weighing alternatives:
Why specialist clouds over hyperscalers? AWS, GCP, and Azure charge premium prices for GPU instances because they bundle networking, storage, compliance, and ecosystem lock-in into the price. If you do not need those extras, you are subsidizing features you do not use. Specialist providers give you the same GPU at a fraction of the cost.
Why vLLM as the default? It works with virtually every model architecture out of the box. No compilation step means you can swap models in minutes. The documentation is excellent, the community is active, and the performance is competitive. You only need TensorRT-LLM’s 20-30% throughput gain when you are pushing the limits of your hardware, and when you are, the 28-minute compilation step is worth it.
Why MoE models? Mixture-of-Experts architectures give you the quality of a large model at the per-token compute cost of a small one. Qwen3.5-35B-A3B activates only 3 billion of its 35 billion parameters per token. This means you get near-frontier performance at a throughput a dense 35B model could not match on the same hardware, as long as all 35 billion parameters fit in memory.
Why not just use Ollama for everything? Ollama is fantastic for local development and experimentation. It is not designed for production serving at scale. It lacks the batching, scheduling, and throughput optimizations that vLLM, SGLang, and TensorRT-LLM provide. Use Ollama on your laptop; use vLLM in production.
The GPU market moves fast. The inference stack evolves faster. But the fundamental tradeoff (rent vs own, cloud vs metal, flexibility vs control) is the same decision every AI team faces as they scale. Pick the right tool for your utilization, your timeline, and your constraints. Everything else follows.