Skip to main content

Cloud vs Iron: Where to Run Your SOTA Models in 2026

Should you rent GPUs or buy your own? A practical engineer's guide to choosing between cloud and on-premise infrastructure for running state-of-the-art open-weight LLMs in 2026 — with pricing tables, inference stack benchmarks, and a decision matrix.

11 min
cloud vs iron where to run sota models 2026

Your team just shipped an AI feature. Users love it. Traffic is climbing. The OpenAI bill is eating your runway. Someone asks the question that every AI team eventually faces: “Should we run this ourselves?”

This is the guide that answers it --- not with theory, but with real numbers, real hardware, and a decision framework you can use today.


TL;DR

  • Cloud wins for most teams. If your GPU utilization sits below 70% (and it probably does), renting is cheaper and simpler.
  • Specialist GPU clouds are 40—85% cheaper than AWS/GCP/Azure. Use CoreWeave, Lambda Labs, RunPod, or Spheron instead of hyperscalers unless you need specific compliance features.
  • Start with vLLM. It has the widest model support, no compilation step, and competitive throughput. Move to TensorRT-LLM only when your model is stable and you need maximum performance.
  • H100 SXM5 is the practical default GPU in 2026. The B200 is faster but supply-constrained.
  • On-premise breaks even at 80%+ sustained utilization over a 3-year horizon --- or in under 4 months for high-utilization workloads.

The Scenario

Imagine a ten-person AI startup in mid-2026. They have been running Qwen3.5-35B behind a chatbot API, paying per token to a model-as-a-service provider. Their monthly inference bill just crossed $8,000. The CFO wants to cut costs. The CTO wants to own the infrastructure. The engineering team wants to stop worrying about rate limits and data leaving their VPC.

They need to decide: rent GPUs from a cloud provider, or buy their own hardware?

This is the decision we will walk through. Along the way, we will cover which GPU to pick, which inference engine to use, which cloud provider offers the best deal, and which open-weight models are actually worth self-hosting right now.


The Utilization Rule

Before we compare prices or hardware, there is one number that matters more than anything else: your GPU utilization rate.

The research is clear:

UtilizationWinnerWhy
Under 70%CloudYou pay only for what you use. Idle hardware is wasted money.
70—80%Toss-upDepends on time horizon, compliance needs, and team expertise.
80%+ sustainedOn-premiseHardware pays for itself. Up to 18x cost advantage per million tokens over MaaS APIs over 5 years.

Most production teams operate at 40—65% utilization due to traffic variability. This means cloud wins for the majority of use cases. But if you are running batch inference jobs 24/7 or serving a high-traffic API with predictable load, owning hardware becomes compelling.

The break-even math is surprisingly aggressive for high-utilization workloads: on-premise infrastructure can break even in under 4 months compared to hyperscaler pricing, according to Lenovo’s 2026 TCO whitepaper.

Sources: Spheron — On-Premise vs Cloud, April 2026 | Lenovo Press — TCO Whitepaper, 2026


Which GPU in 2026?

The GPU market in 2026 is dominated by NVIDIA’s Hopper and Blackwell architectures. Here is the landscape:

GPUVRAMBandwidthKey AdvantageStatus
NVIDIA B200192 GB HBM3e8 TB/s~4x H100 inference throughputConstrained supply, waitlists
NVIDIA H200141 GB HBM3e4.8 TB/sH100 upgrade, strong availability4—8 week lead time
NVIDIA H100 SXM580 GB HBM33.35 TB/sProduction workhorse, best availabilityStandard choice in 2026
NVIDIA A10080 GB HBM2e2 TB/sSolid mid-range, falling pricesUnder $1/hr spot on some clouds
RTX 4090 (consumer)24 GB GDDR6X1 TB/sBest FLOPS/$ for small modelsLocal/hobbyist use only

Practical recommendations:

  • Default choice: H100 SXM5. Best availability, proven in production, supported by every inference engine.
  • Future-proofing: B200, if you can get one. It delivers roughly 4x the inference throughput of an H100, but supply remains constrained through mid-2026.
  • Budget option: A100 spot instances. Performance is solid and prices are falling below $1/hr on some providers.
  • Local dev: RTX 4090. Incredible FLOPS-per-dollar for models under 24 GB, but not suitable for production serving.

The buy vs rent math for one H100: Buying an H100 costs approximately $30,000. Renting one at $5/hr means break-even at roughly 6,000 hours --- about 3.4 years at 5 hours per day --- before accounting for power, cooling, and facilities. This is why utilization matters so much.

Sources: Inworld AI — GPU Cloud Comparison, April 2026 | WhiteFiber — GPU Guide


Cloud Providers: Who Has the Best GPU Pricing?

This is where the market gets interesting. Specialist GPU cloud providers are 40—85% cheaper than the hyperscalers for equivalent GPU workloads. AWS, GCP, and Azure add egress fees, storage premiums, and networking charges that inflate the real cost significantly.

ProviderGPUs AvailableBest ForNotes
CoreWeaveH100, B200, A100Large training, enterpriseKubernetes-native, HIPAA/SOC2/ISO27001
Lambda LabsH100, B200Researchers, flexible inferenceTransparent pricing, strong deep learning images
RunPodH100, A100, RTXCost-conscious teamsSpot pricing, wide GPU range
SpheronH100 SXM5Cheapest spot pricingH100 spot from $1.03/hr (May 2026)
AWS (p5)H100Regulated orgs, existing cloud footprint$4.10—6.88/hr/GPU on-demand
GCP (A3 High)H100GKE/Vertex AI integration$11—16/hr/GPU --- most expensive
AzureH100, B200Enterprise governance, BYOKQuantum-2 InfiniBand, 3.2 Tbps

Pricing Snapshot (May—June 2026)

GPUProviderOn-Demand (/hr)Spot (/hr)Est. Cost/M Tokens
H100 SXM5Spheron~$2.90$1.03$0.47
H100 SXM5AWS (p5)$4.10—6.88Varies---
B200Various~$4—5$2.12$0.42
A100Various~$1.50—2.00Under $1.00$0.57

The takeaway: If you are paying AWS or GCP prices for GPU inference, you are overpaying. Switch to a specialist provider and your costs drop by half or more. Use hyperscalers only when you have specific compliance requirements (HIPAA, FedRAMP) or deep integration with their ecosystem (GKE, SageMaker).

Sources: Spheron — GPU Pricing, May 2026 | Inworld AI — GPU Cloud Comparison, April 2026


The Inference Stack: vLLM vs SGLang vs TensorRT-LLM

Choosing the right GPU is only half the battle. The software stack that serves your model determines your actual throughput, latency, and operational complexity. In 2026, the inference engine landscape has consolidated around three serious contenders.

The 2026 Recommendation Hierarchy

  1. Start with vLLM --- widest model support, no compilation step, competitive throughput, best documentation. This is the safe default for any team.
  2. Move to TensorRT-LLM --- when your model is stable for months and you need maximum throughput. Expect a 20—30% throughput gain over vLLM, but a 28-minute compilation step that must be rerun whenever the model changes.
  3. Use SGLang --- when individual request latency matters most (real-time chat, structured output generation, shared-prefix workloads). It excels at minimizing time-to-first-token.
  4. Avoid TGI (HuggingFace) --- officially in maintenance mode. HuggingFace themselves recommend migrating to vLLM or SGLang.

Benchmark Comparison

EngineThroughputLatency (TTFT)Model SupportCompilationBest For
vLLM4/53/55/5NoneGeneral production, wide model coverage
SGLang3/55/54/5NoneReal-time chat, structured generation
TensorRT-LLM5/54/53/5~28 minMaximum throughput, stable models
llama.cpp2/53/55/5NoneCPU, local, edge, consumer GPU
Ollama2/53/54/5NoneLocal development only --- not for production
TGI3/53/54/5NoneMaintenance mode --- migrate away

Benchmark note: On H100 SXM5, TensorRT-LLM delivers the best throughput and latency but requires compilation. On the Blackwell B200, TensorRT-LLM reaches 1,000 tokens per second per user on Llama 4 Maverick --- a staggering number that makes it the clear choice for high-volume production workloads on next-gen hardware.

Sources: Spheron — Inference Benchmarks, March 2026 | The AI Engineer — Inference Showdown, March 2026


Models Worth Self-Hosting in 2026

Not every model justifies the infrastructure investment. These are the open-weight models that make self-hosting worthwhile in 2026:

ModelParams (Total / Active)Context WindowWhy Self-Host?
Qwen3-Coder-30B-A3B30B / 3B MoE256KBest coding MoE for 32—64 GB hardware
Qwen3.5-35B-A3B35B / 3B MoE262KBest general-purpose MoE at this size
Qwen3.5-122B-A10B122B / 10B MoE128KNear-frontier performance, needs 64—96 GB
Nemotron-30B (Omni)30B / 3B MoE1MBest multimodal + speed; Mamba2 hybrid
DeepSeek V4 ProLarge MoE---High Intelligence Index score
Kimi K2.6MoE---Top open-weight Intelligence Index (54)

Why these models specifically? They are Mixture-of-Experts architectures, which means they activate only a fraction of their total parameters during inference. A 35B MoE model with 3B active parameters runs nearly as fast as a dense 3B model while delivering quality closer to a 30B+ dense model. This makes them incredibly efficient to self-host --- you get frontier-tier quality at a fraction of the compute cost.

The Nemotron-30B Omni deserves special mention: its Mamba2 hybrid architecture and 1M context window make it uniquely suited for multimodal workloads (text, image, audio) on a single GPU.


The Decision Matrix

Here is the consolidated decision framework based on your specific scenario:

ScenarioRecommendation
Early-stage / variable loadCloud spot (RunPod, Spheron) + vLLM
Production, growing trafficCoreWeave or Lambda H100/B200 + vLLM
Max performance, stable modelTensorRT-LLM or NVIDIA NIM container
High utilization + data residencyOwn H100/B200 servers + vLLM
Multimodal / omni needsNemotron-30B Omni on H100 cloud
Apple Silicon / on-deviceMLX + Qwen3-Coder-30B (no cloud needed)
Regulated (HIPAA, ISO27001)CoreWeave or Azure + BYOK

Back to Our Startup

Let us return to the scenario we opened with. That ten-person startup with the $8,000/month inference bill?

Here is what they would do:

  1. Switch from MaaS to self-hosted inference. Rent an H100 SXM5 on Spheron at $1.03/hr spot (or $2.90 on-demand). Run Qwen3.5-35B-A3B through vLLM. Their per-token cost drops from the API’s premium pricing to roughly $0.47 per million tokens.

  2. Use a specialist GPU cloud, not a hyperscaler. Spheron, RunPod, or Lambda Labs will save them 40—85% compared to AWS or GCP for the same GPU.

  3. Start with vLLM, no optimization needed yet. At their traffic level, vLLM’s throughput on a single H100 is more than sufficient. They can migrate to TensorRT-LLM later if throughput becomes a bottleneck.

  4. Monitor utilization. If they consistently hit 80%+ utilization, they should start planning an on-premise purchase. If they sit at 40—60%, staying on cloud spot instances is the right call.

The result: their monthly inference bill drops from $8,000 to roughly $800—1,500, depending on traffic patterns. That is a 5—10x cost reduction with better latency, full data control, and no rate limits.


Why This Stack?

A few words on the reasoning behind these recommendations, in case you are weighing alternatives:

Why specialist clouds over hyperscalers? AWS, GCP, and Azure charge premium prices for GPU instances because they bundle networking, storage, compliance, and ecosystem lock-in into the price. If you do not need those extras, you are subsidizing features you do not use. Specialist providers give you the same GPU at a fraction of the cost.

Why vLLM as the default? It works with virtually every model architecture out of the box. No compilation step means you can swap models in minutes. The documentation is excellent, the community is active, and the performance is competitive. You only need TensorRT-LLM’s 20—30% throughput gain when you are pushing the limits of your hardware --- and when you are, the 28-minute compilation step is worth it.

Why MoE models? Mixture-of-Experts architectures give you the quality of a large model at the inference cost of a small one. Qwen3.5-35B-A3B activates only 3 billion of its 35 billion parameters per token. This means you get near-frontier performance on hardware that would struggle with a dense 35B model.

Why not just use Ollama for everything? Ollama is fantastic for local development and experimentation. It is not designed for production serving at scale. It lacks the batching, scheduling, and throughput optimizations that vLLM, SGLang, and TensorRT-LLM provide. Use Ollama on your laptop; use vLLM in production.


Sources

#SourceDateURL
1Spheron: LLM On-Premise vs CloudApril 2026https://www.spheron.network/blog/llm-inference-on-premise-vs-cloud/
2Spheron: GPU Cloud Pricing 2026May 2026https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/
3Inworld: Best GPU Cloud for AI InferenceApril 2026https://inworld.ai/resources/best-gpu-cloud-ai-inference
4Lenovo Press: On-Premise vs Cloud TCO 20262026https://lenovopress.lenovo.com/lp2368-on-premise-vs-cloud-generative-ai-total-cost-of-ownership-2026-edition
5Spheron: vLLM vs TensorRT vs SGLangMarch 2026https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/
6The AI Engineer: Inference Engine ShowdownMarch 2026https://theaiengineer.substack.com/p/vllm-vs-ollama-vs-sglang-vs-tensorrt
7WhiteFiber: Best GPUs for LLM Inference2025/2026https://www.whitefiber.com/compare/best-gpus-for-llm-inference-in-2025
8BenchLM: Nemotron vs Qwen3.5-27B2026https://benchlm.ai/compare/nemotron-3-nano-omni-30b-a3b-vs-qwen3-5-27b
9NVIDIA Nemotron 3 Nano Omni ReviewApril 2026https://www.buildfastwithai.com/blogs/nvidia-nemotron-3-nano-omni-2026
10Awesome Agents: Qwen3.5-35B vs NemotronFeb 2026https://awesomeagents.ai/tools/qwen-3-5-35b-a3b-vs-nemotron-3-nano/

The GPU market moves fast. The inference stack evolves faster. But the fundamental tradeoff --- rent vs own, cloud vs metal, flexibility vs control --- is the same decision every AI team faces as they scale. Pick the right tool for your utilization, your timeline, and your constraints. Everything else follows.

Tags

#LLM hosting 2026 #GPU for AI #cloud vs on-premise AI #vLLM #TensorRT-LLM #SGLang #H100 #B200 #inference stack #self-hosted LLM #GPU pricing #open-weight models

Got a web project?

Renard Digital supports you from A to Z: site, domain, email.

Get in touch