Cheapest AI Inference Providers (June 2026)
By Arvid Andersson
The cheapest inference provider depends on the model you run, and the numbers move month to month. This page pins prices to one model, gpt-oss-120B, a current open-source frontier model that most serverless providers host, so the comparison is apples to apples. Every figure below is taken from the provider's own pricing page on the date noted, not estimated.
Price comparison: gpt-oss-120B
Per 1M tokens, sorted by output price (output tokens usually dominate cost). Verified June 6, 2026. See also the filterable inference APIs table.
| Provider | Input / 1M | Output / 1M | Notable for |
|---|---|---|---|
| DeepInfra | $0.039 | $0.19 | Broad open-source catalog |
| Novita | $0.05 | $0.25 | Multi-modal APIs in one platform |
| SiliconFlow | $0.05 | $0.45 | 200+ open LLM and multimodal models |
| Groq | $0.15 | $0.60 | Custom LPU hardware, low TTFT |
| Together.ai | $0.15 | $0.60 | Inference + fine-tuning, same API |
| Fireworks | $0.15 | $0.60 | Production inference, function calling |
Sources: each provider's own pricing or model page, checked June 6, 2026. Prices change frequently; re-check before committing to high-volume usage.
How to read this
On gpt-oss-120B, DeepInfra lists the lowest input and output rates of the group. Novita is close behind on both, while SiliconFlow matches Novita on input price but charges $0.45 per 1M output tokens. Groq, Together.ai, and Fireworks publish the same rate on this model and differentiate on other axes: Groq on first-token latency from its LPU hardware, Together.ai and Fireworks on fine-tuning and production features on the same platform.
Output tokens cost more than input tokens at every provider here, so for generation-heavy workloads the output column matters most. For prompt-heavy workloads (long context, short answers) the input column carries more weight. The table keeps the two rates separate so you can weight them by your own traffic mix rather than rely on a fixed assumption.
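To make the input/output trade-off concrete, here is a minimal sketch that prices a workload against each provider's rates from the table. The function names (`workload_cost`, `rank_providers`) and the example volume of 500M input and 100M output tokens are illustrative assumptions, not figures from any provider.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above
# as checked on June 6, 2026. Re-check before relying on them.
PRICES = {
    "DeepInfra": (0.039, 0.19),
    "Novita": (0.05, 0.25),
    "SiliconFlow": (0.05, 0.45),
    "Groq": (0.15, 0.60),
    "Together.ai": (0.15, 0.60),
    "Fireworks": (0.15, 0.60),
}


def workload_cost(input_tokens: int, output_tokens: int, provider: str) -> float:
    """Return the USD cost of a workload at one provider's listed rates."""
    input_price, output_price = PRICES[provider]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def rank_providers(input_tokens: int, output_tokens: int) -> list[tuple[str, float]]:
    """Return (provider, cost) pairs sorted from cheapest to most expensive."""
    costs = {name: workload_cost(input_tokens, output_tokens, name) for name in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical prompt-heavy month: 500M input tokens, 100M output tokens.
for name, cost in rank_providers(500_000_000, 100_000_000):
    print(f"{name:12s} ${cost:,.2f}")
```

Changing the two token counts to match your own traffic shows how much the ranking gap widens or narrows as the mix shifts between input and output.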
Cheapest is not always best
Price is a starting filter, not the whole decision. A provider that wins on per-token cost may not host the specific model you need, may have lower rate limits, or may be slower for interactive use where time-to-first-token matters more than raw throughput. Data residency is another constraint: if a workload is GDPR-sensitive, EU-hosted options (Nebius, Scaleway, Mistral La Plateforme) matter more than a few cents per million tokens. The European providers page lists those with hosting regions.
For a fuller picture across speed, model catalog, and API compatibility, see AI Inference API Providers Compared.
Why these numbers move
Per-token prices shift as providers compete on hardware utilisation and model demand. A clear example: gpt-oss-120B on DeepInfra was listed around $0.08 per 1M blended in March 2026 and had dropped to roughly $0.05 blended by June 2026. Blended figures depend on the input-to-output ratio assumed, so compare them only with other blended figures, not with the separate input and output rates in the table. Drops like this are the norm, not the exception. Treat any pricing table, including this one, as a snapshot, and check the provider's own page for the model you actually plan to run.
Frequently asked questions
What is the cheapest AI inference provider?
On gpt-oss-120B, a current open-source frontier model, DeepInfra lists the lowest per-token price among the six providers checked here: $0.039 per 1M input tokens and $0.19 per 1M output tokens (sourced from its model page on June 6, 2026). Novita follows closely on both rates; SiliconFlow is close on input price but more than double DeepInfra's rate on output. This is scoped to one model on one date. Prices are model-specific and change often, so the right answer depends on the model you run and is worth re-checking before committing to high-volume usage.
Why do inference prices change so often?
Providers compete on price as hardware utilisation and model demand shift, so per-token rates move month to month. As a concrete example, gpt-oss-120B on DeepInfra was listed around $0.08 per 1M blended in March 2026 and dropped to roughly $0.05 blended by June 2026. Always check the provider's own pricing page for the specific model before relying on a number.
Is the cheapest provider always the best choice?
Not necessarily. Per-token price is one factor; throughput, time-to-first-token, model catalog, rate limits, and data residency all matter. A provider that is cheapest on one model may not host the model you need, or may be slower for interactive workloads. Use price as a starting filter, then weigh speed and model availability for your specific use case.
How is the blended price calculated?
Blended price assumes a fixed ratio of input to output tokens (commonly 3:1) to combine the two rates into a single number for comparison. Because output tokens are usually priced higher than input tokens, the output rate dominates real-world cost on generation-heavy workloads. The table here shows input and output separately so you can compute the blend for your own traffic mix.
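As a rough sketch of that calculation, the function below takes a weighted average of the two rates. The name `blended_price` and its parameters are assumptions for illustration, and the 3:1 default is simply the common convention mentioned above, not a ratio any provider publishes.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_parts: float = 3.0,
    output_parts: float = 1.0,
) -> float:
    """Weighted average of input and output prices (USD per 1M tokens).

    input_parts:output_parts is the assumed traffic ratio, 3:1 by default.
    """
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("the ratio must have a positive total")
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Rates from the table above (USD per 1M tokens, June 6, 2026).
deepinfra_input, deepinfra_output = 0.039, 0.19

# The same two rates give different blends under different assumed ratios.
for ratio in [(3, 1), (1, 1), (10, 1)]:
    blend = blended_price(deepinfra_input, deepinfra_output, *ratio)
    print(f"{ratio[0]}:{ratio[1]} blend = ${blend:.3f} per 1M tokens")
```

Because the result moves with the assumed ratio, a blended figure is only comparable to another blended figure computed with the same ratio.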