deepinfra
Run the top AI models using a simple API, pay per use. Low cost, scalable and production ready infrastructure.
DeepInfra hosts 50+ open-source models (Llama 3, Mistral, Mixtral, Gemma, and others) as serverless API endpoints. You send requests to an OpenAI-compatible API and pay per token, with no setup or infrastructure management. The platform auto-scales based on demand.
DeepInfra competes primarily on price and latency, often offering lower per-token costs than other inference providers for the same models. They support text generation, embeddings, image generation, and text-to-speech. The OpenAI-compatible API makes it straightforward to switch from OpenAI by changing the base URL.
Pricing: Per token usage
What is DeepInfra?
DeepInfra is a serverless inference platform that hosts open-weight AI models as API endpoints. You send requests to an OpenAI-compatible API and pay per token, with no infrastructure setup or management required.
Supported Models
DeepInfra hosts roughly 77 open-weight models. Model families include Meta Llama (3.1, 3.2, 4 Scout, 4 Maverick with up to 1M token context), DeepSeek V3, Qwen 2.5 and 3, NVIDIA Nemotron, Mistral, Google Gemma, GLM-5, and Kimi K2.5. Beyond text generation, DeepInfra supports embedding models (BGE-M3), image generation (Flux 3 at $0.07/image), text-to-speech (Chatterbox), and speech recognition. 65 of the 77 models support function calling, and 30 are reasoning models.
API and Integration
The API is compatible with the OpenAI client libraries. If you are already using the OpenAI SDK, switching to DeepInfra requires changing the base URL and API key. The platform supports streaming responses, function calling, JSON mode, and structured output. Framework integrations include LangChain, LlamaIndex, AutoGen, and Vercel AI SDK. Custom model deployment and LoRA adapter support are available for both text and image models.
Infrastructure and Performance
DeepInfra runs on its own US-based infrastructure, including NVIDIA Blackwell HGX B200 systems. The optimization stack includes TensorRT-LLM, speculative decoding, multi-token prediction, and KV-cache-aware routing. Top models reach 200-317 tokens per second output speed with time-to-first-token as low as 0.35 seconds. MoE models on Blackwell with NVFP4 quantization achieve up to 20x cost reduction compared to dense models on older hardware.
Pricing
DeepInfra uses per-token pricing with rates varying by model. Prices range from $0.02 per million tokens (Llama 3.2 3B) to around $1.50 per million tokens for larger models. There are no minimum commitments and you only pay for what you use. Automatic tier progression reduces costs as spending increases. The platform holds SOC 2, ISO 27001, GDPR, and HIPAA compliance certifications.
Who Should Use DeepInfra?
DeepInfra is a good fit for teams that want to use open-weight models without managing GPU infrastructure. The OpenAI-compatible API makes it easy to switch from OpenAI or test multiple open-source models. The Blackwell infrastructure gives it a performance and cost edge for large-scale workloads. It is particularly useful for cost-sensitive production use cases where open-weight models perform well enough for the task.
deepinfra Alternatives
Explore 51 products in the Inference APIs category. View all deepinfra alternatives.
LLMWise
Multi-LLM API orchestration platform for comparing and blending AI models
novita.ai
APIs, Serverless and GPU Instance In One AI Cloud
Nebius
Full-stack AI cloud with GPU infrastructure for training and inference
IonRouter
High-throughput inference API with OpenAI-compatible access to open-source models at half market rate
Is your product missing?