Self-Hosted AI Stack
Run everything on your own infrastructure. For teams that need full control over data, want to avoid API dependencies, or have compliance requirements that rule out third-party services.
Inference Engine
Self-hosted inference means running an open-source model on your own GPUs. You need a serving framework that handles batching, streaming, and model loading efficiently.
The most widely used open-source inference engine. PagedAttention for efficient memory management. OpenAI-compatible API server. Supports most popular open-source models.
The simplest way to run models locally. One command to download and serve. Good for development, testing, and trying models before deploying them with vLLM. Not designed for production serving at scale.
Model serving framework with built-in containerization. Package any model as a production API with autoscaling, batching, and monitoring. More opinionated than raw vLLM.
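Because these engines expose an OpenAI-compatible API, switching from a managed API to self-hosted inference is mostly a matter of pointing your client at a new URL. A minimal sketch of the request shape, assuming a server at localhost:8000 and a placeholder model name:

```python
import json

# Assumption: a vLLM (or compatible) server running locally with its
# OpenAI-compatible endpoint. Substitute your own host and model name.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, user_message: str, stream: bool = True) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,   # stream tokens back as server-sent events
        "max_tokens": 256,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
body = json.dumps(payload)  # POST this to VLLM_URL with Content-Type: application/json
```

The same payload works against a managed API, which is what makes the self-hosted/managed mix in the notes below practical.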
GPU Compute
You need GPUs to run the models. These providers offer on-demand GPU instances where you control the full environment. Cheaper than inference APIs at scale, but you manage the infrastructure.
On-demand and spot GPU instances. Serverless GPU option for bursty workloads. Pre-built templates for vLLM and common models. Popular with indie developers.
Serverless GPU compute with a Python-first developer experience. Define your environment in code, deploy with one command. Handles autoscaling and cold starts.
GPU marketplace with competitive pricing. Rent GPUs from individual providers. Less polished UX but often cheaper than centralized cloud providers.
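Before renting, estimate how much VRAM your model needs. A rough back-of-the-envelope sketch: weights take (parameter count × bytes per parameter), plus overhead for KV cache, activations, and the CUDA context. The 20% overhead factor here is an assumption; real usage depends on batch size and context length.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead_frac: float = 0.2) -> float:
    """Rough VRAM estimate: model weights at the given precision (fp16 = 2
    bytes/param, int4 quantized ~0.5) plus a fudge factor for KV cache and
    runtime overhead. The 20% overhead is an assumption, not a benchmark."""
    weights_gb = params_billion * bytes_per_param  # 1B params * 2 bytes = 2 GB
    return weights_gb * (1 + overhead_frac)

# A 7B model in fp16 (~14 GB of weights) fits on one 24 GB GPU;
# a 70B model in fp16 (~140 GB) needs multiple GPUs or quantization.
print(round(estimate_vram_gb(7), 1))
print(round(estimate_vram_gb(70), 1))
```

Treat this as a sanity check only; benchmark the actual model and serving config before committing to hardware.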
Vector Database
For RAG or search features, you need a self-hostable vector database. All three options below can run on your infrastructure via Docker or Kubernetes.
Written in Rust. Single binary, easy to deploy. Good filtered search and multi-tenancy. Docker or Kubernetes deployment.
Designed for scale. Handles billions of vectors. Kubernetes-native deployment. More complex to operate than Qdrant but stronger at very large scale.
If you already self-host Postgres, add pgvector. No new service to manage. Good up to ~10M vectors. Pair with full-text search for hybrid retrieval.
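The core operation all three stores provide, nearest-neighbor search with metadata filtering, can be sketched in plain Python. The databases add approximate indexes, persistence, and scale on top of this linear scan; the document shape below is illustrative, not any product's actual schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filtered_search(query, docs, top_k=2, **filters):
    """Brute-force filtered vector search: apply metadata filters first,
    then rank the survivors by cosine similarity. A real vector DB does
    the same with an ANN index instead of a linear scan."""
    candidates = [d for d in docs
                  if all(d["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda d: cosine(query, d["vec"]), reverse=True)
    return candidates[:top_k]

docs = [
    {"id": 1, "vec": [1.0, 0.0], "meta": {"tenant": "a"}},
    {"id": 2, "vec": [0.9, 0.1], "meta": {"tenant": "b"}},
    {"id": 3, "vec": [0.0, 1.0], "meta": {"tenant": "a"}},
]
# Restrict the search to tenant "a", then rank by similarity to the query.
print([d["id"] for d in filtered_search([1.0, 0.0], docs, top_k=2, tenant="a")])
```

Filtering before (or during) the index scan is exactly the "good filtered search and multi-tenancy" capability called out above.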
Observability
Self-hosted observability means your traces and logs stay on your infrastructure too. Both options below are open source and self-hostable.
Self-host via Docker or Kubernetes. MIT licensed, no feature gates on the self-hosted version. Tracing, evals, prompt management.
Open source, OpenTelemetry-native. Self-host and pipe traces into your existing Grafana or Datadog setup. Good for teams with established observability infrastructure.
Open-source LLM testing framework. Focuses on detecting hallucinations, bias, and security issues. Self-hostable. Complements tracing tools with evaluation.
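What these tools capture per LLM call is essentially a span: inputs, outputs, latency, and attributes like model name and token counts. A minimal sketch of that record, with field names that are illustrative rather than any tool's actual schema:

```python
import time
import uuid

def traced_call(name, fn, **attrs):
    """Wrap an LLM call and emit a span-like trace record -- the kind of
    data a tracing backend stores. Field names are illustrative only."""
    span = {"id": str(uuid.uuid4()), "name": name, "attributes": attrs}
    start = time.perf_counter()
    try:
        span["output"] = fn()
        span["status"] = "ok"
    except Exception as exc:
        span["status"] = "error"
        span["error"] = repr(exc)
        raise
    finally:
        span["latency_ms"] = (time.perf_counter() - start) * 1000
        print(span)  # in production: export to your self-hosted tracing backend
    return span["output"]

# Usage: wrap the model call; the lambda here stands in for a real client call.
traced_call("chat", lambda: "hi there", model="llama-3.1-8b", prompt_tokens=12)
```

Because the record stays on your side of the wire, prompts and outputs never leave your infrastructure, which is the point of self-hosting observability.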
Things to keep in mind
- Self-hosting trades per-token costs for infrastructure costs and operational work. It makes sense at scale (thousands of requests per day) or when data sovereignty requires it. For small workloads, managed APIs are usually cheaper and simpler.
- vLLM + a GPU instance is the standard starting point. GPU requirements depend on model size: a 7B model fits on a single GPU, larger models (70B+) may need multiple GPUs or quantization. Benchmark your specific model before committing to hardware.
- Open-source models have closed much of the gap with proprietary ones. The Llama, Mistral, Qwen, and DeepSeek families cover most production use cases. Check license terms: some are more permissive than others.
- You can mix self-hosted and managed. Run your inference on your own GPUs but use Langfuse cloud for observability, or vice versa. Not everything needs to be self-hosted.
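The cost tradeoff in the first point can be made concrete with a breakeven estimate. The prices below are placeholders, not quotes; check current rates for your provider and model:

```python
def breakeven_tokens_per_hour(gpu_cost_per_hour: float,
                              api_cost_per_million_tokens: float) -> float:
    """Tokens per hour at which a self-hosted GPU matches managed API
    pricing. Ignores ops time, idle capacity, and egress -- all real costs
    that push the true breakeven point higher."""
    return gpu_cost_per_hour / api_cost_per_million_tokens * 1_000_000

# Placeholder numbers: a $2/hr GPU vs. an API charging $0.50 per million tokens.
tokens = breakeven_tokens_per_hour(2.0, 0.50)
print(f"{tokens:,.0f} tokens/hour to break even")
```

If your sustained throughput is well below that line, a managed API is likely cheaper once you count the operational work.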