vLLM
High-throughput LLM inference engine with PagedAttention for efficient GPU memory usage
vLLM is an open-source inference and serving engine for Large Language Models, originally developed at UC Berkeley. It uses PagedAttention to manage GPU memory efficiently, achieving up to 24x higher throughput compared to Hugging Face Transformers. It supports most popular open-source models including Llama, Mixtral, DeepSeek, and multimodal models like LLaVA. vLLM includes both a fast inference engine and a production-ready OpenAI-compatible serving server, making it a popular choice for self-hosted LLM deployments.
Pricing: Free
vLLM Alternatives
Explore 31 products in the Frameworks & Stacks category. View all vLLM alternatives.
Mastra
TypeScript-first AI framework for building agents, RAG pipelines, and workflows
Ollama
Run large language models locally with a single command
Dify
Easily build and operate generative AI applications. Create Assistants API and GPTs based on any LLMs.
Also listed in
Work on vLLM? Feature it at the top of Frameworks & Stacks.
Is your product missing?