vLLM is a high-throughput and memory-efficient inference and serving engine designed for large language models (LLMs). Developed in the Sky Computing Lab at UC Berkeley, it has become a prominent open-source project supported by a diverse community of contributors.
Key features include:
- State-of-the-art serving throughput - Efficient management of attention key and value memory with PagedAttention.
- Continuous batching - Dynamically batches incoming requests, admitting new sequences as soon as others finish rather than waiting for a full batch to complete; complemented by chunked prefill and automatic prefix caching.
- Flexible model execution - Supports piecewise and full CUDA/HIP graphs.
- Quantization support - Offers multiple reduced-precision formats, including FP8, INT8, and INT4 (e.g., via GPTQ and AWQ checkpoints).
- Optimized attention kernels - Utilizes advanced techniques like FlashAttention and Triton.
- OpenAI-compatible API server - Serves Hugging Face models behind OpenAI-style Completions and Chat Completions endpoints, so existing OpenAI client code works with minimal changes.
- Support for diverse hardware - Compatible with NVIDIA, AMD, and various CPU architectures.
- Multi-LoRA support - Serves many LoRA adapters concurrently on top of a single base model.
- Streaming outputs - Streams tokens to clients as they are generated; structured (guided) output generation is also supported.
- Community contributions - Encourages collaboration and contributions from users.
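The PagedAttention idea above can be pictured with a small toy sketch: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand instead of being reserved up front for the maximum sequence length. All names and sizes below are illustrative assumptions, not vLLM's actual internals.

```python
# Toy illustration of PagedAttention-style KV-cache management.
# Class names, block size, and pool size are hypothetical.

BLOCK_SIZE = 4  # tokens stored per KV-cache block

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def free_blocks(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's block table: logical position -> physical block."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the current one fills up,
        # so a short sequence never reserves memory for its maximum length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def release(self):
        # Finished sequences return their blocks to the shared pool.
        self.allocator.free_blocks(self.block_table)
        self.block_table = []

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(6):               # 6 tokens -> ceil(6 / 4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))      # 2
seq.release()
print(len(allocator.free))       # 8: all blocks returned to the pool
```

Because blocks are freed the moment a sequence completes, memory fragmentation stays low and more concurrent sequences fit in the same GPU memory, which is where the throughput gains come from.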
vLLM supports over 200 model architectures and is designed for both researchers and developers looking to implement efficient LLM serving solutions.
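The continuous batching listed above can be sketched with a toy scheduler: the engine runs one decode step at a time across the whole batch, evicts sequences the moment they finish, and immediately admits waiting requests into the freed slots. This is a simplified, hypothetical illustration, not vLLM's actual scheduler.

```python
from collections import deque

# Toy continuous-batching loop (illustrative only; not vLLM's scheduler).
# Each request is a [name, tokens_remaining] pair.

def serve(requests, max_batch_size):
    waiting = deque(requests)
    running = []
    finished = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into free batch slots between decode steps.
        while waiting and len(running) < max_batch_size:
            running.append(list(waiting.popleft()))
        # One decode step: every running sequence generates one token.
        for req in running:
            req[1] -= 1
        steps += 1
        # Evict finished sequences immediately, freeing their slots
        # for the next requests instead of waiting on the whole batch.
        still_running = []
        for req in running:
            (finished if req[1] == 0 else still_running).append(req)
        running = still_running
    return [name for name, _ in finished], steps

names, steps = serve([("a", 2), ("b", 5), ("c", 1), ("d", 3)], max_batch_size=2)
print(names)   # ['a', 'c', 'b', 'd']
print(steps)   # 6
```

With 11 total tokens and a batch size of 2, six steps is the theoretical minimum; a static batching scheme that waits for the longest sequence in each batch would leave slots idle and take longer.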