<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>LLM Inference Frameworks - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/llm-inference-frameworks/</link><description>LLM Inference Frameworks - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 27 Nov 2025 21:07:50 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/llm-inference-frameworks/" rel="self" type="application/rss+xml"/><item><title>vLLM: Easy, fast, and cheap LLM inference and serving</title><link>https://blog.omagiclee.com/posts/toolkits/llm-inference-engines/vllm/</link><pubDate>Thu, 27 Nov 2025 21:07:50 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/toolkits/llm-inference-engines/vllm/</guid><description><![CDATA[<p><a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer">Docs</a> · <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer">GitHub</a></p>
<p>vLLM is a high-throughput and memory-efficient <strong>inference and serving engine for LLMs</strong>.</p>
<ul>
<li>Run open-source models on vLLM (see the sketch after this list)</li>
<li>Build applications with vLLM</li>
<li>Build vLLM</li>
</ul>
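<p>For a first look at the Python API, a minimal offline-inference sketch might look like the following (the model name and sampling settings are illustrative, not prescriptive):</p>
<pre><code class="language-python"># Minimal sketch: offline inference with vLLM's Python API.
# The model name below is only an example of an open-source checkpoint.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")              # load the model into the engine
outputs = llm.generate(prompts, sampling_params)  # batched generation

for output in outputs:
    print(output.outputs[0].text)                 # first completion per prompt
</code></pre>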
<p>vLLM is fast with:</p>
<ul>
<li>State-of-the-art serving throughput</li>
<li>Efficient management of attention key and value memory with PagedAttention</li>
<li>Continuous batching of incoming requests</li>
<li>Fast model execution with CUDA/HIP graph</li>
<li>Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the configuration sketch after this list)</li>
<li>Optimized CUDA kernels, including integration with FlashAttention and FlashInfer</li>
<li>Speculative decoding</li>
<li>Chunked prefill</li>
</ul>
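<p>Several of these optimizations are exposed as engine arguments. A hedged configuration sketch follows; the checkpoint name is a placeholder, and exact argument availability can vary across vLLM versions:</p>
<pre><code class="language-python"># Sketch: enabling AWQ quantization and chunked prefill via engine arguments.
# The model name is a placeholder for an AWQ-quantized checkpoint.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",              # run AWQ-quantized weights
    enable_chunked_prefill=True,     # split long prefills into smaller chunks
    gpu_memory_utilization=0.9,      # KV-cache budget managed by PagedAttention
)
</code></pre>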
<p>vLLM is flexible and easy to use with:</p>]]></description></item></channel></rss>