Docs · GitHub
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
- Run open-source models on vLLM
- Build appplications with vLLM
- Build vLLM
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU.
- Support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
Installation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
| # Install vLLM in a fresh new environment (Recommended).
# conda: manage python & cuda version
(vllm-cu128) ➜ /workspace conda create -n vllm-cu128 python=3.12
(vllm-cu128) ➜ /workspace conda activate vllm-cu128
# pip install pipx
(vllm-cu128) ➜ /workspace pip install pipx
# pipx install uv
(vllm-cu128) ➜ /workspace pipx install uv
installed package uv 0.10.2, installed using Python 3.12.12
These apps are now globally available
- uv
- uvx
⚠️ Note: '/root/.local/bin' is not on your PATH environment variable. These apps will not be globally accessible until your PATH is updated. Run `pipx ensurepath` to automatically add it, or manually
modify your PATH in your shell's config file (e.g. ~/.bashrc).
done! ✨ 🌟 ✨
(vllm-cu128) ➜ /workspace pipx ensurepath
# configure uv env
(vllm-cu128) ➜ /workspace vim pyproject.toml
[tool.uv]
index-url = "https://mirrors.aliyun.com/pypi/simple/"
# uv pip install vllm
(vllm-cu128) ➜ /workspace uv init vllm --python 3.12
(vllm-cu128) ➜ /workspace uv venv vllm-venv --python 3.12
(vllm-cu128) ➜ /workspace source vllm-venv/bin/activate
(vllm-venv) (vllm-cu128) ➜ /workspace uv pip install vllm --torch-backend=auto
# configure vllm env
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE="/opt/data/private/magiclee/work/modelscope/hub"
# install flash-attn
uv pip install flash-attn --no-build-isolation
|
Offline Batched Inference
- LLM: the main class for running offline inference with vLLM engine.
- SamplingParams: specify the parameters for the sampling process. By default, vLLM will use sampling parameters recommended by model creator by applying the
generation_config.json from the Hugging Face model repository if it exists. However, if vLLM’s default sampling parameters are preferred, please set generation_config="vllm" when creating the LLM instance.
Online Serving with OpenAI-compatible Server
vLLM can be deployed as a server that implements the OpenAI API protocol.
Start the vLLM Server
1
2
3
4
5
6
7
|
# Start vLLM server on the remote server
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve --config config1.yaml
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve --config config1.yaml (change port)
# Test on the local machine
curl http://localhost:8000/v1/models
|
GLM-5
Serving FP8 Model on 8xH200 (or H20) GPUs (141GB × 8)
DeepSeek-V3.2-Speciale
Serving on 8xH200 (or H20) GPUs (141GB × 8) / 8xB200 GPUs
https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html#installing-vllm
Qwen3-Coder-Next
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| # config.yaml
model: Qwen/Qwen3-Coder-Next
served-model-name: Qwen3-Coder-Next # model alias name
host: "127.0.0.1"
port: 8000
uvicorn-log-level: "info"
# Memory Optimization
tensor_parallel_size: 8 # Tensor Parallelism (TP): split the model across multiple GPUs.
max_model_len: 200000 # context length
max_num_seqs: 2 # maximum batch size
use_cuda_graph true
--enable-prefix-caching true
|
Kimi-K2.5
It’s not recommended on deploying Kimi-K2.5 with vLLM on 8x A100-80G , the context length can only be set to 70k and can not use flash-attn due to the head size, MLA architecture and the A100’s capability.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| # config.yaml
model: moonshotai/Kimi-K2___5
served-model-name: Kimi-K25 # model alias name
host: "127.0.0.1"
port: 8002
uvicorn-log-level: "info"
trust_remote_code: True
# Memory Optimization
tensor_parallel_size: 8 # Tensor Parallelism (TP): split the model across multiple GPUs.
max_model_len: 50000 # context length
max_num_seqs: 2 # maximum batch size
gpu-memory-utilization: 0.98
enable_auto_tool_choice: true
tool_call_parser: "default" # 或使用支持的解析器
|
OpenAI Compatible API with vLLM
OpenAI Chat Completions API with vLLM
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| uv pip install openai
from openai import OpenAI
openai_api_key = "omagiclee"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
response = client.chat.completions.create(
model="moonshotai/Kimi-K2.5",
messages=[
{
"role": "system",
"content": "You are a helpful AI Coding assistant."
},
{
"role": "user",
"content": "Write a Python function to calculate the Fibonacci sequence."
}
]
)
|
On Attention Backends
- On NVIDIA CUDA: FLASH_ATTEN or FLASHINFER
doc.vllm.ai/en/latest/features/multimodal_inputs/