Survey
Contents
Evolution of VLA Model Architectures
graph TB
subgraph VLM["VLM as Driving Explainer"]
V1[🖼️ Vision<br/>Input] --> VLM1[🤖 VLMs<br/>Processing]
VLM1 --> E1[💬 Explain<br/>Q&A<br/>Description]
style V1 fill:#e3f2fd
style VLM1 fill:#fff3e0
style E1 fill:#fff8e1
end
subgraph Modular["Modular VLA for AD"]
V2[🖼️ Multimodal<br/>Vision] --> VLM2[🤖 VLMs<br/>Processing]
VLM2 --> IR[🔄 Intermediate<br/>Representation]
IR --> AH[⚙️ Action Head]
AH --> TC1[🎯 Trajectory<br/>Control]
style V2 fill:#e3f2fd
style VLM2 fill:#fff3e0
style IR fill:#f3e5f5
style AH fill:#e8f5e8
style TC1 fill:#fce4ec
end
subgraph EndToEnd["End-to-end VLA for AD"]
V3[🖼️ Multimodal<br/>Vision] --> VLM3[🤖 VLMs<br/>Processing]
VLM3 --> A1[🚗 Action<br/>Output]
style V3 fill:#e3f2fd
style VLM3 fill:#fff3e0
style A1 fill:#e0f2f1
end
subgraph Augmented["Augmented VLA for AD"]
V4[🖼️ Multimodal<br/>Vision] --> RT[🧠 Reasoning VLMs<br/>& Tool-use Agents]
RT --> A2[🚗 Action<br/>Output]
style V4 fill:#e3f2fd
style RT fill:#f3e5f5
style A2 fill:#e0f2f1
end
%% Evolution arrows
VLM -.-> Modular
Modular -.-> EndToEnd
EndToEnd -.-> Augmented
%% Style definitions
classDef titleStyle fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#333
class VLM,Modular,EndToEnd,Augmented titleStyle
Diagram notes:
- VLM as Driving Explainer: a frozen LLM describes the driving scene but produces no control output
- Modular VLA: language is converted into an intermediate representation, which an action head decodes into trajectories or low-level control
- End-to-end VLA: a single multimodal pipeline maps sensor inputs directly to actions
- Augmented VLA: tool-use or CoT VLMs add long-horizon reasoning while keeping the end-to-end control path
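The modular VLA pattern in the diagram above can be sketched in a few lines. This is a toy illustration, not any surveyed model's implementation: `vlm_processing`, `action_head`, and the maneuver/speed values are all hypothetical stand-ins for the VLM stage and the trajectory decoder.

```python
# Minimal sketch of the Modular VLA pipeline: the VLM stage emits an
# intermediate representation (a symbolic plan), and a separate action head
# decodes it into a trajectory. All names and values are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntermediateRepresentation:
    maneuver: str          # e.g. "keep_lane", "stop"
    target_speed: float    # m/s

def vlm_processing(scene_description: str) -> IntermediateRepresentation:
    """Stand-in for the VLM: maps a scene description to a symbolic plan."""
    if "red light" in scene_description:
        return IntermediateRepresentation(maneuver="stop", target_speed=0.0)
    return IntermediateRepresentation(maneuver="keep_lane", target_speed=10.0)

def action_head(ir: IntermediateRepresentation, horizon: int = 5,
                dt: float = 0.5) -> List[Tuple[float, float]]:
    """Decode the intermediate representation into (x, y) waypoints."""
    return [(ir.target_speed * dt * t, 0.0) for t in range(1, horizon + 1)]

trajectory = action_head(vlm_processing("clear road ahead"))
```

The point of the intermediate representation is that it is inspectable: the symbolic plan can be logged or overridden before the action head commits to a trajectory, which is the interpretability argument for the modular design.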
- LLM as Planner
- VLM as Driving Explainer
- Modular VLA for AD
- End-to-End VLA for AD
- Augmented VLA for AD
Vision-Language-Action Models
📖 Legend
- LLC: Low-Level Control
- V2V: Vehicle-to-Vehicle
- Pre-training
- PDCE: Specific Loss Function
| Model | Time | Input | Dataset | Vision | LLM | Decoder | Output | Focus |
|---|---|---|---|---|---|---|---|---|
| DriveGPT-4 | 2023 | Single | BDD-X | CLIP | LLaMA-2 | - | LLC | Interpretable LLM, Mixed Fine-tuning |
| ADriver-I | 2023 | Single | nuScenes + Private | CLIP ViT | Vicuna-1.5 | - | LLC | Diffusion World Model, Vision-action Tokens |
| RAG-Driver | 2024 | Multi | BDD-X | CLIP ViT | Vicuna-1.5 | - | LLC | RAG Control, Textual Rationales |
| EMMA | 2024 | Multi + State | Waymo fleet | Gemini-VLM | Gemini | - | Multi. | MLLM Backbone, Multi-task Outputs |
| CoVLA-Agent | 2024 | Single + State | CoVLA Data | CLIP ViT | Vicuna-1.5 | - | Traj. | Text + Traj Outputs, Auto-labelled Data |
| OpenDriveVLA | 2025 | Multi | nuScenes | Custom Module | Qwen-2.5 | - | LLC+Traj. | 2-D/3-D Align, SOTA Planner |
| ORION | 2025 | Multi + History | nuScenes + CARLA | QT-Former | Vicuna-1.5 | - | Traj. | CoT Reasoning, Continuous Actions |
| DriveMoE | 2025 | Multi | Bench2Drive | Paligemma-3B | - | - | LLC | Mixture-of-Experts, Dynamic Routing |
| VaViM | 2025 | Video Frames | BDD100K + CARLA | LlamaGen | GPT-2 | - | Traj. | Video-token Pre-training, Vision to Action |
| DiffVLA | 2025 | Multi + State | Navsim-v2 | CLIP ViT | Vicuna-1.5 | - | Traj. | Mixed Diffusion, VLM Sampling |
| LangCoop | 2025 | Single + V2V | CARLA | GPT-4o | GPT-4o | - | LLC | Language-based V2V, High Bandwidth Cut |
| SimLingo | 2025 | Multi | CARLA + Bench2Drive | InternVL2 | Qwen-2 | - | LLC+Traj. | Enhanced VLM, Action-dreaming |
| SafeAuto | 2025 | Multi + State | BDD-X + DriveLM | CLIP ViT | Vicuna-1.5 | - | LLC | Traffic-Rule-Based, PDCE Loss |
| Impromptu-VLA | 2025 | Single | Impromptu Data | Qwen-2.5VL | Qwen-2.5VL | - | Traj. | Corner-case QA, NeuroNCAP SOTA |
| AutoVLA | 2025 | Multi + State | nuScenes + CARLA | Qwen-2.5VL | Qwen-2.5VL | - | LLC+Traj. | Adaptive Reasoning, Multi Benchmark |
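Several of the end-to-end models above have their LLM decoder emit actions as discrete tokens rather than raw floats. A hedged sketch of the idea, with an assumed bin count and coordinate range (not taken from any specific model in the table): each waypoint coordinate is quantized into one of a fixed set of bins, so trajectories become token sequences the LLM vocabulary can represent.

```python
# Toy action tokenizer: quantize a waypoint coordinate into a token id and
# back. N_BINS and the metre range are illustrative assumptions.
N_BINS = 256
X_MIN, X_MAX = -50.0, 50.0  # assumed working range, in metres

def coord_to_token(x: float) -> int:
    """Clamp to range, then map linearly onto [0, N_BINS - 1]."""
    x = min(max(x, X_MIN), X_MAX)
    return round((x - X_MIN) / (X_MAX - X_MIN) * (N_BINS - 1))

def token_to_coord(t: int) -> float:
    """Inverse mapping: token id back to the bin-centre coordinate."""
    return X_MIN + t / (N_BINS - 1) * (X_MAX - X_MIN)
```

The round-trip error is bounded by half a bin width (about 0.2 m with these settings), which is the resolution/vocabulary-size trade-off any tokenized action space has to make.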
Vision-Language-Action Datasets
Evaluation
Open Challenges
References
- A Survey on Vision-Language-Action Models for Autonomous Driving, arXiv, 2025-06