Survey
Contents
Evolution of VLA Model Architectures
graph TB
subgraph VLM["VLM as Driving Explainer"]
V1[🖼️ Vision<br/>Input] --> VLM1[🤖 VLMs<br/>Processing]
VLM1 --> E1[💬 Explain<br/>Q&A<br/>Description]
style V1 fill:#e3f2fd
style VLM1 fill:#fff3e0
style E1 fill:#fff8e1
end
subgraph Modular["Modular VLA for AD"]
V2[🖼️ Multimodal<br/>Vision] --> VLM2[🤖 VLMs<br/>Processing]
VLM2 --> IR[🔄 Intermediate<br/>Representation]
IR --> AH[⚙️ Action Head]
AH --> TC1[🎯 Trajectory<br/>Control]
style V2 fill:#e3f2fd
style VLM2 fill:#fff3e0
style IR fill:#f3e5f5
style AH fill:#e8f5e8
style TC1 fill:#fce4ec
end
subgraph EndToEnd["End-to-end VLA for AD"]
V3[🖼️ Multimodal<br/>Vision] --> VLM3[🤖 VLMs<br/>Processing]
VLM3 --> A1[🚗 Action<br/>Output]
style V3 fill:#e3f2fd
style VLM3 fill:#fff3e0
style A1 fill:#e0f2f1
end
subgraph Augmented["Augmented VLA for AD"]
V4[🖼️ Multimodal<br/>Vision] --> RT[🧠 Reasoning VLMs<br/>& Tool-use Agents]
RT --> A2[🚗 Action<br/>Output]
style V4 fill:#e3f2fd
style RT fill:#f3e5f5
style A2 fill:#e0f2f1
end
%% Evolution arrows
VLM -.-> Modular
Modular -.-> EndToEnd
EndToEnd -.-> Augmented
%% Style definitions
classDef titleStyle fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#333
class VLM,Modular,EndToEnd,Augmented titleStyle
Diagram notes:
- VLM as Driving Explainer: a frozen LLM describes the driving scene but produces no control output
- Modular VLA: language is converted into an intermediate representation, which an action head decodes into trajectories or low-level control
- End-to-end VLA: a single multimodal pipeline maps sensor inputs directly to actions
- Augmented VLA: tool-use or CoT VLMs add long-horizon reasoning while keeping the end-to-end control path
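The modular VLA pattern in the diagram above can be sketched in a few lines. This is a toy illustration, not any surveyed model's implementation: `vlm_processing`, `action_head`, and the maneuver/speed values are all hypothetical stand-ins for the VLM stage and the trajectory decoder.

```python
# Minimal sketch of the Modular VLA pipeline: the VLM stage emits an
# intermediate representation (a symbolic plan), and a separate action head
# decodes it into a trajectory. All names and values are illustrative.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntermediateRepresentation:
    maneuver: str          # e.g. "keep_lane", "stop"
    target_speed: float    # m/s

def vlm_processing(scene_description: str) -> IntermediateRepresentation:
    """Stand-in for the VLM: maps a scene description to a symbolic plan."""
    if "red light" in scene_description:
        return IntermediateRepresentation(maneuver="stop", target_speed=0.0)
    return IntermediateRepresentation(maneuver="keep_lane", target_speed=10.0)

def action_head(ir: IntermediateRepresentation, horizon: int = 5,
                dt: float = 0.5) -> List[Tuple[float, float]]:
    """Decode the intermediate representation into (x, y) waypoints."""
    return [(ir.target_speed * dt * t, 0.0) for t in range(1, horizon + 1)]

trajectory = action_head(vlm_processing("clear road ahead"))
```

The point of the intermediate representation is that it is inspectable: the symbolic plan can be logged or overridden before the action head commits to a trajectory, which is the interpretability argument for the modular design.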
- LLM as Planner
- VLM as Driving Explainer
- Modular VLA for AD
- End-to-End VLA for AD
- Augmented VLA for AD
Vision-Language-Action Models
📖 Legend
- LLC: Low-Level Control
- V2V: Vehicle-to-Vehicle
- Pre-training
- PDCE: Specific Loss Function
| Model | Time | Input | Dataset | Vision | LLM | Decoder | Output | Focus |
|---|---|---|---|---|---|---|---|---|
| DriveGPT-4 | 2023 | Single | BDD-X | CLIP | LLaMA-2 | - | LLC | Interpretable LLM, Mixed Fine-tuning |
| ADriver-I | 2023 | Single | nuScenes + Private | CLIP ViT | Vicuna-1.5 | - | LLC | Diffusion World Model, Vision-action Tokens |
| RAG-Driver | 2024 | Multi | BDD-X | CLIP ViT | Vicuna-1.5 | - | LLC | RAG Control, Textual Rationales |
| EMMA | 2024 | Multi + State | Waymo fleet | Gemini-VLM | Gemini | - | Multi. | MLLM Backbone, Multi-task Outputs |
| CoVLA-Agent | 2024 | Single + State | CoVLA Data | CLIP ViT | Vicuna-1.5 | - | Traj. | Text + Traj Outputs, Auto-labelled Data |
| OpenDriveVLA | 2025 | Multi | nuScenes | Custom Module | Qwen-2.5 | - | LLC+Traj. | 2-D/3-D Align, SOTA Planner |
| ORION | 2025 | Multi + History | nuScenes + CARLA | QT-Former | Vicuna-1.5 | - | Traj. | CoT Reasoning, Continuous Actions |
| DriveMoE | 2025 | Multi | Bench2Drive | Paligemma-3B | - | - | LLC | Mixture-of-Experts, Dynamic Routing |
| VaViM | 2025 | Video Frames | BDD100K + CARLA | LlamaGen | GPT-2 | - | Traj. | Video-token Pre-training, Vision to Action |
| DiffVLA | 2025 | Multi + State | Navsim-v2 | CLIP ViT | Vicuna-1.5 | - | Traj. | Mixed Diffusion, VLM Sampling |
| LangCoop | 2025 | Single + V2V | CARLA | GPT-4o | GPT-4o | - | LLC | Language-based V2V, High Bandwidth Cut |
| SimLingo | 2025 | Multi | CARLA + Bench2Drive | InternVL2 | Qwen-2 | - | LLC+Traj. | Enhanced VLM, Action-dreaming |
| SafeAuto | 2025 | Multi + State | BDD-X + DriveLM | CLIP ViT | Vicuna-1.5 | - | LLC | Traffic-Rule-Based, PDCE Loss |
| Impromptu-VLA | 2025 | Single | Impromptu Data | Qwen-2.5VL | Qwen-2.5VL | - | Traj. | Corner-case QA, NeuroNCAP SOTA |
| AutoVLA | 2025 | Multi + State | nuScenes + CARLA | Qwen-2.5VL | Qwen-2.5VL | - | LLC+Traj. | Adaptive Reasoning, Multi Benchmark |
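Several of the end-to-end models above have their LLM decoder emit actions as discrete tokens rather than raw floats. A hedged sketch of the idea, with an assumed bin count and coordinate range (not taken from any specific model in the table): each waypoint coordinate is quantized into one of a fixed set of bins, so trajectories become token sequences the LLM vocabulary can represent.

```python
# Toy action tokenizer: quantize a waypoint coordinate into a token id and
# back. N_BINS and the metre range are illustrative assumptions.
N_BINS = 256
X_MIN, X_MAX = -50.0, 50.0  # assumed working range, in metres

def coord_to_token(x: float) -> int:
    """Clamp to range, then map linearly onto [0, N_BINS - 1]."""
    x = min(max(x, X_MIN), X_MAX)
    return round((x - X_MIN) / (X_MAX - X_MIN) * (N_BINS - 1))

def token_to_coord(t: int) -> float:
    """Inverse mapping: token id back to the bin-centre coordinate."""
    return X_MIN + t / (N_BINS - 1) * (X_MAX - X_MIN)
```

The round-trip error is bounded by half a bin width (about 0.2 m with these settings), which is the resolution/vocabulary-size trade-off any tokenized action space has to make.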
Vision-Language-Action Datasets
Evaluation
Open Challenges
References
- A Survey on Vision-Language-Action Models for Autonomous Driving, arXiv, 2025-06