Contents

Survey

VLA Model Evolution

```mermaid
graph TB
subgraph VLM["VLM as Driving Explainer"]
    V1[🖼️ Vision<br/>Input] --> VLM1[🤖 VLMs<br/>Processing]
    VLM1 --> E1[💬 Explain<br/>Q&A<br/>Description]
    style V1 fill:#e3f2fd
    style VLM1 fill:#fff3e0
    style E1 fill:#fff8e1
end

subgraph Modular["Modular VLA for AD"]
    V2[🖼️ Multimodal<br/>Vision] --> VLM2[🤖 VLMs<br/>Processing]
    VLM2 --> IR[🔄 Intermediate<br/>Representation]
    IR --> AH[⚙️ Action Head]
    AH --> TC1[🎯 Trajectory<br/>Control]
    style V2 fill:#e3f2fd
    style VLM2 fill:#fff3e0
    style IR fill:#f3e5f5
    style AH fill:#e8f5e8
    style TC1 fill:#fce4ec
end

subgraph EndToEnd["End-to-end VLA for AD"]
    V3[🖼️ Multimodal<br/>Vision] --> VLM3[🤖 VLMs<br/>Processing]
    VLM3 --> A1[🚗 Action<br/>Output]
    style V3 fill:#e3f2fd
    style VLM3 fill:#fff3e0
    style A1 fill:#e0f2f1
end

subgraph Augmented["Augmented VLA for AD"]
    V4[🖼️ Multimodal<br/>Vision] --> RT[🧠 Reasoning VLMs<br/>& Tool-use Agents]
    RT --> A2[🚗 Action<br/>Output]
    style V4 fill:#e3f2fd
    style RT fill:#f3e5f5
    style A2 fill:#e0f2f1
end

%% Evolution arrows
VLM -.-> Modular
Modular -.-> EndToEnd
EndToEnd -.-> Augmented

%% Style definitions
classDef titleStyle fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#333
class VLM,Modular,EndToEnd,Augmented titleStyle
```

Diagram notes:

  1. VLM as Driving Explainer: a frozen LLM describes the driving scene but produces no control output
  2. Modular VLA: language is converted into an intermediate representation, which an action head turns into trajectories or low-level control
  3. End-to-end VLA: a single multimodal pipeline maps sensor input directly to actions
  4. Augmented VLA: tool-use or CoT VLMs add long-horizon reasoning while keeping the end-to-end control path
  • LLM as Planner
  • VLM as Driving Explainer
  • Modular VLA for AD
  • End-to-End VLA for AD
  • Augmented VLA for AD
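The four paradigms above differ mainly in their interfaces: what the VLM consumes and what it emits. A minimal Python sketch of those interfaces follows; all function names, types, and stub outputs are illustrative assumptions, not components from the survey.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = object                          # stand-in for camera frame(s)
Trajectory = List[Tuple[float, float]]  # (x, y) waypoints

@dataclass
class IntermediateRepresentation:
    """Language-level scene summary produced by the VLM (modular VLA)."""
    description: str

def vlm_as_explainer(frame: Image, question: str) -> str:
    """Paradigm 1: the VLM only describes or answers; no control output."""
    return f"scene answer for: {question}"  # placeholder caption/Q&A

def action_head(ir: IntermediateRepresentation) -> Trajectory:
    """Maps the language-level IR to waypoints (stub)."""
    return [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]

def modular_vla(frame: Image) -> Trajectory:
    """Paradigm 2: VLM -> intermediate representation -> action head."""
    ir = IntermediateRepresentation(description="slow vehicle ahead, clear left lane")
    return action_head(ir)

def end_to_end_vla(frame: Image) -> Trajectory:
    """Paradigm 3: one multimodal model maps sensors directly to actions."""
    return [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

def augmented_vla(frame: Image) -> Trajectory:
    """Paradigm 4: chain-of-thought / tool calls run before acting,
    but the control path stays end-to-end."""
    _plan = ["query map tool", "reason about occlusion"]  # long-horizon reasoning
    return end_to_end_vla(frame)
```

The point of the sketch is the signatures: only paradigm 1 returns text, paradigm 2 routes through an explicit intermediate representation, and paradigms 3 and 4 share the same sensors-to-actions type.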

Vision-Language-Action Models

📖 Legend

  • LLC: Low-Level Control
  • V2V: Vehicle-to-Vehicle
  • Traj.: Trajectory output
  • PDCE: Specific loss function (see SafeAuto)
| Model | Year | Input | Dataset | Vision | LLM | Decoder | Output | Focus |
|---|---|---|---|---|---|---|---|---|
| DriveGPT-4 | 2023 | Single | BDD-X | CLIP | LLaMA-2 | - | LLC | Interpretable LLM, Mixed Fine-tuning |
| ADriver-I | 2023 | Single | nuScenes + Private | CLIP ViT | Vicuna-1.5 | - | LLC | Diffusion World Model, Vision-action Tokens |
| RAG-Driver | 2024 | Multi | BDD-X | CLIP ViT | Vicuna-1.5 | - | LLC | RAG Control, Textual Rationales |
| EMMA | 2024 | Multi + State | Waymo fleet | Gemini-VLM | Gemini | - | Multi. | MLLM Backbone, Multi-task Outputs |
| CoVLA-Agent | 2024 | Single + State | CoVLA Data | CLIP ViT | Vicuna-1.5 | - | Traj. | Text + Traj Outputs, Auto-labelled Data |
| OpenDriveVLA | 2025 | Multi | nuScenes | Custom Module | Qwen-2.5 | - | LLC+Traj. | 2-D/3-D Align, SOTA Planner |
| ORION | 2025 | Multi + History | nuScenes + CARLA | QT-Former | Vicuna-1.5 | - | Traj. | CoT Reasoning, Continuous Actions |
| DriveMoE | 2025 | Multi | Bench2Drive | Paligemma-3B | - | - | LLC | Mixture-of-Experts, Dynamic Routing |
| VaViM | 2025 | Video Frames | BDD100K + CARLA | LlamaGen | GPT-2 | - | Traj. | Video-token Pre-training, Vision to Action |
| DiffVLA | 2025 | Multi + State | Navsim-v2 | CLIP ViT | Vicuna-1.5 | - | Traj. | Mixed Diffusion, VLM Sampling |
| LangCoop | 2025 | Single + V2V | CARLA | GPT-4o | GPT-4o | - | LLC | Language-based V2V, High Bandwidth Cut |
| SimLingo | 2025 | Multi | CARLA + Bench2Drive | InternVL2 | Qwen-2 | - | LLC+Traj. | Enhanced VLM, Action-dreaming |
| SafeAuto | 2025 | Multi + State | BDD-X + DriveLM | CLIP ViT | Vicuna-1.5 | - | LLC | Traffic-Rule-Based, PDCE Loss |
| Impromptu-VLA | 2025 | Single | Impromptu Data | Qwen-2.5VL | Qwen-2.5VL | - | Traj. | Corner-case QA, NeuroNCAP SOTA |
| AutoVLA | 2025 | Multi + State | nuScenes + CARLA | Qwen-2.5VL | Qwen-2.5VL | - | LLC+Traj. | Adaptive Reasoning, Multi Benchmark |
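The table's schema (model, year, input modality, dataset, vision encoder, LLM, output type, focus) can be captured as a small record type, which makes comparisons across rows easy to query. A minimal sketch with two rows transcribed from the table; the class and field names are assumptions, not from the survey.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VLAModelEntry:
    """One row of the model-comparison table (field names are illustrative)."""
    name: str
    year: int
    input_modality: str   # "Single", "Multi", "Multi + State", ...
    dataset: str
    vision_encoder: str
    llm: str
    output: str           # "LLC", "Traj.", "LLC+Traj.", ...
    focus: str

ENTRIES = [
    VLAModelEntry("DriveGPT-4", 2023, "Single", "BDD-X", "CLIP", "LLaMA-2",
                  "LLC", "Interpretable LLM, Mixed Fine-tuning"),
    VLAModelEntry("AutoVLA", 2025, "Multi + State", "nuScenes + CARLA",
                  "Qwen-2.5VL", "Qwen-2.5VL", "LLC+Traj.",
                  "Adaptive Reasoning, Multi Benchmark"),
]

# Example query: which of these models emit trajectory outputs?
traj_models = [e.name for e in ENTRIES if "Traj" in e.output]
```

Filtering on `output` like this mirrors how the survey groups models by whether they produce low-level control, trajectories, or both.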

Vision-Language-Action Datasets

Evaluation

Open Challenges

References

  • A Survey on Vision-Language-Action Models for Autonomous Driving, arXiv, 2025-06

Question