SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Published on Jun 18, 2025. Updated on Jan 15, 2026.

Motivation

Task.

Driving
Vision-Language Understanding
- Commentary: describe the current decision and action in natural language.
- VQA: answer a question about the simulated driving scene.
Action Dreaming (language-action alignment): a form of instruction following that improves the alignment between language and action.

Input.

image
navigational information:
ego’s speed
a task prompt with information about the current task
- Driving without language predictions: “Predict the waypoints”
- Commentary + Driving: “What should the ego do next?”
- VQA + Driving: “Q: ?>”
- Action Dreaming: “”

Output.

action representation: a disentangled representation
- temporal speed waypoints -> target speed
- geometric path waypoints -> target angle
language: auto-regressive token prediction

We noticed that:

using only temporal waypoints for steering and speed led to steering problems, especially during turns or when swerving around obstacles.
by using path waypoints, we achieve denser supervision.
we also predict the path when the vehicle is stationary, leading to improved steering behavior.
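
To make the disentangled output concrete, here is a minimal sketch of how a target speed and a target angle could be read off the two waypoint sets. The ego-centred coordinate frame (x forward), the waypoint spacing `dt`, and the `lookahead` distance are assumptions made for illustration; they are not taken from the paper.

```python
import math

def target_speed(speed_waypoints, dt):
    """Speed (m/s) implied by the first two temporal waypoints, spaced dt seconds apart."""
    (x0, y0), (x1, y1) = speed_waypoints[0], speed_waypoints[1]
    return math.hypot(x1 - x0, y1 - y0) / dt

def target_angle(path_waypoints, lookahead):
    """Heading (radians) towards the first path waypoint at least `lookahead` metres away."""
    for x, y in path_waypoints:
        if math.hypot(x, y) >= lookahead:
            return math.atan2(y, x)
    # Fall back to the last waypoint if the path is shorter than the lookahead.
    x, y = path_waypoints[-1]
    return math.atan2(y, x)

# Hypothetical values: temporal waypoints 0.5 s apart, path waypoints roughly 1 m apart.
speed = target_speed([(1.0, 0.0), (2.0, 0.0)], dt=0.5)
angle = target_angle([(1.0, 0.0), (2.0, 0.0), (3.0, 3.0)], lookahead=2.5)
```

When the vehicle is stationary, all temporal waypoints collapse onto the same point and carry no direction, whereas the path waypoints still describe where to steer. This is one way to read the observation above about predicting the path while standing still.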

Architecture. InternVL-2-1B.

Vision Encoder (pretrained): InternViT-300M-448px (ViT)
- To encode dynamic and higher resolutions, the input image is split into 448x448-pixel tiles, and features are extracted for each tile independently.
- To reduce the computational overhead caused by the LLM's quadratic complexity in the number of tokens, the pixel unshuffle technique is used to reduce the number of visual tokens by a factor of 4.
LLM (pretrained): Qwen2-0.5B-Instruct
- Navigational information
  - the next two GPS target points: encode with an MLP to obtain two navigational embeddings.
  - or a high-level language command: use the standard LLM tokenizer to obtain the embeddings.
  - during training, we randomly switch between the two input modalities.
- Speed v: use the speed v in natural language as part of the global LLM prompt.
Token Interleaver: replace the placeholder tokens with the corresponding embeddings.
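
The token reduction and the interleaving step can be sketched in plain Python. This is an illustrative toy, not the InternVL implementation: the placeholder names `<IMG>` and `<NAV>`, the `embed_text` callback, and the list-based feature grid are all hypothetical. Pixel unshuffle with factor 2 merges each 2x2 block of patch features into one token with 4 times as many channels, which gives the factor-of-4 reduction in token count.

```python
def pixel_unshuffle(grid, r=2):
    """Merge each r x r block of an H x W grid of feature vectors into one vector."""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

def interleave(tokens, embed_text, slots):
    """Replace placeholder tokens with precomputed embeddings, consumed in order.

    tokens: list of token strings; slots: dict mapping a placeholder string to
    the list of embeddings that should take its place.
    """
    queues = {name: list(embs) for name, embs in slots.items()}
    sequence = []
    for tok in tokens:
        if tok in queues:
            sequence.append(queues[tok].pop(0))
        else:
            sequence.append(embed_text(tok))
    return sequence

# Toy 4x4 grid with one channel per patch; 16 tokens become 4 tokens of 4 channels.
grid = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
small = pixel_unshuffle(grid)

# Hypothetical prompt with two image slots and one navigation slot.
sequence = interleave(
    ["<IMG>", "<IMG>", "speed", "<NAV>"],
    embed_text=lambda tok: [0.0],
    slots={"<IMG>": [[1.0], [2.0]], "<NAV>": [[9.0]]},
)
```

Assuming a patch size of 14 pixels, a 448x448 tile yields 32x32 = 1024 patch features, and pixel unshuffle with factor 2 leaves 16x16 = 256 tokens per tile.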

Training.

First, the model auto-regressively generates the language predictions (cross-entropy loss).
Then, in one additional forward pass, it generates the actions, consisting of path and speed waypoints (smooth-L1 loss).
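
For reference, the two loss terms named above can be written out in a few lines. This is a generic sketch of cross-entropy and smooth-L1 using their standard definitions; the `beta` threshold and the mean reduction are assumptions, and how the two terms are weighted against each other is not stated here.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under a softmax over logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

# Toy values: a two-token vocabulary and two waypoint coordinates.
language_loss = cross_entropy([0.0, 0.0], target_index=0)
action_loss = smooth_l1([0.5, 3.0], [0.0, 0.0])
```

Compared with a plain L2 loss, smooth-L1 penalises large waypoint errors only linearly, which makes it less sensitive to outliers.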

Inference. Two PID controllers convert the predicted target angle and target speed into the steering angle and acceleration.
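
A minimal sketch of the control step, assuming a textbook discrete PID controller. The gains, the time step, and the example values below are hypothetical placeholders, not the settings used by SimLingo.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One controller per axis; gains are illustrative only.
lateral = PID(kp=1.0, ki=0.0, kd=0.0)
longitudinal = PID(kp=1.0, ki=1.0, kd=0.0)

target_angle, target_speed, current_speed = 0.2, 5.0, 4.0  # hypothetical values
steer = lateral.step(target_angle, dt=0.5)
accel = longitudinal.step(target_speed - current_speed, dt=0.5)
```

In this sketch the lateral controller drives the target angle towards zero, while the longitudinal controller acts on the gap between the target speed and the current speed.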

Leaderboard 2.0:

Bench2Drive

DriveLM-hard (VQA) and Commentary

Action Dreaming