DRIVEVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

123 · arXiv · GitHub

Motivation

Contribution

Method

Figure: DriveVLM architecture

Input

  • A sequence of images

Output

Architecture

  • Vision Encoder
  • LLM
  • Vision-Language Adapter

Scene Description.

  • Environment Description
    • Weather: ranges from sunny to snowy, affecting visibility and traction
    • Time: distinguishes daytime from nighttime, since visibility changes affect driving strategy
    • Road types: urban roads, highways, and others, each posing different challenges
    • Lane conditions: covers the current lane position and possible maneuvers, which are crucial for safe driving decisions
  • Critical Object Identification
    • Category
    • 2D bounding box

Scene Analysis.

Hierarchical Reasoning.

DriveVLM-Dual

Experiment

References

Question