Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

Qwen Team, Alibaba Group · arXiv: 2409.12191 · GitHub: QwenLM/Qwen2-VL · Hugging Face: Qwen/Qwen2-VL

Motivation

Contribution

Method

Architecture

/posts/vlms/qwen2-vl/images/1764148249265.webp

Large Language Model (1.5B/7.6B/72B): the Qwen2-VL models are initialized with pre-trained weights from the Qwen2 series.

Vision Encoder (675M): Qwen2-VL employs the same ~675M-parameter Vision Transformer (ViT) as the visual encoder across the various-sized LLMs.

  • Naive Dynamic Resolution: to dynamically encode images of any aspect ratio and resolution into a variable number of visual tokens, Qwen2-VL modifies the ViT by replacing the original absolute position embedding with 2D-RoPE (a token-count sketch follows this list)
  • this removes the predecessor's constraint of a fixed resolution during both training and inference
  • a stride of 14 is used for the ViT encoder

/posts/vlms/qwen2-vl/images/1764231648101.webp
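
To make the scaling concrete, here is a minimal sketch of how the visual token count grows with input resolution, assuming a patch size of 14 and the 2×2 token merging described below; the helper function is hypothetical and ignores the exact rounding/resizing the released code applies.

```python
import math

def visual_token_count(height: int, width: int, patch_size: int = 14, merge_size: int = 2) -> int:
    """Rough estimate of the number of visual tokens for one image.

    The ViT splits the image into patch_size x patch_size patches, and the
    vision-language adapter then merges each merge_size x merge_size block of
    adjacent patch tokens into a single token fed to the LLM.
    """
    # Round the patch grid up so the whole image is covered (approximation).
    grid_h = math.ceil(height / patch_size)
    grid_w = math.ceil(width / patch_size)
    # 2x2 merging reduces the token count by a factor of merge_size**2.
    return math.ceil(grid_h / merge_size) * math.ceil(grid_w / merge_size)

# Example: a 224x224 image -> 16x16 patch grid -> 8x8 = 64 tokens after merging.
print(visual_token_count(224, 224))   # 64
# A wide 448x1344 image -> 32x96 = 3072 patches -> 768 tokens after merging.
print(visual_token_count(448, 1344))  # 768
```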

Vision-Language Adapter: a simple MLP layer that compresses adjacent 2×2 visual tokens into a single token, with the special <|vision_start|> and <|vision_end|> tokens placed at the beginning and end of the compressed visual token sequence (a rough sketch follows the bullet below).

  • a patch size of 14 is employed within the ViT
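
A rough PyTorch sketch of such a merger, assuming the four neighboring tokens are concatenated along the channel dimension before an MLP projects them to the LLM hidden size; the class name, layer layout, and dimensions (1280 for the ViT, 3584 for the 7B LLM) are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Compress each 2x2 block of adjacent ViT tokens into one LLM token."""

    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584, merge_size: int = 2):
        super().__init__()
        self.merge_size = merge_size
        in_dim = vit_dim * merge_size * merge_size  # 4 neighboring tokens concatenated
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.GELU(),
            nn.Linear(in_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # x: (grid_h * grid_w, vit_dim) patch tokens laid out row-major.
        d = x.shape[-1]
        m = self.merge_size
        x = x.view(grid_h // m, m, grid_w // m, m, d)        # split into 2x2 blocks
        x = x.permute(0, 2, 1, 3, 4).reshape(-1, m * m * d)  # concatenate each block's tokens
        return self.mlp(x)                                   # (grid_h * grid_w / 4, llm_dim)

tokens = torch.randn(32 * 32, 1280)   # e.g. a 448x448 image -> 32x32 patch grid
merged = PatchMerger()(tokens, 32, 32)
print(merged.shape)                   # torch.Size([256, 3584])
```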

Training Recipe

Training Infrastructure

Storage:

  • text data: stored on CPFS (Cloud Parallel File Storage) and accessed via mmap (memory mapping) for efficient reads (see the sketch after this list).
  • vision data: use OSS (Object Storage Service) for persistent storage.
    • video data decoding: a caching decoding technique was developed after several failed attempts with open-source software (FFmpeg) and in-house tools.
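
A minimal sketch of the mmap idea for the text shards, assuming pre-tokenized samples are stored as one flat binary file plus an offset index; the file names and layout (tokens.bin, offsets.npy) are invented for illustration.

```python
import mmap
import numpy as np

# Hypothetical layout: tokens.bin holds uint32 token ids back to back,
# offsets.npy holds the starting token index of each sample.
offsets = np.load("offsets.npy")

with open("tokens.bin", "rb") as f:
    # mmap lets every data-loader worker share the OS page cache instead of
    # copying the whole shard into each process's memory.
    buf = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    tokens = np.frombuffer(buf, dtype=np.uint32)

    def get_sample(i: int) -> np.ndarray:
        start, end = offsets[i], offsets[i + 1]
        return tokens[start:end]  # only these pages are actually read from disk

    print(get_sample(0)[:16])
```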

Parallelism:

  • use 3D parallelism, combining data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP)
  • leverage DeepSpeed's ZeRO-1 redundancy optimizer to shard optimizer states for memory savings (a config sketch follows this list)
  • sequence parallelism (SP) with selective activation checkpointing is leveraged to reduce memory usage
  • leverage a 1F1B (one-forward-one-backward) PP schedule for Qwen2-VL 72B training
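
A hedged sketch of what a ZeRO stage-1 setup can look like with DeepSpeed; the actual Qwen2-VL training configuration is not given in the paper excerpt above, so the batch sizes, precision, and optimizer settings below are placeholders.

```python
import deepspeed
import torch.nn as nn

# Stand-in module for illustration; the real model would be the Qwen2-VL network.
model = nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder values
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                        # ZeRO-1: shard optimizer states only
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-5},
    },
}

# Meant to be run under the deepspeed launcher (e.g. `deepspeed train.py`),
# which sets up the distributed environment.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```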

Software:

  • PyTorch
  • CUDA
  • flash-attention: for efficient attention computation in both the vision encoder and the LLM (see the loading example after this list)
  • fused operators: LayerNorm, RMSNorm, and Adam
  • communication and computation are overlapped during matrix multiplication in the training process
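
As a concrete illustration of the flash-attention dependency on the inference side, the released checkpoints can be loaded with FlashAttention-2 enabled via Hugging Face Transformers; this is the public inference path, not the internal training stack described above.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Requires the flash-attn package and an Ampere-or-newer GPU.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```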

Data Recipe

Ablation Study

Conclusion