<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>VLMs - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/vlms/</link><description>VLMs - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 24 Dec 2025 15:04:44 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/vlms/" rel="self" type="application/rss+xml"/><item><title>Vision Language Adapter</title><link>https://blog.omagiclee.com/posts/vlms/vision-language-adapter/</link><pubDate>Wed, 24 Dec 2025 15:04:44 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/vision-language-adapter/</guid><description><![CDATA[<h2 id="motivation">Motivation</h2>
<ul>
<li>Cross-modal alignment between the visual feature space and the text embedding space.</li>
<li>Visual feature compression, reducing the number of visual tokens fed to the LLM.</li>
</ul>
<h3 id="cross-attention">Cross Attention</h3>
<p>A randomly initialized, single-layer cross-attention module that compresses the visual feature sequence into a fixed number of trainable query embeddings; 2D sinusoidal position embeddings are added to the queries and keys to preserve spatial information (see the Qwen-VL code below).</p>
<ul>
<li>Qwen-VL
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span><span class="lnt">49
</span><span class="lnt">50
</span><span class="lnt">51
</span><span class="lnt">52
</span><span class="lnt">53
</span><span class="lnt">54
</span><span class="lnt">55
</span><span class="lnt">56
</span><span class="lnt">57
</span><span class="lnt">58
</span><span class="lnt">59
</span><span class="lnt">60
</span><span class="lnt">61
</span><span class="lnt">62
</span><span class="lnt">63
</span><span class="lnt">64
</span><span class="lnt">65
</span><span class="lnt">66
</span><span class="lnt">67
</span><span class="lnt">68
</span><span class="lnt">69
</span><span class="lnt">70
</span><span class="lnt">71
</span><span class="lnt">72
</span><span class="lnt">73
</span><span class="lnt">74
</span><span class="lnt">75
</span><span class="lnt">76
</span><span class="lnt">77
</span><span class="lnt">78
</span><span class="lnt">79
</span><span class="lnt">80
</span><span class="lnt">81
</span><span class="lnt">82
</span><span class="lnt">83
</span><span class="lnt">84
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># https://huggingface.co/Qwen/Qwen-VL/blob/main/visual.py</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_abs_pos</span><span class="p">(</span><span class="n">abs_pos</span><span class="p">,</span> <span class="n">tgt_size</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># abs_pos: L, C</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># tgt_size: M</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># return: M, C</span>
</span></span><span class="line"><span class="cl">    <span class="n">src_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">abs_pos</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="n">tgt_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">tgt_size</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">dtype</span> <span class="o">=</span> <span class="n">abs_pos</span><span class="o">.</span><span class="n">dtype</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">src_size</span> <span class="o">!=</span> <span class="n">tgt_size</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">interpolate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">abs_pos</span><span class="o">.</span><span class="n">float</span><span class="p">()</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">src_size</span><span class="p">,</span> <span class="n">src_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">tgt_size</span><span class="p">,</span> <span class="n">tgt_size</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">mode</span><span class="o">=</span><span class="s2">&#34;bicubic&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">align_corners</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">flatten</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">abs_pos</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Resampler</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">    A 2D perceiver-resampler network with one cross attention layers by
</span></span></span><span class="line"><span class="cl"><span class="s2">        (grid_size**2) learnable queries and 2d sincos pos_emb
</span></span></span><span class="line"><span class="cl"><span class="s2">    Outputs:
</span></span></span><span class="line"><span class="cl"><span class="s2">        A tensor with the shape of (grid_size**2, embed_dim)
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">grid_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">embed_dim</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">num_heads</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">kv_dim</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">norm_layer</span><span class="o">=</span><span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span>
</span></span><span class="line"><span class="cl">    <span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">num_queries</span> <span class="o">=</span> <span class="n">grid_size</span> <span class="o">**</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">num_heads</span> <span class="o">=</span> <span class="n">num_heads</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">torch</span><span class="o">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">get_2d_sincos_pos_embed</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">grid_size</span><span class="p">))</span><span class="o">.</span><span class="n">float</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">requires_grad_</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_queries</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">trunc_normal_</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mf">.02</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">kv_dim</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">kv_dim</span> <span class="o">!=</span> <span class="n">embed_dim</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">kv_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">attn</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">ln_q</span> <span class="o">=</span> <span class="n">norm_layer</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">ln_kv</span> <span class="o">=</span> <span class="n">norm_layer</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_init_weights</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_init_weights</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">trunc_normal_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mf">.02</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">)</span> <span class="ow">and</span> <span class="n">m</span><span class="o">.</span><span class="n">bias</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">bias</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">bias</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">pos_embed</span> <span class="o">=</span> <span class="n">get_abs_pos</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln_kv</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">N</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln_q</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">attn</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">_repeat</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">+</span> <span class="n">pos_embed</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attn_mask</span><span class="o">=</span><span class="n">attn_mask</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">out</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_repeat</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">query</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">repeat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div></li>
</ul>
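<p>For a sense of scale, a hedged instantiation of the module above. The concrete numbers (256 queries via <code>grid_size=16</code>, ViT-bigG features of width 1664, a 4096-wide LLM) are assumptions about the released Qwen-VL configuration, not facts stated in this post:</p>
<pre><code class="language-python"># Dimensions below are illustrative assumptions, not from this post.
resampler = Resampler(grid_size=16, embed_dim=4096, num_heads=32, kv_dim=1664)
vit_feats = torch.randn(2, 1024, 1664)  # (batch, num_patches, vit_dim)
tokens = resampler(vit_feats)           # (2, 256, 4096): any patch count maps to 256 tokens
</code></pre>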
<h3 id="torchview--mlplinear--gelu--linear">torch.view + MLP(Linear + GELU + Linear)</h3>
<p>A two-layer MLP (Linear + GELU + Linear) that compresses each adjacent 2x2 group of visual tokens into a single token: the four tokens are first concatenated channel-wise with <code>torch.view</code>, then projected into the LLM embedding space.</p>
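<p>A minimal sketch in the spirit of Qwen2-VL's patch merger (the class name, LayerNorm placement, and widths here are assumptions for illustration, not the released code):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Hypothetical sketch: fold each 2x2 neighborhood of visual tokens into
    one token, then project into the LLM embedding space with an MLP."""
    def __init__(self, vit_dim, llm_dim, merge_size=2):
        super().__init__()
        self.hidden = vit_dim * merge_size ** 2  # 4 tokens concatenated channel-wise
        self.ln = nn.LayerNorm(vit_dim)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, llm_dim),
        )

    def forward(self, x):
        # x: (num_tokens, vit_dim); tokens are assumed ordered so that each
        # consecutive group of four forms a 2x2 spatial neighborhood.
        return self.mlp(self.ln(x).view(-1, self.hidden))

merger = PatchMerger(vit_dim=1280, llm_dim=3584)  # widths illustrative
tokens = merger(torch.randn(1024, 1280))          # (256, 3584): 4x fewer tokens
</code></pre>
]]></description></item><item><title>Qwen3-VL Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/qwen3-vl/</link><pubDate>Thu, 27 Nov 2025 19:13:03 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen3-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>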
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2511.21631" target="_blank" rel="noopener noreffer ">arXiv 2511.21631</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen3-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen3-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen3-vl" target="_blank" rel="noopener noreffer ">Qwen/qwen3-vl</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h2 id="architecture">Architecture</h2>
<p><strong>Large Language Model</strong>: The Qwen3-VL model is initialized with pre-trained weights from Qwen3.</p>
<ul>
<li>four dense variants (Qwen3-VL-2B/4B/8B/32B) and two MoE variants (Qwen3-VL-30B-A3B, Qwen3-VL-235B-A22B)</li>
</ul>
<p><strong>Vision Encoder</strong>: SigLIP-2</p>
<p><strong>Vision-Language Adapter</strong>: a two-layer MLP that compresses each 2x2 group of visual features from the vision encoder into a single visual token.</p>]]></description></item><item><title>Summary: VLMs</title><link>https://blog.omagiclee.com/posts/vlms/summary/</link><pubDate>Wed, 26 Nov 2025 16:51:18 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/summary/</guid><description><![CDATA[<h2 id="vlm-tasks">VLM Tasks</h2>
<ul>
<li><strong>Image Captioning</strong>: generate a description for a given image</li>
<li><strong>General Visual Question Answering</strong>: answer questions based on the visual content of a given image.</li>
<li><strong>Text-oriented Visual Question Answering</strong>: Text-VQA is a specialized sub-task of VQA where answering questions critically depends on reading and comprehending text in a given image.
<ul>
<li>Multilingual Text Recognition and Understanding</li>
</ul>
</li>
<li><strong>Referring Expression Comprehension</strong></li>
<li><strong>Visual Grounding</strong></li>
<li><strong>Mathematical Reasoning</strong></li>
<li><strong>Video Understanding</strong></li>
<li><strong>Visual Agent</strong>
<ul>
<li>Function Calling</li>
<li>UI Operations/Games/Robotics/Navigation</li>
</ul>
</li>
</ul>
<h2 id="vlms-summary">VLMs Summary</h2>
<style>
table.vlm-comparison {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.9em;
  margin: 20px 0;
  border-top: 1px solid #ccc !important;
}
table.vlm-comparison th,
table.vlm-comparison td {
  padding: 6px 8px;
  vertical-align: middle;
  border: none !important;
  line-height: 1.3;
  white-space: nowrap;
}
table.vlm-comparison thead th {
  border: none !important;
  border-bottom: 1px solid #ccc !important;
  font-weight: bold;
  text-align: center;
  padding: 8px;
}
table.vlm-comparison thead th:first-child {
  width: 12%;
  border-right: 1px solid #ccc !important;
}
table.vlm-comparison thead th:nth-child(2) {
  width: 8%;
}
table.vlm-comparison thead th:nth-child(3),
table.vlm-comparison thead th:nth-child(4),
table.vlm-comparison thead th:nth-child(5) {
  width: 12%;
}
table.vlm-comparison thead tr:nth-child(2) th:nth-child(1) {
  border-right: none !important;
}
table.vlm-comparison thead th:nth-child(6) {
  width: 18%;
}
table.vlm-comparison thead th:nth-child(7) {
  width: 18%;
}
table.vlm-comparison tbody td {
  border: none !important;
  border-top: none !important;
  border-bottom: none !important;
  border-left: none !important;
  border-right: none !important;
  text-align: left;
  white-space: nowrap;
}
table.vlm-comparison tbody td:first-child {
  border-right: 1px solid #ccc !important;
  font-weight: 500;
}
table.vlm-comparison tbody tr:last-child td {
  border-bottom: 1px solid #ccc !important;
}
</style>
<table class="vlm-comparison">
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">Year</th>
      <th colspan="3">Model Architecture</th>
      <th rowspan="2">Training Recipe</th>
      <th rowspan="2">Data Recipe</th>
    </tr>
    <tr>
      <th>Vision Encoder</th>
      <th>Adapter</th>
      <th>LLM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>BLIP</strong></td>
      <td>2022.01</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>BLIP-2</strong></td>
      <td>2023.01</td>
      <td>-</td>
      <td>Q-Former</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>LLaVA</strong></td>
      <td>2023.04</td>
      <td>CLIP ViT-L/14</td>
      <td>Linear</td>
      <td>Vicuna</td>
      <td>Pre-training + Fine-tuning</td>
      <td>Image-text pairs</td>
    </tr>
    <tr>
      <td><strong>Qwen-VL</strong></td>
      <td>2023.08</td>
      <td>ViT-bigG</td>
      <td>Cross-attention</td>
      <td>Qwen</td>
      <td>Pre-training + SFT</td>
      <td>Image-text pairs</td>
    </tr>
    <tr>
      <td><strong>Qwen2-VL</strong></td>
      <td>2024.09</td>
      <td>ViT</td>
      <td>MLP</td>
      <td>Qwen2</td>
      <td>Pre-training + Post-training</td>
      <td>1.2T tokens</td>
    </tr>
    <tr>
      <td><strong>Qwen2.5-VL</strong></td>
      <td>2025.02</td>
      <td>ViT</td>
      <td>MLP</td>
      <td>Qwen2.5</td>
      <td>Pre-training + Post-training</td>
      <td>4T tokens</td>
    </tr>
    <tr>
      <td><strong>Qwen3-VL</strong></td>
<td>2025.11</td>
      <td>SigLIP-2</td>
      <td>MLP</td>
      <td>Qwen3</td>
      <td>Pre-training + Post-training</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>HunyuanOCR</strong></td>
      <td>2025.11</td>
<td>SigLIP-2</td>
      <td>Conv2d + MLP</td>
      <td>Hunyuan</td>
      <td>Multi-stage + RL</td>
      <td>200M image-text pairs</td>
    </tr>
  </tbody>
</table>]]></description></item><item><title>HunyuanOCR Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</link><pubDate>Sun, 16 Nov 2025 14:27:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Tencent Hunyuan Vision Team</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2511.19575" target="_blank" rel="noopener noreffer ">arXiv 2511.19575</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/Tencent-Hunyuan/HunyuanOCR" target="_blank" rel="noopener noreffer ">Tencent-Hunyuan/HunyuanOCR</a>
<a href="https://huggingface.co/tencent/HunyuanOCR" target="_blank" rel="noopener noreffer ">tencent/HunyuanOCR</a></p>
<h2 id="motivation">Motivation</h2>
<ul>
<li>Traditional OCR systems rely on a modularized pipeline architecture, typically including text detection, text recognition, document layout analysis, named entity recognition, and optional text translation modules; this inevitably results in cumulative error propagation and elevated deployment and maintenance overhead. -&gt; <strong>End-to-End</strong></li>
<li>While leading general VLMs (e.g., Gemini, Qwen-VL) deliver superior OCR performance, they often entail excessive computational overhead and high latency due to their massive parameter scales. -&gt; <strong>OCR-specific, lightweight (1B)</strong></li>
<li><strong>Unified multi-task modeling, including text spotting, document parsing, information extraction, visual question answering, and text image translation.</strong></li>
</ul>
<h2 id="method">Method</h2>
<h3 id="model-architecture">Model Architecture</h3>
]]></description></item><item><title>Qwen2.5-VL Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/qwen2.5-vl/</link><pubDate>Tue, 25 Feb 2025 20:38:06 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen2.5-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2502.13923" target="_blank" rel="noopener noreffer ">arXiv 2502.13923</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2.5-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2.5-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen25-vl" target="_blank" rel="noopener noreffer ">Qwen/qwen25-vl</a>
<i class="fab fa-blog fa-fw" aria-hidden="true"></i><a href="https://qwenlm.github.io/blog/qwen2.5-vl/" target="_blank" rel="noopener noreffer ">blog/qwen2.5-vl</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h2 id="architecture">Architecture</h2>
<p><strong>Large Language Model (3B/7B/72B)</strong>: The Qwen2.5-VL model is initialized with pre-trained weights from Qwen2.5.</p>
<ul>
<li>To better meet the demands of multimodal understanding, we have modified the 1D RoPE (Rotary Position Embedding) into our Multimodal Rotary Position Embedding Aligned to Absolute Time (M-RoPE); a sketch of the position-id construction follows this list.</li>
</ul>
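<p>As a minimal illustration of the idea (my own sketch, not code from the report): each position id becomes a (temporal, height, width) triple. Text tokens carry the same index in all three components, while vision tokens enumerate the patch grid; in the absolute-time variant, the temporal component would additionally be derived from real timestamps rather than frame indices, which this sketch omits.</p>
<pre><code class="language-python">import torch

def mrope_position_ids(num_text_tokens: int, t: int, h: int, w: int):
    """Illustrative M-RoPE position ids of shape (3, num_text_tokens + t*h*w);
    rows 0/1/2 hold the temporal/height/width components."""
    # Text tokens: all three components share the same 1D position.
    text = torch.arange(num_text_tokens).expand(3, -1)
    # Vision tokens: enumerate the (t, h, w) patch grid.
    tt = torch.arange(t).repeat_interleave(h * w)
    hh = torch.arange(h).repeat_interleave(w).repeat(t)
    ww = torch.arange(w).repeat(t * h)
    vision = torch.stack([tt, hh, ww]) + num_text_tokens  # continue after text
    return torch.cat([text, vision], dim=1)

ids = mrope_position_ids(num_text_tokens=5, t=1, h=2, w=2)
# ids[:, :5] is [0..4] in every row; the four patches share temporal id 5,
# with height ids {5, 5, 6, 6} and width ids {5, 6, 5, 6}.
</code></pre>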
<p><strong>Vision Encoder</strong>: The Qwen2.5-VL model employs a redesigned Vision Transformer (ViT) as its visual encoder.</p>]]></description></item><item><title>Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution</title><link>https://blog.omagiclee.com/posts/vlms/qwen2-vl/</link><pubDate>Wed, 25 Sep 2024 20:53:12 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen2-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2409.12191" target="_blank" rel="noopener noreffer ">arXiv 2409.12191</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen2-vl" target="_blank" rel="noopener noreffer ">Qwen/Qwen2-VL</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model (1.5B/7.6B/72B)</strong>: The Qwen2-VL model is initialized with pre-trained weights from the Qwen2 series.</p>
<p><strong>Vision Encoder (675M)</strong>: The Qwen2-VL model employs a constant 675M-parameter <strong>Vision Transformer (ViT)</strong> as its visual encoder across the various-sized LLMs.</p>]]></description></item><item><title>Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond</title><link>https://blog.omagiclee.com/posts/vlms/qwen-vl/</link><pubDate>Fri, 25 Aug 2023 20:53:14 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2308.12966" target="_blank" rel="noopener noreffer ">arXiv 2308.12966</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen-VL</a>
<a href="https://huggingface.co/Qwen/Qwen-VL" target="_blank" rel="noopener noreffer ">Qwen/Qwen-VL</a></p>
<h2 id="motivation">Motivation</h2>
<ul>
<li>Despite their powerful capabilities in text generation and in following users&rsquo; intentions via instruction tuning, LLMs natively lack the ability to handle other modalities (e.g., images, speech, and videos). -&gt; <strong>LVLM</strong></li>
<li>Current open-source LVLMs lag far behind the proprietary models, primarily due to inadequate training and optimization. -&gt; <strong>open-source</strong></li>
<li>The majority of open-source LVLMs are limited to coarse-grained perception, lacking the ability for fine-grained visual understanding such as object grounding, OCR, and text-oriented question answering. -&gt; <strong>fine-grained perception</strong></li>
</ul>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model (7.7B)</strong>: The Qwen-VL model is initialized with pre-trained weights from Qwen-7B.</p>]]></description></item><item><title>LLaVA: Visual Instruction Tuning</title><link>https://blog.omagiclee.com/posts/vlms/llava/</link><pubDate>Wed, 26 Apr 2023 20:22:53 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/llava/</guid><description><![CDATA[<p><i class="fas fa-award fa-fw" aria-hidden="true"></i><span style="color:gray">NeurIPS 2023 (Oral)</span>
<i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Microsoft Research</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2304.08485" target="_blank" rel="noopener noreffer ">arXiv 2304.08485</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/haotian-liu/LLaVA" target="_blank" rel="noopener noreffer ">haotian-liu/LLaVA</a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="https://llava-vl.github.io/" target="_blank" rel="noopener noreffer ">llava-vl.github.io</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model</strong>: Vicuna</p>
<p><strong>Vision Encoder</strong>: the pre-trained CLIP visual encoder ViT-L/14</p>
<p><strong>Adapter</strong>: While a simple linear layer is employed here, more sophisticated alternatives, such as gated cross-attention in Flamingo and the Q-Former in BLIP-2, could be substituted.</p>
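<p>Concretely, a minimal sketch of the linear adapter (the widths, 1024 for CLIP ViT-L/14 patch features and 4096 for a 7B-class Vicuna, are assumptions for illustration):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Widths are illustrative: CLIP ViT-L/14 patch features (1024) projected
# into the LLM embedding space (4096 for a 7B-class Vicuna).
projector = nn.Linear(1024, 4096)

patch_feats = torch.randn(1, 256, 1024)  # (batch, num_patches, vit_dim)
visual_tokens = projector(patch_feats)   # (1, 256, 4096), prepended to the text tokens
</code></pre>
]]></description></item></channel></rss>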