<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>VLMs - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/vlms/</link><description>VLMs - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 24 Dec 2025 15:04:44 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/vlms/" rel="self" type="application/rss+xml"/><item><title>Vision Language Adapter</title><link>https://blog.omagiclee.com/posts/vlms/vision-language-adapter/</link><pubDate>Wed, 24 Dec 2025 15:04:44 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/vision-language-adapter/</guid><description><![CDATA[<h2 id="motivation">Motivation</h2>
<ul>
<li>Cross-modal alignment between the visual feature space and the text embedding space.</li>
<li>Visual feature compression, reducing the number of visual tokens fed to the LLM.</li>
</ul>
<h3 id="cross-attention">Cross Attention</h3>
<p>A randomly initialized, single-layer cross-attention module that compresses the visual feature sequence into a fixed number of trainable query embeddings; 2D sinusoidal position embeddings are added to the queries and keys to preserve spatial information (see the Qwen-VL code below).</p>
<ul>
<li>Qwen-VL
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span><span class="lnt">49
</span><span class="lnt">50
</span><span class="lnt">51
</span><span class="lnt">52
</span><span class="lnt">53
</span><span class="lnt">54
</span><span class="lnt">55
</span><span class="lnt">56
</span><span class="lnt">57
</span><span class="lnt">58
</span><span class="lnt">59
</span><span class="lnt">60
</span><span class="lnt">61
</span><span class="lnt">62
</span><span class="lnt">63
</span><span class="lnt">64
</span><span class="lnt">65
</span><span class="lnt">66
</span><span class="lnt">67
</span><span class="lnt">68
</span><span class="lnt">69
</span><span class="lnt">70
</span><span class="lnt">71
</span><span class="lnt">72
</span><span class="lnt">73
</span><span class="lnt">74
</span><span class="lnt">75
</span><span class="lnt">76
</span><span class="lnt">77
</span><span class="lnt">78
</span><span class="lnt">79
</span><span class="lnt">80
</span><span class="lnt">81
</span><span class="lnt">82
</span><span class="lnt">83
</span><span class="lnt">84
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># https://huggingface.co/Qwen/Qwen-VL/blob/main/visual.py</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_abs_pos</span><span class="p">(</span><span class="n">abs_pos</span><span class="p">,</span> <span class="n">tgt_size</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># abs_pos: L, C</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># tgt_size: M</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># return: M, C</span>
</span></span><span class="line"><span class="cl">    <span class="n">src_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">abs_pos</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">    <span class="n">tgt_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">tgt_size</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">dtype</span> <span class="o">=</span> <span class="n">abs_pos</span><span class="o">.</span><span class="n">dtype</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">src_size</span> <span class="o">!=</span> <span class="n">tgt_size</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">interpolate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">abs_pos</span><span class="o">.</span><span class="n">float</span><span class="p">()</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">src_size</span><span class="p">,</span> <span class="n">src_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">tgt_size</span><span class="p">,</span> <span class="n">tgt_size</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">mode</span><span class="o">=</span><span class="s2">&#34;bicubic&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">align_corners</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">flatten</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">abs_pos</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Resampler</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">    A 2D perceiver-resampler network with one cross attention layers by
</span></span></span><span class="line"><span class="cl"><span class="s2">        (grid_size**2) learnable queries and 2d sincos pos_emb
</span></span></span><span class="line"><span class="cl"><span class="s2">    Outputs:
</span></span></span><span class="line"><span class="cl"><span class="s2">        A tensor with the shape of (grid_size**2, embed_dim)
</span></span></span><span class="line"><span class="cl"><span class="s2">    &#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">grid_size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">embed_dim</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">num_heads</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">kv_dim</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">norm_layer</span><span class="o">=</span><span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span>
</span></span><span class="line"><span class="cl">    <span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">num_queries</span> <span class="o">=</span> <span class="n">grid_size</span> <span class="o">**</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">num_heads</span> <span class="o">=</span> <span class="n">num_heads</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">torch</span><span class="o">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">get_2d_sincos_pos_embed</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">grid_size</span><span class="p">))</span><span class="o">.</span><span class="n">float</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">requires_grad_</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_queries</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="n">trunc_normal_</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mf">.02</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">kv_dim</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">kv_dim</span> <span class="o">!=</span> <span class="n">embed_dim</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">kv_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">attn</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">MultiheadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">ln_q</span> <span class="o">=</span> <span class="n">norm_layer</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">ln_kv</span> <span class="o">=</span> <span class="n">norm_layer</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_init_weights</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_init_weights</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">trunc_normal_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">std</span><span class="o">=</span><span class="mf">.02</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">)</span> <span class="ow">and</span> <span class="n">m</span><span class="o">.</span><span class="n">bias</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">bias</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">bias</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">nn</span><span class="o">.</span><span class="n">init</span><span class="o">.</span><span class="n">constant_</span><span class="p">(</span><span class="n">m</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">attn_mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">pos_embed</span> <span class="o">=</span> <span class="n">get_abs_pos</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">kv_proj</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln_kv</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">N</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln_q</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">attn</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">_repeat</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">pos_embed</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span> <span class="o">+</span> <span class="n">pos_embed</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">x</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">attn_mask</span><span class="o">=</span><span class="n">attn_mask</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">out</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_repeat</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">query</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">repeat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div></li>
</ul>
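<p>For a sense of scale, a hedged instantiation of the module above. The concrete numbers (256 queries via <code>grid_size=16</code>, ViT-bigG features of width 1664, a 4096-wide LLM) are assumptions about the released Qwen-VL configuration, not facts stated in this post:</p>
<pre><code class="language-python"># Dimensions below are illustrative assumptions, not from this post.
resampler = Resampler(grid_size=16, embed_dim=4096, num_heads=32, kv_dim=1664)
vit_feats = torch.randn(2, 1024, 1664)  # (batch, num_patches, vit_dim)
tokens = resampler(vit_feats)           # (2, 256, 4096): any patch count maps to 256 tokens
</code></pre>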
<h3 id="torchview--mlplinear--gelu--linear">torch.view + MLP(Linear + GELU + Linear)</h3>
<p>A two-layer MLP (Linear + GELU + Linear) that compresses each adjacent 2x2 group of visual tokens into a single token: the four tokens are first concatenated channel-wise with <code>torch.view</code>, then projected into the LLM embedding space.</p>
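<p>A minimal sketch in the spirit of Qwen2-VL's patch merger (the class name, LayerNorm placement, and widths here are assumptions for illustration, not the released code):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Hypothetical sketch: fold each 2x2 neighborhood of visual tokens into
    one token, then project into the LLM embedding space with an MLP."""
    def __init__(self, vit_dim, llm_dim, merge_size=2):
        super().__init__()
        self.hidden = vit_dim * merge_size ** 2  # 4 tokens concatenated channel-wise
        self.ln = nn.LayerNorm(vit_dim)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, llm_dim),
        )

    def forward(self, x):
        # x: (num_tokens, vit_dim); tokens are assumed ordered so that each
        # consecutive group of four forms a 2x2 spatial neighborhood.
        return self.mlp(self.ln(x).view(-1, self.hidden))

merger = PatchMerger(vit_dim=1280, llm_dim=3584)  # widths illustrative
tokens = merger(torch.randn(1024, 1280))          # (256, 3584): 4x fewer tokens
</code></pre>
]]></description></item><item><title>Qwen3-VL Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/qwen3-vl/</link><pubDate>Thu, 27 Nov 2025 19:13:03 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen3-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>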
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2511.21631" target="_blank" rel="noopener noreffer ">arXiv 2511.21631</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen3-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen3-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen3-vl" target="_blank" rel="noopener noreffer ">Qwen/qwen3-vl</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h2 id="architecture">Architecture</h2>
<p><strong>Large Language Model</strong>: The Qwen3-VL model is initialized with pre-trained weights from Qwen3.</p>
<ul>
<li>four dense variants (Qwen3-VL-2B/4B/8B/32B) and two MoE variants (Qwen3-VL-30B-A3B, Qwen3-VL-235B-A22B)</li>
</ul>
<p><strong>Vision Encoder</strong>: SigLIP-2</p>
<p><strong>Vision-Language Adapter</strong>: a two-layer MLP that compresses each 2x2 group of visual features from the vision encoder into a single visual token.</p>]]></description></item><item><title>Summary: VLMs</title><link>https://blog.omagiclee.com/posts/vlms/summary/</link><pubDate>Wed, 26 Nov 2025 16:51:18 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/summary/</guid><description><![CDATA[<h2 id="vlm-tasks">VLM Tasks</h2>
<ul>
<li><strong>Image Captioning</strong>: generate a description for a given image</li>
<li><strong>General Visual Question Answering</strong>: answer questions based on the visual content of a given image.</li>
<li><strong>Text-oriented Visual Question Answering</strong>: Text-VQA is a specialized sub-task of VQA where answering questions critically depends on reading and comprehending text in a given image.
<ul>
<li>Multilingual Text Recognition and Understanding</li>
</ul>
</li>
<li><strong>Referring Expression Comprehension</strong></li>
<li><strong>Visual Grounding</strong></li>
<li><strong>Mathematical Reasoning</strong></li>
<li><strong>Video Understanding</strong></li>
<li><strong>Visual Agent</strong>
<ul>
<li>Function Calling</li>
<li>UI Operations/Games/Robotics/Navigation</li>
</ul>
</li>
</ul>
<h2 id="vlms-summary">VLMs Summary</h2>
<style>
table.vlm-comparison {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.9em;
  margin: 20px 0;
  border-top: 1px solid #ccc !important;
}
table.vlm-comparison th,
table.vlm-comparison td {
  padding: 6px 8px;
  vertical-align: middle;
  border: none !important;
  line-height: 1.3;
  white-space: nowrap;
}
table.vlm-comparison thead th {
  border: none !important;
  border-bottom: 1px solid #ccc !important;
  font-weight: bold;
  text-align: center;
  padding: 8px;
}
table.vlm-comparison thead th:first-child {
  width: 12%;
  border-right: 1px solid #ccc !important;
}
table.vlm-comparison thead th:nth-child(2) {
  width: 8%;
}
table.vlm-comparison thead th:nth-child(3),
table.vlm-comparison thead th:nth-child(4),
table.vlm-comparison thead th:nth-child(5) {
  width: 12%;
}
table.vlm-comparison thead tr:nth-child(2) th:nth-child(1) {
  border-right: none !important;
}
table.vlm-comparison thead th:nth-child(6) {
  width: 18%;
}
table.vlm-comparison thead th:nth-child(7) {
  width: 18%;
}
table.vlm-comparison tbody td {
  border: none !important;
  border-top: none !important;
  border-bottom: none !important;
  border-left: none !important;
  border-right: none !important;
  text-align: left;
  white-space: nowrap;
}
table.vlm-comparison tbody td:first-child {
  border-right: 1px solid #ccc !important;
  font-weight: 500;
}
table.vlm-comparison tbody tr:last-child td {
  border-bottom: 1px solid #ccc !important;
}
</style>
<table class="vlm-comparison">
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">Year</th>
      <th colspan="3">Model Architecture</th>
      <th rowspan="2">Training Recipe</th>
      <th rowspan="2">Data Recipe</th>
    </tr>
    <tr>
      <th>Vision Encoder</th>
      <th>Adapter</th>
      <th>LLM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>BLIP</strong></td>
      <td>2022.01</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>BLIP-2</strong></td>
      <td>2023.01</td>
      <td>-</td>
      <td>Q-Former</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>LLaVA</strong></td>
      <td>2023.04</td>
      <td>CLIP ViT-L/14</td>
      <td>Linear</td>
      <td>Vicuna</td>
      <td>Pre-training + Fine-tuning</td>
      <td>Image-text pairs</td>
    </tr>
    <tr>
      <td><strong>Qwen-VL</strong></td>
      <td>2023.08</td>
      <td>ViT-bigG</td>
      <td>Cross-attention</td>
      <td>Qwen</td>
      <td>Pre-training + SFT</td>
      <td>Image-text pairs</td>
    </tr>
    <tr>
      <td><strong>Qwen2-VL</strong></td>
      <td>2024.09</td>
      <td>ViT</td>
      <td>MLP</td>
      <td>Qwen2</td>
      <td>Pre-training + Post-training</td>
      <td>1.2T tokens</td>
    </tr>
    <tr>
      <td><strong>Qwen2.5-VL</strong></td>
      <td>2025.02</td>
      <td>ViT</td>
      <td>MLP</td>
      <td>Qwen2.5</td>
      <td>Pre-training + Post-training</td>
      <td>4T tokens</td>
    </tr>
    <tr>
      <td><strong>Qwen3-VL</strong></td>
<td>2025.11</td>
      <td>SigLIP-2</td>
      <td>MLP</td>
      <td>Qwen3</td>
      <td>Pre-training + Post-training</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>HunyuanOCR</strong></td>
      <td>2025.11</td>
<td>SigLIP-2</td>
      <td>Conv2d + MLP</td>
      <td>Hunyuan</td>
      <td>Multi-stage + RL</td>
      <td>200M image-text pairs</td>
    </tr>
  </tbody>
</table>]]></description></item><item><title>HunyuanOCR Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</link><pubDate>Sun, 16 Nov 2025 14:27:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/hunyuan-ocr/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Tencent Hunyuan Vision Team</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2511.19575" target="_blank" rel="noopener noreffer ">arXiv 2511.19575</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/Tencent-Hunyuan/HunyuanOCR" target="_blank" rel="noopener noreffer ">Tencent-Hunyuan/HunyuanOCR</a>
<a href="https://huggingface.co/tencent/HunyuanOCR" target="_blank" rel="noopener noreffer ">tencent/HunyuanOCR</a></p>
<h2 id="motivation">Motivation</h2>
<ul>
<li>Traditional OCR systems rely on a modularized pipeline architecture, typically including text detection, text recognition, document layout analysis, named entity recognition, and optional text translation modules; this inevitably results in cumulative error propagation and elevated deployment and maintenance overhead. -&gt; <strong>End-to-End</strong></li>
<li>While leading general VLMs (e.g., Gemini, Qwen-VL) deliver superior OCR performance, they often entail excessive computational overhead and high latency due to their massive parameter scales. -&gt; <strong>OCR-specific, lightweight (1B)</strong></li>
<li><strong>Unified multi-task modeling, including text spotting, document parsing, information extraction, visual question answering, and text image translation.</strong></li>
</ul>
<h2 id="method">Method</h2>
<h3 id="model-architecture">Model Architecture</h3>
]]></description></item><item><title>Qwen2.5-VL Technical Report</title><link>https://blog.omagiclee.com/posts/vlms/qwen2.5-vl/</link><pubDate>Tue, 25 Feb 2025 20:38:06 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen2.5-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2502.13923" target="_blank" rel="noopener noreffer ">arXiv 2502.13923</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2.5-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2.5-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen25-vl" target="_blank" rel="noopener noreffer ">Qwen/qwen25-vl</a>
<i class="fab fa-blog fa-fw" aria-hidden="true"></i><a href="https://qwenlm.github.io/blog/qwen2.5-vl/" target="_blank" rel="noopener noreffer ">blog/qwen2.5-vl</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h2 id="architecture">Architecture</h2>
<p><strong>Large Language Model (3B/7B/72B)</strong>: The Qwen2.5-VL model is initialized with pre-trained weights from Qwen2.5.</p>
<ul>
<li>To better meet the demands of multimodal understanding, we have modified the 1D RoPE (Rotary Position Embedding) into our Multimodal Rotary Position Embedding Aligned to Absolute Time (M-RoPE); a sketch of the position-id construction follows this list.</li>
</ul>
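<p>As a minimal illustration of the idea (my own sketch, not code from the report): each position id becomes a (temporal, height, width) triple. Text tokens carry the same index in all three components, while vision tokens enumerate the patch grid; in the absolute-time variant, the temporal component would additionally be derived from real timestamps rather than frame indices, which this sketch omits.</p>
<pre><code class="language-python">import torch

def mrope_position_ids(num_text_tokens: int, t: int, h: int, w: int):
    """Illustrative M-RoPE position ids of shape (3, num_text_tokens + t*h*w);
    rows 0/1/2 hold the temporal/height/width components."""
    # Text tokens: all three components share the same 1D position.
    text = torch.arange(num_text_tokens).expand(3, -1)
    # Vision tokens: enumerate the (t, h, w) patch grid.
    tt = torch.arange(t).repeat_interleave(h * w)
    hh = torch.arange(h).repeat_interleave(w).repeat(t)
    ww = torch.arange(w).repeat(t * h)
    vision = torch.stack([tt, hh, ww]) + num_text_tokens  # continue after text
    return torch.cat([text, vision], dim=1)

ids = mrope_position_ids(num_text_tokens=5, t=1, h=2, w=2)
# ids[:, :5] is [0..4] in every row; the four patches share temporal id 5,
# with height ids {5, 5, 6, 6} and width ids {5, 6, 5, 6}.
</code></pre>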
<p><strong>Vision Encoder</strong>: The Qwen2.5-VL model employs a redesigned Vision Transformer (ViT) as its visual encoder.</p>]]></description></item><item><title>Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution</title><link>https://blog.omagiclee.com/posts/vlms/qwen2-vl/</link><pubDate>Wed, 25 Sep 2024 20:53:12 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen2-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2409.12191" target="_blank" rel="noopener noreffer ">arXiv 2409.12191</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2-VL</a>
<a href="https://huggingface.co/collections/Qwen/qwen2-vl" target="_blank" rel="noopener noreffer ">Qwen/Qwen2-VL</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model (1.5B/7.6B/72B)</strong>: The Qwen2-VL model is initialized with pre-trained weights from the Qwen2 series.</p>
<p><strong>Vision Encoder (675M)</strong>: The Qwen2-VL model employs a constant 675M-parameter <strong>Vision Transformer (ViT)</strong> as its visual encoder across the various-sized LLMs.</p>]]></description></item><item><title>Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond</title><link>https://blog.omagiclee.com/posts/vlms/qwen-vl/</link><pubDate>Fri, 25 Aug 2023 20:53:14 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/qwen-vl/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2308.12966" target="_blank" rel="noopener noreffer ">arXiv 2308.12966</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen-VL" target="_blank" rel="noopener noreffer ">QwenLM/Qwen-VL</a>
<a href="https://huggingface.co/Qwen/Qwen-VL" target="_blank" rel="noopener noreffer ">Qwen/Qwen-VL</a></p>
<h2 id="motivation">Motivation</h2>
<ul>
<li>Despite their powerful capabilities in text generation and in following users&rsquo; intentions via instruction tuning, LLMs natively lack the ability to handle other modalities (e.g., images, speech, and videos). -&gt; <strong>LVLM</strong></li>
<li>Current open-source LVLMs lag far behind the proprietary models, primarily due to inadequate training and optimization. -&gt; <strong>open-source</strong></li>
<li>The majority of open-source LVLMs are limited to coarse-grained perception, lacking the ability for fine-grained visual understanding such as object grounding, OCR, and text-oriented question answering. -&gt; <strong>fine-grained perception</strong></li>
</ul>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model (7.7B)</strong>: The Qwen-VL model is initialized with pre-trained weights from Qwen-7B.</p>]]></description></item><item><title>LLaVA: Visual Instruction Tuning</title><link>https://blog.omagiclee.com/posts/vlms/llava/</link><pubDate>Wed, 26 Apr 2023 20:22:53 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/vlms/llava/</guid><description><![CDATA[<p><i class="fas fa-award fa-fw" aria-hidden="true"></i><span style="color:gray">NeurIPS 2023 (Oral)</span>
<i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Microsoft Research</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2304.08485" target="_blank" rel="noopener noreffer ">arXiv 2304.08485</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/haotian-liu/LLaVA" target="_blank" rel="noopener noreffer ">haotian-liu/LLaVA</a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="https://llava-vl.github.io/" target="_blank" rel="noopener noreffer ">llava-vl.github.io</a></p>
<h2 id="motivation">Motivation</h2>
<h2 id="contribution">Contribution</h2>
<h2 id="method">Method</h2>
<h3 id="architecture">Architecture</h3>
<p><strong>Large Language Model</strong>: Vicuna</p>
<p><strong>Vision Encoder</strong>: the pre-trained CLIP visual encoder ViT-L/14</p>
<p><strong>Adapter</strong>: While a simple linear layer is employed here, more sophisticated alternatives, such as gated cross-attention in Flamingo and the Q-Former in BLIP-2, could be substituted.</p>
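<p>Concretely, a minimal sketch of the linear adapter (the widths, 1024 for CLIP ViT-L/14 patch features and 4096 for a 7B-class Vicuna, are assumptions for illustration):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Widths are illustrative: CLIP ViT-L/14 patch features (1024) projected
# into the LLM embedding space (4096 for a 7B-class Vicuna).
projector = nn.Linear(1024, 4096)

patch_feats = torch.randn(1, 256, 1024)  # (batch, num_patches, vit_dim)
visual_tokens = projector(patch_feats)   # (1, 256, 4096), prepended to the text tokens
</code></pre>
]]></description></item></channel></rss>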