<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>LLMs - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/llms/</link><description>LLMs - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 14 May 2025 11:26:22 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/llms/" rel="self" type="application/rss+xml"/><item><title>Qwen Series: Technical Summary</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-llm-summary/</link><pubDate>Wed, 14 May 2025 11:26:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-llm-summary/</guid><description><![CDATA[<h2 id="model-architecture">Model Architecture</h2>
<table class="comparison-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Attention Mechanism</th>
      <th>Positional Embedding</th>
      <th>Activation</th>
      <th>Normalization</th>
      <th>Context Length</th>
      <th>Embedding Strategy</th>
      <th>Key Changes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Qwen</strong></td>
      <td>Multi-head Attention (MHA)<br>Flash Attention</td>
      <td>RoPE (FP32 precision)</td>
      <td>SwiGLU<br>FFN: 8/3 × hidden size</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>2K</td>
      <td>Untied Embedding</td>
      <td>
        <ul>
          <li>QKV bias</li>
          <li>Flash Attention</li>
          <li>Untied Embedding</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Qwen2</strong></td>
      <td>GQA (Grouped Query Attention)<br>DCA with YARN</td>
      <td>RoPE<br>YARN extension</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(with YARN)</td>
      <td>Untied Embedding</td>
      <td>
        <ul>
          <li>GQA for efficient KV cache</li>
          <li>Dual Chunk Attention (DCA)</li>
          <li>YARN for long context extension</li>
          <li>QKV bias retained</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Qwen2.5</strong></td>
      <td>GQA</td>
      <td>RoPE<br>YARN extension</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(with YARN)</td>
      <td>Untied Embedding</td>
      <td>Same as Qwen2</td>
    </tr>
    <tr>
      <td><strong>Qwen3</strong></td>
      <td>GQA<br>QK-Norm</td>
      <td>RoPE<br>ABF + YARN</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(ABF + YARN)</td>
      <td>Untied Embedding<br>(varies by size)</td>
      <td>
        <ul>
          <li>Remove QKV-bias</li>
          <li>Introduce QK-Norm</li>
          <li>ABF for context extension</li>
          <li>MoE: 128 experts, 8 active</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
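<p>One of the attention changes in the table above is easy to make concrete. The snippet below is a minimal NumPy sketch of QK-Norm as it is commonly implemented for Qwen3-style attention: an RMSNorm over the head dimension of queries and keys, applied before the attention scores are computed. It is an illustration rather than Qwen3&rsquo;s actual code; the shapes, epsilon, and all-ones scale parameters are assumptions.</p>
<pre><code class="language-python">import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the last axis (head_dim), then apply a learned scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def qk_norm_attention_scores(q, k):
    # q, k: (batch, num_heads, seq_len, head_dim)
    # QK-Norm: RMS-normalize queries and keys per head before the dot product,
    # which keeps attention logits in a stable range during training.
    head_dim = q.shape[-1]
    gamma_q = np.ones(head_dim)   # learned parameters in a real model
    gamma_k = np.ones(head_dim)
    q = rms_norm(q, gamma_q)
    k = rms_norm(k, gamma_k)
    return q @ np.swapaxes(k, -1, -2) / np.sqrt(head_dim)

q = np.random.randn(1, 2, 4, 8)   # (batch, heads, seq, head_dim)
k = np.random.randn(1, 2, 4, 8)
print(qk_norm_attention_scores(q, k).shape)   # (1, 2, 4, 4)
</code></pre>
<p>The table notes that Qwen3 both removes the QKV bias and introduces QK-Norm; both changes are aimed at keeping the attention computation stable.</p>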
<h3 id="moe-architecture-qwen3">MoE Architecture (Qwen3)</h3>
<p>Qwen3 introduces MoE (Mixture-of-Experts) variants with significant architectural improvements:</p>]]></description></item><item><title>Qwen3 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-3/</link><pubDate>Wed, 14 May 2025 10:28:47 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-3/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2505.09388" target="_blank" rel="noopener noreffer ">arXiv 2505.09388</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen3" target="_blank" rel="noopener noreffer ">QwenLM/Qwen3</a>
<a href="https://huggingface.co/collections/Qwen/qwen3" target="_blank" rel="noopener noreffer ">Qwen/qwen3</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://qwen.readthedocs.io/en/latest/" target="_blank" rel="noopener noreffer ">Blog</a></p>
<h1 id="introduction">Introduction</h1>
<p>Post-trained models, such as Qwen3-30B-A3B, along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base), are now available on platforms like Hugging Face, ModelScope, and Kaggle.</p>
<h1 id="key-features">Key Features</h1>
<ul>
<li><strong>Hybrid Thinking Modes</strong>
<ul>
<li>Thinking Mode: In this mode, the model takes time to reason step by step before delivering the final answer. This is ideal for complex problems that require deeper thought.</li>
<li>Non-Thinking Mode: Here, the model provides quick, near-instant responses, suitable for simpler questions where speed is more important than depth (a usage sketch for switching between the two modes follows this list).
</li>
</ul>
</li>
<li><strong>Multilingual Support</strong>
<ul>
<li>Supports 119 languages and dialects</li>
</ul>
</li>
<li><strong>Improved Agentic Capabilities</strong>
<ul>
<li>We have optimized the Qwen3 models for coding and agentic capabilities, and strengthened support for <strong>MCP</strong> (Model Context Protocol).</li>
</ul>
</li>
</ul>
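<p>As a usage sketch for the hybrid thinking modes: recent Qwen3 model cards describe an <code>enable_thinking</code> switch on the Hugging Face chat template. The snippet below assumes that interface and a locally available checkpoint (the repository id and prompt are illustrative); treat it as a sketch rather than authoritative usage.</p>
<pre><code class="language-python">from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"   # assumed checkpoint id; any Qwen3 chat model should behave similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 30?"}]

# enable_thinking=True lets the model reason step by step before the final answer;
# set it to False for quick, non-thinking responses.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
</code></pre>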
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Qwen3 uses the Qwen2.5 BBPE tokenizer (vocabulary size 151,646: 151,624 regular tokens and 22 control tokens).</p>]]></description></item><item><title>Large Model API Development Guide</title><link>https://blog.omagiclee.com/posts/toolkits/manual-of-develop-large-model-api/</link><pubDate>Tue, 11 Feb 2025 10:17:16 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/toolkits/manual-of-develop-large-model-api/</guid><description><![CDATA[<h2 id="阿里云百炼平台">Alibaba Cloud Model Studio (Bailian) Platform</h2>
<p><a href="https://bailian.console.aliyun.com/cn-beijing/?spm=a2c4g.11186623.0.0.3bb0394e3JQHXT&amp;tab=api#/api/?type=model&amp;url=2712195" target="_blank" rel="noopener noreffer ">https://bailian.console.aliyun.com/cn-beijing/?spm=a2c4g.11186623.0.0.3bb0394e3JQHXT&tab=api#/api/?type=model&url=2712195</a></p>
<ol>
<li>Obtain an API key</li>
</ol>
<p><a href="https://bailian.console.aliyun.com/cn-beijing/?tab=model#/api-key" target="_blank" rel="noopener noreffer ">https://bailian.console.aliyun.com/cn-beijing/?tab=model#/api-key</a></p>
<ol start="2">
<li>Call the API with the key (a minimal call sketch follows the sample links below)</li>
</ol>
<ul>
<li>API Key</li>
<li>Base URL: <a href="https://dashscope.aliyuncs.com/compatible-mode/v1" target="_blank" rel="noopener noreffer ">https://dashscope.aliyuncs.com/compatible-mode/v1</a></li>
<li>Model Name: e.g., qwen3-max</li>
</ul>
<p>Samples:</p>
<ul>
<li><a href="https://help.aliyun.com/zh/model-studio/claude-code" target="_blank" rel="noopener noreffer ">https://help.aliyun.com/zh/model-studio/claude-code</a></li>
<li><a href="https://help.aliyun.com/zh/model-studio/openclaw?spm=a2c4g.11186623.help-menu-2400256.d_0_10_5.79ec69c3Dekn1K&amp;scm=20140722.H_3020785._.OR_help-T_cn~zh-V_1" target="_blank" rel="noopener noreffer ">https://help.aliyun.com/zh/model-studio/openclaw?spm=a2c4g.11186623.help-menu-2400256.d_0_10_5.79ec69c3Dekn1K&scm=20140722.H_3020785._.OR_help-T_cn~zh-V_1</a></li>
</ul>
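<p>Putting the three settings from step 2 together, a minimal call through the OpenAI-compatible endpoint looks roughly like the sketch below. It assumes the OpenAI Python SDK installed in step 4 and the <code>DASHSCOPE_API_KEY</code> environment variable configured in step 3; the prompt is illustrative.</p>
<pre><code class="language-python">import os
from openai import OpenAI

# Point the SDK at the compatible-mode Base URL and read the key from the environment.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3-max",   # model name from step 2
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)
</code></pre>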
<ol start="3">
<li>Configure the API key as an environment variable</li>
</ol>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-zsh" data-lang="zsh"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;export DASHSCOPE_API_KEY=&#39;YOUR_DASHSCOPE_API_KEY&#39;&#34;</span> &gt;&gt; ~/.zshrc
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.zshrc
</span></span></code></pre></td></tr></table>
</div>
</div><ol start="4">
<li>Install the OpenAI Python SDK</li>
</ol>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">pip3</span> <span class="n">install</span> <span class="o">-</span><span class="n">U</span> <span class="n">openai</span>
</span></span></code></pre></td></tr></table>
</div>
</div>]]></description></item><item><title>LLaMA 4: Next-Generation Open Language Models</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-4/</link><pubDate>Wed, 25 Dec 2024 17:45:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-4/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="" rel="">arXiv TBD</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/meta-llama/llama4" target="_blank" rel="noopener noreffer ">meta-llama/llama4</a>
<a href="" rel="">meta-llama/Meta-Llama-4</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/blog/meta-llama-4/" target="_blank" rel="noopener noreffer ">LLaMA 4</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 4 represents the latest generation of Meta&rsquo;s open language models, featuring significant improvements in reasoning, context handling, and multimodal capabilities. The models continue Meta&rsquo;s commitment to open-source AI research.</p>
<h2 id="motivation">Motivation</h2>
<p>LLaMA 4 builds upon the success of previous generations by:</p>
<ul>
<li>Advancing reasoning and problem-solving capabilities</li>
<li>Extending context length for better long-context understanding</li>
<li>Improving efficiency and scalability</li>
<li>Enhancing safety and alignment</li>
</ul>
<h2 id="key-innovations">Key Innovations</h2>
<ul>
<li><strong>Advanced Reasoning</strong>: Improved reasoning capabilities through enhanced training</li>
<li><strong>Extended Context</strong>: Support for longer context windows</li>
<li><strong>Efficiency Improvements</strong>: Better parameter efficiency and inference speed</li>
<li><strong>Safety Enhancements</strong>: Continued focus on safety and alignment</li>
</ul>
<h2 id="approach">Approach</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>LLaMA 4 features an evolved Transformer architecture:</p>]]></description></item><item><title>Qwen2.5 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-2.5/</link><pubDate>Thu, 19 Dec 2024 10:28:47 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-2.5/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2412.15115" target="_blank" rel="noopener noreffer ">arXiv 2412.15115</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<a href="" rel=""></a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="" rel=""></a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Based on the Qwen BBPE tokenizer, Qwen2.5 keeps a vocabulary of 151,646 tokens (151,624 regular and 22 control), expanding the control tokens from 3 to 22: two tool-related tokens and 20 reserved for other model capabilities.</p>
<h3 id="model-architecture">Model Architecture</h3>
<h4 id="dense-model">Dense Model</h4>
<ul>
<li>GQA for efficient KV cache utilization</li>
<li>SwiGLU for activation (a small sketch follows this list)</li>
<li>RoPE for positional embedding</li>
<li>QKV bias for attention</li>
<li>RMSNorm and pre-normalization for training stability</li>
</ul>
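<p>As a concrete reference for the activation choice above, here is a minimal NumPy sketch of a SwiGLU feed-forward block: a SiLU-gated product of two up-projections followed by a down-projection. The weight shapes are toy values chosen only to show the roughly 8/3 hidden-to-intermediate ratio mentioned for the Qwen FFN; this is not Qwen2.5&rsquo;s actual code.</p>
<pre><code class="language-python">import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # x: (seq_len, hidden); w_gate, w_up: (hidden, intermediate); w_down: (intermediate, hidden)
    # The SiLU-activated gate branch multiplies the linear up branch element-wise,
    # and the product is projected back to the hidden size.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

hidden, intermediate = 8, 22   # toy sizes, roughly the 8/3 ratio
x = np.random.randn(3, hidden)
w_gate = np.random.randn(hidden, intermediate)
w_up = np.random.randn(hidden, intermediate)
w_down = np.random.randn(intermediate, hidden)
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)   # (3, 8)
</code></pre>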
<h4 id="mixture-of-experts-moe-model">Mixture-of-Experts (MoE) Model</h4>
<h3 id="pre-training">Pre-training</h3>
<h4 id="pre-training-data">Pre-training Data</h4>
<ul>
<li><strong>Better data filtering</strong>: pre-training data is quality-filtered with the Qwen2-Instruct model</li>
<li><strong>Better math and code data</strong>: incorporate high-quality domain-specific datasets (math, code) during pretraining.</li>
<li><strong>Better synthetic data</strong>:
<ul>
<li>leverage both Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct to generate high-quality synthetic data, particularly in the mathematics, code, and knowledge domains.</li>
<li>further enhance the quality of synthesized data through rigorous filtering with a proprietary general reward model and the specialized Qwen2-Math-RM-72B model.</li>
</ul>
</li>
<li><strong>Better data mixture</strong>:
<ul>
<li>Domains like e-commerce, social media, and entertainment are significantly overrepresented in web-scale data, often containing repetitive, template-based, or machine-generated content.</li>
<li>Domains such as technology, science, and academic research, while containing higher-quality information, are traditionally underrepresented.</li>
<li>Qwen2.5 therefore down-samples overrepresented domains and up-samples high-value domains (a small reweighting sketch follows this list).</li>
</ul>
</li>
</ul>
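<p>To make the last point concrete, the toy sketch below computes per-domain resampling weights from a raw share and a target share. The domain names and every number are invented for illustration; the report does not publish its actual mixture weights.</p>
<pre><code class="language-python"># Toy illustration of down-/up-sampling domains when building a pre-training mixture.
# Raw shares stand in for what a web crawl yields; target shares encode the preference
# for higher-value domains. All numbers here are made up.
raw_share = {"e-commerce": 0.30, "social": 0.25, "entertainment": 0.20,
             "technology": 0.10, "science": 0.10, "academic": 0.05}
target_share = {"e-commerce": 0.10, "social": 0.10, "entertainment": 0.10,
                "technology": 0.25, "science": 0.25, "academic": 0.20}

# Per-document sampling weight = target share / raw share, so overrepresented domains
# get a weight below 1 (down-sampled) and high-value domains above 1 (up-sampled).
sampling_weight = {d: target_share[d] / raw_share[d] for d in raw_share}
for domain, w in sorted(sampling_weight.items(), key=lambda kv: kv[1]):
    print(f"{domain:13s} weight = {w:.2f}")
</code></pre>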
<h2 id="experiments">Experiments</h2>
<h2 id="references">References</h2>]]></description></item><item><title>Qwen2 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-2/</link><pubDate>Mon, 15 Jul 2024 10:28:43 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-2/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2407.10671" target="_blank" rel="noopener noreffer ">arXiv 2407.10671</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2</a>
<a href="https://huggingface.co/collections/Qwen/qwen2" target="_blank" rel="noopener noreffer ">Qwen/qwen2</a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Identical to Qwen, the tokenizer uses byte-level byte-pair encoding (BBPE) with a total vocabulary size of 151,646, consisting of 151,643 regular tokens and 3 control tokens.</p>
<h3 id="model-architecture">Model Architecture</h3>
<style>
.grouped-table.architecture {
  table-layout: auto;
  width: auto;
  margin-left: 0;
  margin-right: auto;
}
.grouped-table.architecture th,
.grouped-table.architecture td {
  padding: 3px 5px;
  white-space: nowrap;
}
.grouped-table.architecture td:first-child {
  white-space: normal;
  padding-right: 10px;
}
.grouped-table.architecture th:not(:first-child),
.grouped-table.architecture td:not(:first-child) {
  text-align: center;
  padding-left: 5px;
  padding-right: 5px;
}
</style>
<table class="grouped-table architecture">
  <thead>
    <tr>
      <th>Configuration</th>
      <th>0.5B</th>
      <th>1.5B</th>
      <th>7B</th>
      <th>72B</th>
      <th>57B-A14B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Hidden Size</strong></td>
      <td>896</td>
      <td>1,536</td>
      <td>3,584</td>
      <td>8,192</td>
      <td>3,584</td>
    </tr>
    <tr>
      <td><strong># Layers</strong></td>
      <td>24</td>
      <td>28</td>
      <td>28</td>
      <td>80</td>
      <td>28</td>
    </tr>
    <tr>
      <td><strong># Query Heads</strong></td>
      <td>14</td>
      <td>12</td>
      <td>28</td>
      <td>64</td>
      <td>28</td>
    </tr>
    <tr>
      <td><strong># KV Heads</strong></td>
      <td>2</td>
      <td>2</td>
      <td>4</td>
      <td>8</td>
      <td>4</td>
    </tr>
    <tr>
      <td><strong>Head Size</strong></td>
      <td>64</td>
      <td>128</td>
      <td>128</td>
      <td>128</td>
      <td>128</td>
    </tr>
    <tr>
      <td><strong>Intermediate Size</strong></td>
      <td>4,864</td>
      <td>8,960</td>
      <td>18,944</td>
      <td>29,568</td>
      <td>2,560</td>
    </tr>
    <tr>
      <td><strong># Routed Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>64</td>
    </tr>
    <tr>
      <td><strong># Activated Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>8</td>
    </tr>
    <tr>
      <td><strong># Shared Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>8</td>
    </tr>
    <tr>
      <td><strong>Embedding Tying</strong></td>
      <td>True</td>
      <td>True</td>
      <td>False</td>
      <td>False</td>
      <td>False</td>
    </tr>
    <tr>
      <td><strong>Vocabulary Size</strong></td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
    </tr>
    <tr>
      <td><strong># Trained Tokens</strong></td>
      <td>12T</td>
      <td>7T</td>
      <td>7T</td>
      <td>7T</td>
      <td>4.5T</td>
    </tr>
  </tbody>
</table>
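<p>The configuration table is enough to back out rough parameter counts for the dense models. The sketch below assumes GQA attention projections, a three-matrix SwiGLU MLP, and an extra output head when embeddings are untied; it ignores biases and normalization weights, so the results are approximations rather than official figures (the 57B-A14B MoE variant is omitted because its intermediate size is per expert).</p>
<pre><code class="language-python">def approx_params(hidden, layers, q_heads, kv_heads, head_size, inter, vocab, tied):
    attn_dim = q_heads * head_size
    # Q and O projections: hidden x attn_dim each; K and V projections: hidden x (kv_heads * head_size) each.
    attn = 2 * hidden * attn_dim + 2 * hidden * kv_heads * head_size
    # SwiGLU MLP: gate and up projections (hidden x inter) plus a down projection (inter x hidden).
    mlp = 3 * hidden * inter
    # Token embedding, plus a separate output head when embedding tying is off.
    emb = vocab * hidden * (1 if tied else 2)
    return layers * (attn + mlp) + emb

# (hidden, layers, q_heads, kv_heads, head_size, intermediate, vocab, tied) from the table above.
configs = {
    "0.5B": (896, 24, 14, 2, 64, 4864, 151646, True),
    "1.5B": (1536, 28, 12, 2, 128, 8960, 151646, True),
    "7B": (3584, 28, 28, 4, 128, 18944, 151646, False),
    "72B": (8192, 80, 64, 8, 128, 29568, 151646, False),
}
for name, cfg in configs.items():
    print(f"Qwen2-{name}: ~{approx_params(*cfg) / 1e9:.2f}B parameters")
</code></pre>
<p>The estimates land close to the nominal sizes (roughly 0.49B, 1.5B, 7.6B, and 72.7B), which is a useful sanity check on the table.</p>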
<h4 id="dense-model">Dense Model</h4>
<ul>
<li><strong>Grouped Query Attention (GQA)</strong>: GQA replaces conventional multi-head attention (MHA), shrinking the KV cache during inference and significantly improving throughput (a small sketch follows this list).</li>
<li><strong>Dual Chunk Attention (DCA) with YARN</strong>: DCA splits long sequences into chunks of manageable length and captures relative positions within and across chunks; combined with YARN's rescaling of attention weights, this extends the effective context window.</li>
<li>Moreover, Qwen2 follows Qwen in using SwiGLU (Dauphin et al., 2017) as the activation function, Rotary Positional Embeddings (RoPE, Su et al., 2024) for positional encoding, QKV bias (Su, 2023) in attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.</li>
</ul>
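<p>A minimal sketch of the GQA computation referenced above: queries keep the full number of heads while keys (and values) use far fewer, and each KV head is shared by a group of query heads when scores are computed. This is an illustrative NumPy version rather than Qwen2&rsquo;s implementation; the shapes mirror the 7B column of the table (28 query heads, 4 KV heads, head size 128).</p>
<pre><code class="language-python">import numpy as np

def gqa_scores(q, k, num_q_heads, num_kv_heads):
    # q: (seq, num_q_heads, head_dim); k: (seq, num_kv_heads, head_dim)
    # Each group of (num_q_heads // num_kv_heads) query heads shares one KV head,
    # so the cached K/V tensors are num_kv_heads / num_q_heads times the size of MHA's.
    group = num_q_heads // num_kv_heads
    k_expanded = np.repeat(k, group, axis=1)           # (seq, num_q_heads, head_dim)
    head_dim = q.shape[-1]
    return np.einsum("qhd,khd->hqk", q, k_expanded) / np.sqrt(head_dim)

seq, q_heads, kv_heads, head_dim = 16, 28, 4, 128      # Qwen2-7B-style shapes
q = np.random.randn(seq, q_heads, head_dim)
k = np.random.randn(seq, kv_heads, head_dim)
print(gqa_scores(q, k, q_heads, kv_heads).shape)       # (28, 16, 16)
</code></pre>
<p>With these shapes the KV cache per layer shrinks by a factor of 7 relative to full multi-head attention, which is the throughput benefit mentioned above.</p>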
<h4 id="mixture-of-experts-moe-model">Mixture-of-Experts (MoE) Model</h4>
<style>
.grouped-table {
  width: 100%;
  font-size: 0.8em;
  border-collapse: collapse;
  margin: 1.5rem 0;
}
.grouped-table th,
.grouped-table td {
  padding: 8px 12px;
  border: 1px solid #ddd;
  text-align: center;
  vertical-align: middle;
  color: #000;
}
.grouped-table .section-header td {
  background: #f0f0f0;
  color: #000;
  font-weight: bold;
  text-align: center;
  padding: 12px;
  font-size: 1.05em;
}
.grouped-table td:first-child {
  font-weight: 600;
  background: #f8f9fa;
  color: #000;
}
.grouped-table ul {
  margin: 0;
  padding-left: 15px;
  text-align: left;
  font-size: 0.95em;
  color: #000;
}
[theme=dark] .grouped-table th,
[theme=dark] .grouped-table td {
  border-color: #444;
  color: #fff;
}
[theme=dark] .grouped-table .section-header td {
  background: #2a2a2a;
  color: #fff;
}
[theme=dark] .grouped-table td:first-child {
  background: #2a2a2a;
  color: #fff;
}
[theme=dark] .grouped-table ul {
  color: #fff;
}
</style>
<table class="grouped-table">
  <thead>
    <tr>
      <th rowspan="2">Stages</th>
      <th rowspan="2">Pre-training</th>
      <th rowspan="2">SFT</th>
      <th rowspan="2">Reinforcement Learning</th>
    </tr>
  </thead>
<!-- Group 1: Hyperparameters -->
  <tbody>
    <tr class="section-header">
      <td colspan="4">Hyperparameters</td>
    </tr>
    <tr>
      <td><strong>Purpose</strong></td>
      <td style="white-space: nowrap;">Language Foundations & World Knowledge</td>
      <td>Chat-style Alignment & Instruction Following</td>
      <td>Human Preference Alignment</td>
    </tr>
    <tr>
      <td><strong>Training Objective</strong></td>
      <td colspan="2">Next-token prediction</td>
      <td>Reward Maximization (PPO)</td>
    </tr>
    <tr>
      <td><strong>Vocabulary Size</strong></td>
      <td>151,643 regular tokens and 3 control tokens</td>
      <td></td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Optimizer</strong></td>
      <td colspan="2">AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Learning Rate</strong></td>
      <td>Cosine schedule (peak → 10% peak)</td>
      <td>7×10⁻⁶ → 7×10⁻⁷ (linear decay)</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Precision</strong></td>
      <td colspan="2">BFloat16 mixed precision</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Batch Size</strong></td>
      <td></td>
      <td>128</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Training Epochs</strong></td>
      <td></td>
      <td>2</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Weight Decay</strong></td>
      <td></td>
      <td>0.1</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Gradient Clipping</strong></td>
      <td></td>
      <td>1.0</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Context Length</strong></td>
      <td>2048</td>
      <td>32,768</td>
      <td></td>
    </tr>
  </tbody>
<!-- Group 3: Data -->
  <tbody>
    <tr class="section-header">
      <td colspan="4">Data</td>
    </tr>
    <tr>
      <td><strong>Training Corpus</strong></td>
      <td>7T tokens</td>
      <td>500,000+ instruction examples<br>(instruction following, coding, mathematics,<br>logical reasoning, role-playing, multilingualism, safety)</td>
      <td>-</td>
    </tr>
  </tbody>
</table>
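<p>The optimizer rows of the table translate directly into a configuration sketch. The snippet below wires up AdamW with the listed betas and epsilon, a cosine schedule that decays to 10% of the peak rate (the pre-training schedule), and the weight decay and gradient clipping values listed for SFT. The peak learning rate, step counts, and placeholder model are invented for illustration; this is not the Qwen2 training code.</p>
<pre><code class="language-python">import math
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the actual transformer
peak_lr, total_steps, warmup_steps = 3e-4, 10_000, 500   # illustrative values

# AdamW settings from the table: beta1=0.9, beta2=0.95, eps=1e-8; weight decay 0.1 (SFT column).
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

def lr_factor(step):
    # Linear warmup to the peak, then cosine decay down to 10% of the peak.
    if step >= warmup_steps:
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))
    return step / max(1, warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Inside the training loop, clip gradients at 1.0 (per the table) before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
</code></pre>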
<h3 id="pre-training">Pre-training</h3>
<h4 id="pre-training-data">Pre-training Data</h4>
<ul>
<li>A 7-trillion-token, quality-filtered pre-training dataset</li>
<li>An attempt to further relax the quality threshold produced a larger 12-trillion-token dataset.</li>
</ul>
<p>All Qwen2 dense models except Qwen2-0.5B were pre-trained on this large-scale dataset of over 7 trillion tokens, while Qwen2-0.5B was pre-trained on the 12-trillion-token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.</p>]]></description></item><item><title>LLaMA 3: The Most Capable Openly Available LLM to Date</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-3/</link><pubDate>Thu, 18 Apr 2024 17:45:20 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-3/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2404.14219" target="_blank" rel="noopener noreffer ">arXiv 2404.14219</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/meta-llama/llama3" target="_blank" rel="noopener noreffer ">meta-llama/llama3</a>
<a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" target="_blank" rel="noopener noreffer ">meta-llama/Meta-Llama-3-8B</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/blog/meta-llama-3/" target="_blank" rel="noopener noreffer ">LLaMA 3</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 3 represents a significant advancement in open-source language models, featuring improved reasoning capabilities, extended context length (8K tokens), and a new tokenizer with 128K vocabulary. The initial release includes 8B and 70B parameter models, with larger models planned.</p>
<h2 id="motivation">Motivation</h2>
<p>LLaMA 3 aims to push the boundaries of open-source language models by:</p>]]></description></item><item><title>Qwen Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen/</link><pubDate>Thu, 28 Sep 2023 10:28:38 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2309.16609" target="_blank" rel="noopener noreffer ">arXiv 2309.16609</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen" target="_blank" rel="noopener noreffer ">QwenLM/Qwen</a>
<a href="https://huggingface.co/collections/Qwen/qwen" target="_blank" rel="noopener noreffer ">Qwen/qwen</a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<ul>
<li><strong>Qwen</strong>: the base pretrained language models</li>
<li><strong>Qwen-Chat</strong>: the chat models fine-tuned with human alignment techniques (RLHF)</li>
<li><strong>Code-Qwen</strong>: coding-specialized base models</li>
<li><strong>Code-Qwen-Chat</strong>: coding-specialized chat models</li>
<li><strong>Math-Qwen-Chat</strong>: mathematics-focused chat models</li>
</ul>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<ul>
<li><strong>Tokenizer</strong>: tiktoken (BBPE)</li>
<li><strong>Base Vocabulary</strong>: cl100k_base (see the measurement sketch after this list)</li>
<li><strong>Augmentation</strong>: Multilingual (Primary Chinese) Augmentation</li>
<li><strong>Special Handling</strong>: Single digit Split</li>
<li><strong>Vocabulary Size</strong>: approximately 152k</li>
</ul>
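<p>The compression comparison discussed next can be reproduced in spirit with a few lines: here compression is measured as UTF-8 bytes per token, so higher is better. The sketch below uses tiktoken&rsquo;s cl100k_base (the base vocabulary listed above) as a stand-in; comparing against the released Qwen tokenizer itself would require loading its merge files, which is omitted.</p>
<pre><code class="language-python">import tiktoken

def bytes_per_token(text, encoding_name="cl100k_base"):
    # Compression rate: UTF-8 bytes of the input divided by the number of tokens produced.
    enc = tiktoken.get_encoding(encoding_name)
    return len(text.encode("utf-8")) / len(enc.encode(text))

samples = {
    "English": "Large language models compress text into subword tokens.",
    "Chinese": "大语言模型将文本压缩为子词词元。",
}
for lang, text in samples.items():
    print(f"{lang}: {bytes_per_token(text):.2f} bytes per token")
</code></pre>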
<p><strong>Encoding Compression Rate</strong>: Qwen achieves higher compression efficiency than its competitors in most languages.</p>]]></description></item><item><title>LLaMA 2: Open Foundation and Fine-Tuned Chat Models</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-2/</link><pubDate>Tue, 18 Jul 2023 17:45:18 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-2/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2307.09288" target="_blank" rel="noopener noreffer ">arXiv 2307.09288</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/facebookresearch/llama" target="_blank" rel="noopener noreffer ">facebookresearch/llama</a>
<a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" target="_blank" rel="noopener noreffer ">meta-llama/Llama-2-7b-hf</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/llama/" target="_blank" rel="noopener noreffer ">LLaMA 2</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 2 is the next generation of LLaMA models, featuring improved performance, longer context length (4K tokens), and fine-tuned chat models trained with Reinforcement Learning from Human Feedback (RLHF). The models are available in 7B, 13B, and 70B parameter sizes.</p>]]></description></item><item><title>LIMA: Less Is More for Alignment</title><link>https://blog.omagiclee.com/posts/llms/instruction-tuning/lima/</link><pubDate>Thu, 18 May 2023 20:17:46 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/instruction-tuning/lima/</guid><description><![CDATA[<p><i class="fas fa-award fa-fw" aria-hidden="true"></i><span style="color:gray"></span>
<i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray"></span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2305.11201" target="_blank" rel="noopener noreffer ">arXiv 2305.11201</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<a href="" rel=""></a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="" rel=""></a></p>
<h2 id="tldr">TL;DR</h2>
<p><span style="color:red;"><strong>Superficial Alignment Hypothesis</strong>: A model&rsquo;s knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it the style or format when interacting with users. -&gt; a rather small set of examples is sufficient to achieve alignment.</span></p>
<h2 id="motivations--innovations">Motivations &amp; Innovations</h2>
<p>Existing alignment methods require large amounts of instruction data. -&gt; LIMA instead fine-tunes on just 1,000 carefully curated training examples.</p>]]></description></item></channel></rss>