<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>LLMs - Category - Naifan Li's Blog</title><link>https://blog.omagiclee.com/categories/llms/</link><description>LLMs - Category - Naifan Li's Blog</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 14 May 2025 11:26:22 +0800</lastBuildDate><atom:link href="https://blog.omagiclee.com/categories/llms/" rel="self" type="application/rss+xml"/><item><title>Qwen Series: Technical Summary</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-llm-summary/</link><pubDate>Wed, 14 May 2025 11:26:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-llm-summary/</guid><description><![CDATA[<h2 id="model-architecture">Model Architecture</h2>
<table class="comparison-table">
  <thead>
    <tr>
      <th>Model</th>
      <th>Attention Mechanism</th>
      <th>Positional Embedding</th>
      <th>Activation</th>
      <th>Normalization</th>
      <th>Context Length</th>
      <th>Embedding Strategy</th>
      <th>Key Changes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Qwen</strong></td>
      <td>Multi-head Attention (MHA)<br>Flash Attention</td>
      <td>RoPE (FP32 precision)</td>
      <td>SwiGLU<br>FFN: 8/3 × hidden size</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>2K</td>
      <td>Untied Embedding</td>
      <td>
        <ul>
          <li>QKV bias</li>
          <li>Flash Attention</li>
          <li>Untied Embedding</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Qwen2</strong></td>
      <td>GQA (Grouped Query Attention)<br>DCA with YARN</td>
      <td>RoPE<br>YARN extension</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(with YARN)</td>
      <td>Untied Embedding</td>
      <td>
        <ul>
          <li>GQA for efficient KV cache</li>
          <li>Dual Chunk Attention (DCA)</li>
          <li>YARN for long context extension</li>
          <li>QKV bias retained</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td><strong>Qwen2.5</strong></td>
      <td>GQA</td>
      <td>RoPE<br>YARN extension</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(with YARN)</td>
      <td>Untied Embedding</td>
      <td>Same as Qwen2</td>
    </tr>
    <tr>
      <td><strong>Qwen3</strong></td>
      <td>GQA<br>QK-Norm</td>
      <td>RoPE<br>ABF + YARN</td>
      <td>SwiGLU</td>
      <td>Pre-Norm & RMSNorm</td>
      <td>32K-128K<br>(ABF + YARN)</td>
      <td>Untied Embedding<br>(varies by size)</td>
      <td>
        <ul>
          <li>Remove QKV-bias</li>
          <li>Introduce QK-Norm</li>
          <li>ABF for context extension</li>
          <li>MoE: 128 experts, 8 active</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
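<p>One of the attention changes in the table above is easy to make concrete. The snippet below is a minimal NumPy sketch of QK-Norm as it is commonly implemented for Qwen3-style attention: an RMSNorm over the head dimension of queries and keys, applied before the attention scores are computed. It is an illustration rather than Qwen3&rsquo;s actual code; the shapes, epsilon, and all-ones scale parameters are assumptions.</p>
<pre><code class="language-python">import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the last axis (head_dim), then apply a learned scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def qk_norm_attention_scores(q, k):
    # q, k: (batch, num_heads, seq_len, head_dim)
    # QK-Norm: RMS-normalize queries and keys per head before the dot product,
    # which keeps attention logits in a stable range during training.
    head_dim = q.shape[-1]
    gamma_q = np.ones(head_dim)   # learned parameters in a real model
    gamma_k = np.ones(head_dim)
    q = rms_norm(q, gamma_q)
    k = rms_norm(k, gamma_k)
    return q @ np.swapaxes(k, -1, -2) / np.sqrt(head_dim)

q = np.random.randn(1, 2, 4, 8)   # (batch, heads, seq, head_dim)
k = np.random.randn(1, 2, 4, 8)
print(qk_norm_attention_scores(q, k).shape)   # (1, 2, 4, 4)
</code></pre>
<p>The table notes that Qwen3 both removes the QKV bias and introduces QK-Norm; both changes are aimed at keeping the attention computation stable.</p>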
<h3 id="moe-architecture-qwen3">MoE Architecture (Qwen3)</h3>
<p>Qwen3 introduces MoE (Mixture-of-Experts) variants with significant architectural improvements:</p>]]></description></item><item><title>Qwen3 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-3/</link><pubDate>Wed, 14 May 2025 10:28:47 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-3/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2505.09388" target="_blank" rel="noopener noreffer ">arXiv 2505.09388</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen3" target="_blank" rel="noopener noreffer ">QwenLM/Qwen3</a>
<a href="https://huggingface.co/collections/Qwen/qwen3" target="_blank" rel="noopener noreffer ">Qwen/qwen3</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://qwen.readthedocs.io/en/latest/" target="_blank" rel="noopener noreffer ">Blog</a></p>
<h1 id="introduction">Introduction</h1>
<p>Post-trained models, such as Qwen3-30B-A3B, along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base), are now available on platforms like Hugging Face, ModelScope, and Kaggle.</p>
<h1 id="key-features">Key Features</h1>
<ul>
<li><strong>Hybrid Thinking Modes</strong>
<ul>
<li>Thinking Mode: In this mode, the model takes time to reason step by step before delivering the final answer. This is ideal for complex problems that require deeper thought.</li>
<li>Non-Thinking Mode: Here, the model provides quick, near-instant responses, suitable for simpler questions where speed is more important than depth (a usage sketch for switching between the two modes follows this list).
</li>
</ul>
</li>
<li><strong>Multilingual Support</strong>
<ul>
<li>Supports 119 languages and dialects</li>
</ul>
</li>
<li><strong>Improved Agentic Capabilities</strong>
<ul>
<li>We have optimized the Qwen3 models for coding and agentic capabilities, and strengthened support for <strong>MCP</strong> (Model Context Protocol).</li>
</ul>
</li>
</ul>
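<p>As a usage sketch for the hybrid thinking modes: recent Qwen3 model cards describe an <code>enable_thinking</code> switch on the Hugging Face chat template. The snippet below assumes that interface and a locally available checkpoint (the repository id and prompt are illustrative); treat it as a sketch rather than authoritative usage.</p>
<pre><code class="language-python">from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"   # assumed checkpoint id; any Qwen3 chat model should behave similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 30?"}]

# enable_thinking=True lets the model reason step by step before the final answer;
# set it to False for quick, non-thinking responses.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
</code></pre>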
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Qwen3 uses the Qwen2.5 BBPE tokenizer (vocabulary size 151,646: 151,624 regular tokens and 22 control tokens).</p>]]></description></item><item><title>Large Model API Development Guide</title><link>https://blog.omagiclee.com/posts/toolkits/manual-of-develop-large-model-api/</link><pubDate>Tue, 11 Feb 2025 10:17:16 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/toolkits/manual-of-develop-large-model-api/</guid><description><![CDATA[<h2 id="阿里云百炼平台">Alibaba Cloud Model Studio (Bailian) Platform</h2>
<p><a href="https://bailian.console.aliyun.com/cn-beijing/?spm=a2c4g.11186623.0.0.3bb0394e3JQHXT&amp;tab=api#/api/?type=model&amp;url=2712195" target="_blank" rel="noopener noreffer ">https://bailian.console.aliyun.com/cn-beijing/?spm=a2c4g.11186623.0.0.3bb0394e3JQHXT&tab=api#/api/?type=model&url=2712195</a></p>
<ol>
<li>Obtain an API key</li>
</ol>
<p><a href="https://bailian.console.aliyun.com/cn-beijing/?tab=model#/api-key" target="_blank" rel="noopener noreffer ">https://bailian.console.aliyun.com/cn-beijing/?tab=model#/api-key</a></p>
<ol start="2">
<li>Call the API with the key (a minimal call sketch follows the sample links below)</li>
</ol>
<ul>
<li>API Key</li>
<li>Base URL: <a href="https://dashscope.aliyuncs.com/compatible-mode/v1" target="_blank" rel="noopener noreffer ">https://dashscope.aliyuncs.com/compatible-mode/v1</a></li>
<li>Model Name: e.g., qwen3-max</li>
</ul>
<p>Samples:</p>
<ul>
<li><a href="https://help.aliyun.com/zh/model-studio/claude-code" target="_blank" rel="noopener noreffer ">https://help.aliyun.com/zh/model-studio/claude-code</a></li>
<li><a href="https://help.aliyun.com/zh/model-studio/openclaw?spm=a2c4g.11186623.help-menu-2400256.d_0_10_5.79ec69c3Dekn1K&amp;scm=20140722.H_3020785._.OR_help-T_cn~zh-V_1" target="_blank" rel="noopener noreffer ">https://help.aliyun.com/zh/model-studio/openclaw?spm=a2c4g.11186623.help-menu-2400256.d_0_10_5.79ec69c3Dekn1K&scm=20140722.H_3020785._.OR_help-T_cn~zh-V_1</a></li>
</ul>
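<p>Putting the three settings from step 2 together, a minimal call through the OpenAI-compatible endpoint looks roughly like the sketch below. It assumes the OpenAI Python SDK installed in step 4 and the <code>DASHSCOPE_API_KEY</code> environment variable configured in step 3; the prompt is illustrative.</p>
<pre><code class="language-python">import os
from openai import OpenAI

# Point the SDK at the compatible-mode Base URL and read the key from the environment.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3-max",   # model name from step 2
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)
</code></pre>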
<ol start="3">
<li>Configure the API key as an environment variable</li>
</ol>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-zsh" data-lang="zsh"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;export DASHSCOPE_API_KEY=&#39;YOUR_DASHSCOPE_API_KEY&#39;&#34;</span> &gt;&gt; ~/.zshrc
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.zshrc
</span></span></code></pre></td></tr></table>
</div>
</div><ol start="4">
<li>Install the OpenAI Python SDK</li>
</ol>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">pip3</span> <span class="n">install</span> <span class="o">-</span><span class="n">U</span> <span class="n">openai</span>
</span></span></code></pre></td></tr></table>
</div>
</div>]]></description></item><item><title>LLaMA 4: Next-Generation Open Language Models</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-4/</link><pubDate>Wed, 25 Dec 2024 17:45:22 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-4/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="" rel="">arXiv TBD</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/meta-llama/llama4" target="_blank" rel="noopener noreffer ">meta-llama/llama4</a>
<a href="" rel="">meta-llama/Meta-Llama-4</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/blog/meta-llama-4/" target="_blank" rel="noopener noreffer ">LLaMA 4</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 4 represents the latest generation of Meta&rsquo;s open language models, featuring significant improvements in reasoning, context handling, and multimodal capabilities. The models continue Meta&rsquo;s commitment to open-source AI research.</p>
<h2 id="motivation">Motivation</h2>
<p>LLaMA 4 builds upon the success of previous generations by:</p>
<ul>
<li>Advancing reasoning and problem-solving capabilities</li>
<li>Extending context length for better long-context understanding</li>
<li>Improving efficiency and scalability</li>
<li>Enhancing safety and alignment</li>
</ul>
<h2 id="key-innovations">Key Innovations</h2>
<ul>
<li><strong>Advanced Reasoning</strong>: Improved reasoning capabilities through enhanced training</li>
<li><strong>Extended Context</strong>: Support for longer context windows</li>
<li><strong>Efficiency Improvements</strong>: Better parameter efficiency and inference speed</li>
<li><strong>Safety Enhancements</strong>: Continued focus on safety and alignment</li>
</ul>
<h2 id="approach">Approach</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>LLaMA 4 features an evolved Transformer architecture:</p>]]></description></item><item><title>Qwen2.5 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-2.5/</link><pubDate>Thu, 19 Dec 2024 10:28:47 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-2.5/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2412.15115" target="_blank" rel="noopener noreffer ">arXiv 2412.15115</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<a href="" rel=""></a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="" rel=""></a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Based on the Qwen BBPE tokenizer, Qwen2.5 keeps a vocabulary of 151,646 tokens (151,624 regular and 22 control), expanding the control tokens from 3 to 22: two tool-related tokens and 20 reserved for other model capabilities.</p>
<h3 id="model-architecture">Model Architecture</h3>
<h4 id="dense-model">Dense Model</h4>
<ul>
<li>GQA for efficient KV cache utilization</li>
<li>SwiGLU for activation (a small sketch follows this list)</li>
<li>RoPE for positional embedding</li>
<li>QKV bias for attention</li>
<li>RMSNorm and pre-normalization for training stability</li>
</ul>
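<p>As a concrete reference for the activation choice above, here is a minimal NumPy sketch of a SwiGLU feed-forward block: a SiLU-gated product of two up-projections followed by a down-projection. The weight shapes are toy values chosen only to show the roughly 8/3 hidden-to-intermediate ratio mentioned for the Qwen FFN; this is not Qwen2.5&rsquo;s actual code.</p>
<pre><code class="language-python">import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # x: (seq_len, hidden); w_gate, w_up: (hidden, intermediate); w_down: (intermediate, hidden)
    # The SiLU-activated gate branch multiplies the linear up branch element-wise,
    # and the product is projected back to the hidden size.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

hidden, intermediate = 8, 22   # toy sizes, roughly the 8/3 ratio
x = np.random.randn(3, hidden)
w_gate = np.random.randn(hidden, intermediate)
w_up = np.random.randn(hidden, intermediate)
w_down = np.random.randn(intermediate, hidden)
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)   # (3, 8)
</code></pre>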
<h4 id="mixture-of-experts-moe-model">Mixture-of-Experts (MoE) Model</h4>
<h3 id="pre-training">Pre-training</h3>
<h4 id="pre-training-data">Pre-training Data</h4>
<ul>
<li><strong>Better data filtering</strong>: pre-training data is quality-filtered with the Qwen2-Instruct model</li>
<li><strong>Better math and code data</strong>: incorporate high-quality domain-specific datasets (math, code) during pretraining.</li>
<li><strong>Better synthetic data</strong>:
<ul>
<li>leverage both Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct to generate high-quality synthetic data, particularly in the mathematics, code, and knowledge domains.</li>
<li>further enhance the quality of synthesized data through rigorous filtering with a proprietary general reward model and the specialized Qwen2-Math-RM-72B model.</li>
</ul>
</li>
<li><strong>Better data mixture</strong>:
<ul>
<li>Domains like e-commerce, social media, and entertainment are significantly overrepresented in web-scale data, often containing repetitive, template-based, or machine-generated content.</li>
<li>Domains such as technology, science, and academic research, while containing higher-quality information, are traditionally underrepresented.</li>
<li>Qwen2.5 therefore down-samples overrepresented domains and up-samples high-value domains (a small reweighting sketch follows this list).</li>
</ul>
</li>
</ul>
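<p>To make the last point concrete, the toy sketch below computes per-domain resampling weights from a raw share and a target share. The domain names and every number are invented for illustration; the report does not publish its actual mixture weights.</p>
<pre><code class="language-python"># Toy illustration of down-/up-sampling domains when building a pre-training mixture.
# Raw shares stand in for what a web crawl yields; target shares encode the preference
# for higher-value domains. All numbers here are made up.
raw_share = {"e-commerce": 0.30, "social": 0.25, "entertainment": 0.20,
             "technology": 0.10, "science": 0.10, "academic": 0.05}
target_share = {"e-commerce": 0.10, "social": 0.10, "entertainment": 0.10,
                "technology": 0.25, "science": 0.25, "academic": 0.20}

# Per-document sampling weight = target share / raw share, so overrepresented domains
# get a weight below 1 (down-sampled) and high-value domains above 1 (up-sampled).
sampling_weight = {d: target_share[d] / raw_share[d] for d in raw_share}
for domain, w in sorted(sampling_weight.items(), key=lambda kv: kv[1]):
    print(f"{domain:13s} weight = {w:.2f}")
</code></pre>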
<h2 id="experiments">Experiments</h2>
<h2 id="references">References</h2>]]></description></item><item><title>Qwen2 Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen-2/</link><pubDate>Mon, 15 Jul 2024 10:28:43 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen-2/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2407.10671" target="_blank" rel="noopener noreffer ">arXiv 2407.10671</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen2" target="_blank" rel="noopener noreffer ">QwenLM/Qwen2</a>
<a href="https://huggingface.co/collections/Qwen/qwen2" target="_blank" rel="noopener noreffer ">Qwen/qwen2</a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<p>Identical to Qwen, the tokenizer uses byte-level byte-pair encoding (BBPE) with a total vocabulary size of 151,646, consisting of 151,643 regular tokens and 3 control tokens.</p>
<h3 id="model-architecture">Model Architecture</h3>
<style>
.grouped-table.architecture {
  table-layout: auto;
  width: auto;
  margin-left: 0;
  margin-right: auto;
}
.grouped-table.architecture th,
.grouped-table.architecture td {
  padding: 3px 5px;
  white-space: nowrap;
}
.grouped-table.architecture td:first-child {
  white-space: normal;
  padding-right: 10px;
}
.grouped-table.architecture th:not(:first-child),
.grouped-table.architecture td:not(:first-child) {
  text-align: center;
  padding-left: 5px;
  padding-right: 5px;
}
</style>
<table class="grouped-table architecture">
  <thead>
    <tr>
      <th>Configuration</th>
      <th>0.5B</th>
      <th>1.5B</th>
      <th>7B</th>
      <th>72B</th>
      <th>57B-A14B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Hidden Size</strong></td>
      <td>896</td>
      <td>1,536</td>
      <td>3,584</td>
      <td>8,192</td>
      <td>3,584</td>
    </tr>
    <tr>
      <td><strong># Layers</strong></td>
      <td>24</td>
      <td>28</td>
      <td>28</td>
      <td>80</td>
      <td>28</td>
    </tr>
    <tr>
      <td><strong># Query Heads</strong></td>
      <td>14</td>
      <td>12</td>
      <td>28</td>
      <td>64</td>
      <td>28</td>
    </tr>
    <tr>
      <td><strong># KV Heads</strong></td>
      <td>2</td>
      <td>2</td>
      <td>4</td>
      <td>8</td>
      <td>4</td>
    </tr>
    <tr>
      <td><strong>Head Size</strong></td>
      <td>64</td>
      <td>128</td>
      <td>128</td>
      <td>128</td>
      <td>128</td>
    </tr>
    <tr>
      <td><strong>Intermediate Size</strong></td>
      <td>4,864</td>
      <td>8,960</td>
      <td>18,944</td>
      <td>29,568</td>
      <td>2,560</td>
    </tr>
    <tr>
      <td><strong># Routed Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>64</td>
    </tr>
    <tr>
      <td><strong># Activated Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>8</td>
    </tr>
    <tr>
      <td><strong># Shared Experts</strong></td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>8</td>
    </tr>
    <tr>
      <td><strong>Embedding Tying</strong></td>
      <td>True</td>
      <td>True</td>
      <td>False</td>
      <td>False</td>
      <td>False</td>
    </tr>
    <tr>
      <td><strong>Vocabulary Size</strong></td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
      <td>151,646</td>
    </tr>
    <tr>
      <td><strong># Trained Tokens</strong></td>
      <td>12T</td>
      <td>7T</td>
      <td>7T</td>
      <td>7T</td>
      <td>4.5T</td>
    </tr>
  </tbody>
</table>
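<p>The configuration table is enough to back out rough parameter counts for the dense models. The sketch below assumes GQA attention projections, a three-matrix SwiGLU MLP, and an extra output head when embeddings are untied; it ignores biases and normalization weights, so the results are approximations rather than official figures (the 57B-A14B MoE variant is omitted because its intermediate size is per expert).</p>
<pre><code class="language-python">def approx_params(hidden, layers, q_heads, kv_heads, head_size, inter, vocab, tied):
    attn_dim = q_heads * head_size
    # Q and O projections: hidden x attn_dim each; K and V projections: hidden x (kv_heads * head_size) each.
    attn = 2 * hidden * attn_dim + 2 * hidden * kv_heads * head_size
    # SwiGLU MLP: gate and up projections (hidden x inter) plus a down projection (inter x hidden).
    mlp = 3 * hidden * inter
    # Token embedding, plus a separate output head when embedding tying is off.
    emb = vocab * hidden * (1 if tied else 2)
    return layers * (attn + mlp) + emb

# (hidden, layers, q_heads, kv_heads, head_size, intermediate, vocab, tied) from the table above.
configs = {
    "0.5B": (896, 24, 14, 2, 64, 4864, 151646, True),
    "1.5B": (1536, 28, 12, 2, 128, 8960, 151646, True),
    "7B": (3584, 28, 28, 4, 128, 18944, 151646, False),
    "72B": (8192, 80, 64, 8, 128, 29568, 151646, False),
}
for name, cfg in configs.items():
    print(f"Qwen2-{name}: ~{approx_params(*cfg) / 1e9:.2f}B parameters")
</code></pre>
<p>The estimates land close to the nominal sizes (roughly 0.49B, 1.5B, 7.6B, and 72.7B), which is a useful sanity check on the table.</p>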
<h4 id="dense-model">Dense Model</h4>
<ul>
<li><strong>Grouped Query Attention (GQA)</strong>: GQA replaces conventional multi-head attention (MHA), shrinking the KV cache during inference and significantly improving throughput (a small sketch follows this list).</li>
<li><strong>Dual Chunk Attention (DCA) with YARN</strong>: DCA splits long sequences into chunks of manageable length and captures relative positions within and across chunks; combined with YARN's rescaling of attention weights, this extends the effective context window.</li>
<li>Moreover, Qwen2 follows Qwen in using SwiGLU (Dauphin et al., 2017) as the activation function, Rotary Positional Embeddings (RoPE, Su et al., 2024) for positional encoding, QKV bias (Su, 2023) in attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.</li>
</ul>
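<p>A minimal sketch of the GQA computation referenced above: queries keep the full number of heads while keys (and values) use far fewer, and each KV head is shared by a group of query heads when scores are computed. This is an illustrative NumPy version rather than Qwen2&rsquo;s implementation; the shapes mirror the 7B column of the table (28 query heads, 4 KV heads, head size 128).</p>
<pre><code class="language-python">import numpy as np

def gqa_scores(q, k, num_q_heads, num_kv_heads):
    # q: (seq, num_q_heads, head_dim); k: (seq, num_kv_heads, head_dim)
    # Each group of (num_q_heads // num_kv_heads) query heads shares one KV head,
    # so the cached K/V tensors are num_kv_heads / num_q_heads times the size of MHA's.
    group = num_q_heads // num_kv_heads
    k_expanded = np.repeat(k, group, axis=1)           # (seq, num_q_heads, head_dim)
    head_dim = q.shape[-1]
    return np.einsum("qhd,khd->hqk", q, k_expanded) / np.sqrt(head_dim)

seq, q_heads, kv_heads, head_dim = 16, 28, 4, 128      # Qwen2-7B-style shapes
q = np.random.randn(seq, q_heads, head_dim)
k = np.random.randn(seq, kv_heads, head_dim)
print(gqa_scores(q, k, q_heads, kv_heads).shape)       # (28, 16, 16)
</code></pre>
<p>With these shapes the KV cache per layer shrinks by a factor of 7 relative to full multi-head attention, which is the throughput benefit mentioned above.</p>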
<h4 id="mixture-of-experts-moe-model">Mixture-of-Experts (MoE) Model</h4>
<style>
.grouped-table {
  width: 100%;
  font-size: 0.8em;
  border-collapse: collapse;
  margin: 1.5rem 0;
}
.grouped-table th,
.grouped-table td {
  padding: 8px 12px;
  border: 1px solid #ddd;
  text-align: center;
  vertical-align: middle;
  color: #000;
}
.grouped-table .section-header td {
  background: #f0f0f0;
  color: #000;
  font-weight: bold;
  text-align: center;
  padding: 12px;
  font-size: 1.05em;
}
.grouped-table td:first-child {
  font-weight: 600;
  background: #f8f9fa;
  color: #000;
}
.grouped-table ul {
  margin: 0;
  padding-left: 15px;
  text-align: left;
  font-size: 0.95em;
  color: #000;
}
[theme=dark] .grouped-table th,
[theme=dark] .grouped-table td {
  border-color: #444;
  color: #fff;
}
[theme=dark] .grouped-table .section-header td {
  background: #2a2a2a;
  color: #fff;
}
[theme=dark] .grouped-table td:first-child {
  background: #2a2a2a;
  color: #fff;
}
[theme=dark] .grouped-table ul {
  color: #fff;
}
</style>
<table class="grouped-table">
  <thead>
    <tr>
      <th rowspan="2">Stages</th>
      <th rowspan="2">Pre-training</th>
      <th rowspan="2">SFT</th>
      <th rowspan="2">Reinforcement Learning</th>
    </tr>
  </thead>
<!-- Group 1: Hyperparameters -->
  <tbody>
    <tr class="section-header">
      <td colspan="4">Hyperparameters</td>
    </tr>
    <tr>
      <td><strong>Purpose</strong></td>
      <td style="white-space: nowrap;">Language Foundations & World Knowledge</td>
      <td>Chat-style Alignment & Instruction Following</td>
      <td>Human Preference Alignment</td>
    </tr>
    <tr>
      <td><strong>Training Objective</strong></td>
      <td colspan="2">Next-token prediction</td>
      <td>Reward Maximization (PPO)</td>
    </tr>
    <tr>
      <td><strong>Vocabulary Size</strong></td>
      <td>151,643 regular tokens and 3 control tokens</td>
      <td></td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Optimizer</strong></td>
      <td colspan="2">AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Learning Rate</strong></td>
      <td>Cosine schedule (peak → 10% peak)</td>
      <td>7×10⁻⁶ → 7×10⁻⁷ (linear decay)</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Precision</strong></td>
      <td colspan="2">BFloat16 mixed precision</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Batch Size</strong></td>
      <td></td>
      <td>128</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Training Epochs</strong></td>
      <td></td>
      <td>2</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Weight Decay</strong></td>
      <td></td>
      <td>0.1</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Gradient Clipping</strong></td>
      <td></td>
      <td>1.0</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>Context Length</strong></td>
      <td>2048</td>
      <td>32,768</td>
      <td></td>
    </tr>
  </tbody>
<!-- Group 3: Data -->
  <tbody>
    <tr class="section-header">
      <td colspan="4">Data</td>
    </tr>
    <tr>
      <td><strong>Training Corpus</strong></td>
      <td>7T tokens</td>
      <td>500,000+ instruction examples<br>(instruction following, coding, mathematics,<br>logical reasoning, role-playing, multilingualism, safety)</td>
      <td>-</td>
    </tr>
  </tbody>
</table>
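<p>The optimizer rows of the table translate directly into a configuration sketch. The snippet below wires up AdamW with the listed betas and epsilon, a cosine schedule that decays to 10% of the peak rate (the pre-training schedule), and the weight decay and gradient clipping values listed for SFT. The peak learning rate, step counts, and placeholder model are invented for illustration; this is not the Qwen2 training code.</p>
<pre><code class="language-python">import math
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the actual transformer
peak_lr, total_steps, warmup_steps = 3e-4, 10_000, 500   # illustrative values

# AdamW settings from the table: beta1=0.9, beta2=0.95, eps=1e-8; weight decay 0.1 (SFT column).
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

def lr_factor(step):
    # Linear warmup to the peak, then cosine decay down to 10% of the peak.
    if step >= warmup_steps:
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))
    return step / max(1, warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Inside the training loop, clip gradients at 1.0 (per the table) before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
</code></pre>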
<h3 id="pre-training">Pre-training</h3>
<h4 id="pre-training-data">Pre-training Data</h4>
<ul>
<li>A 7-trillion-token, quality-filtered pre-training dataset</li>
<li>An attempt to further relax the quality threshold produced a larger 12-trillion-token dataset.</li>
</ul>
<p>All Qwen2 dense models except Qwen2-0.5B were pre-trained on this large-scale dataset of over 7 trillion tokens, while Qwen2-0.5B was pre-trained on the 12-trillion-token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.</p>]]></description></item><item><title>LLaMA 3: The Most Capable Openly Available LLM to Date</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-3/</link><pubDate>Thu, 18 Apr 2024 17:45:20 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-3/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2404.14219" target="_blank" rel="noopener noreffer ">arXiv 2404.14219</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/meta-llama/llama3" target="_blank" rel="noopener noreffer ">meta-llama/llama3</a>
<a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B" target="_blank" rel="noopener noreffer ">meta-llama/Meta-Llama-3-8B</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/blog/meta-llama-3/" target="_blank" rel="noopener noreffer ">LLaMA 3</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 3 represents a significant advancement in open-source language models, featuring improved reasoning capabilities, extended context length (8K tokens), and a new tokenizer with 128K vocabulary. The initial release includes 8B and 70B parameter models, with larger models planned.</p>
<h2 id="motivation">Motivation</h2>
<p>LLaMA 3 aims to push the boundaries of open-source language models by:</p>]]></description></item><item><title>Qwen Technical Report</title><link>https://blog.omagiclee.com/posts/llms/qwens/qwen/</link><pubDate>Thu, 28 Sep 2023 10:28:38 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/qwens/qwen/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Qwen Team, Alibaba Group</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2309.16609" target="_blank" rel="noopener noreffer ">arXiv 2309.16609</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/QwenLM/Qwen" target="_blank" rel="noopener noreffer ">QwenLM/Qwen</a>
<a href="https://huggingface.co/collections/Qwen/qwen" target="_blank" rel="noopener noreffer ">Qwen/qwen</a></p>
<h2 id="tldr">TL;DR</h2>
<h2 id="motivation">Motivation</h2>
<h2 id="key-innovations">Key Innovations</h2>
<ul>
<li><strong>Qwen</strong>: the base pretrained language models</li>
<li><strong>Qwen-Chat</strong>: the chat models fine-tuned with human alignment techniques (RLHF)</li>
<li><strong>Code-Qwen</strong>: coding-specialized base models</li>
<li><strong>Code-Qwen-Chat</strong>: coding-specialized chat models</li>
<li><strong>Math-Qwen-Chat</strong>: mathematics-focused chat models</li>
</ul>
<h2 id="approach">Approach</h2>
<h3 id="tokenization">Tokenization</h3>
<ul>
<li><strong>Tokenizer</strong>: tiktoken (BBPE)</li>
<li><strong>Base Vocabulary</strong>: cl100k_base (see the measurement sketch after this list)</li>
<li><strong>Augmentation</strong>: Multilingual (Primary Chinese) Augmentation</li>
<li><strong>Special Handling</strong>: Single digit Split</li>
<li><strong>Vocabulary Size</strong>: approximately 152k</li>
</ul>
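<p>The compression comparison discussed next can be reproduced in spirit with a few lines: here compression is measured as UTF-8 bytes per token, so higher is better. The sketch below uses tiktoken&rsquo;s cl100k_base (the base vocabulary listed above) as a stand-in; comparing against the released Qwen tokenizer itself would require loading its merge files, which is omitted.</p>
<pre><code class="language-python">import tiktoken

def bytes_per_token(text, encoding_name="cl100k_base"):
    # Compression rate: UTF-8 bytes of the input divided by the number of tokens produced.
    enc = tiktoken.get_encoding(encoding_name)
    return len(text.encode("utf-8")) / len(enc.encode(text))

samples = {
    "English": "Large language models compress text into subword tokens.",
    "Chinese": "大语言模型将文本压缩为子词词元。",
}
for lang, text in samples.items():
    print(f"{lang}: {bytes_per_token(text):.2f} bytes per token")
</code></pre>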
<p><strong>Encoding Compression Rate</strong>: Qwen achieves higher compression efficiency than its competitors in most languages.</p>]]></description></item><item><title>LLaMA 2: Open Foundation and Fine-Tuned Chat Models</title><link>https://blog.omagiclee.com/posts/llms/llamas/llama-2/</link><pubDate>Tue, 18 Jul 2023 17:45:18 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/llamas/llama-2/</guid><description><![CDATA[<p><i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray">Meta AI</span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2307.09288" target="_blank" rel="noopener noreffer ">arXiv 2307.09288</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="https://github.com/facebookresearch/llama" target="_blank" rel="noopener noreffer ">facebookresearch/llama</a>
<a href="https://huggingface.co/meta-llama/Llama-2-7b-hf" target="_blank" rel="noopener noreffer ">meta-llama/Llama-2-7b-hf</a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="https://ai.meta.com/llama/" target="_blank" rel="noopener noreffer ">LLaMA 2</a></p>
<h2 id="tldr">TL;DR</h2>
<p>LLaMA 2 is the next generation of LLaMA models, featuring improved performance, longer context length (4K tokens), and fine-tuned chat models trained with Reinforcement Learning from Human Feedback (RLHF). The models are available in 7B, 13B, and 70B parameter sizes.</p>]]></description></item><item><title>LIMA: Less Is More for Alignment</title><link>https://blog.omagiclee.com/posts/llms/instruction-tuning/lima/</link><pubDate>Thu, 18 May 2023 20:17:46 +0800</pubDate><author>Naifan Li</author><guid>https://blog.omagiclee.com/posts/llms/instruction-tuning/lima/</guid><description><![CDATA[<p><i class="fas fa-award fa-fw" aria-hidden="true"></i><span style="color:gray"></span>
<i class="fas fa-building fa-fw" aria-hidden="true"></i><span style="color:gray"></span>
<i class="fas fa-file-pdf fa-fw" aria-hidden="true"></i><a href="https://arxiv.org/abs/2305.11201" target="_blank" rel="noopener noreffer ">arXiv 2305.11201</a>
<i class="fab fa-github fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<a href="" rel=""></a>
<i class="fas fa-globe fa-fw" aria-hidden="true"></i><a href="" rel=""></a>
<i class="fas fa-blog fa-fw" aria-hidden="true"></i><a href="" rel=""></a></p>
<h2 id="tldr">TL;DR</h2>
<p><span style="color:red;"><strong>Superficial Alignment Hypothesis</strong>: A model&rsquo;s knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it the style or format when interacting with users. -&gt; a rather small set of examples is sufficient to achieve alignment.</span></p>
<h2 id="motivations--innovations">Motivations &amp; Innovations</h2>
<p>Existing alignment methods require large amounts of instruction data. -&gt; LIMA instead fine-tunes on just 1,000 carefully curated training examples.</p>]]></description></item></channel></rss>