Qwen Series: Technical Summary

Model Architecture

Qwen
  • Attention Mechanism: Multi-head Attention (MHA) with Flash Attention
  • Positional Embedding: RoPE (FP32 precision)
  • Activation: SwiGLU (FFN: 8/3 × hidden size)
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 2K
  • Embedding Strategy: Untied Embedding
  • Key Changes: QKV bias; Flash Attention; Untied Embedding

Qwen2
  • Attention Mechanism: GQA (Grouped Query Attention); DCA with YARN
  • Positional Embedding: RoPE with YARN extension
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (with YARN)
  • Embedding Strategy: Untied Embedding
  • Key Changes: GQA for efficient KV cache; Dual Chunk Attention (DCA); YARN for long-context extension; QKV bias retained

Qwen2.5
  • Attention Mechanism: GQA
  • Positional Embedding: RoPE with YARN extension
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (with YARN)
  • Embedding Strategy: Untied Embedding
  • Key Changes: same as Qwen2

Qwen3
  • Attention Mechanism: GQA with QK-Norm
  • Positional Embedding: RoPE with ABF + YARN
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (ABF + YARN)
  • Embedding Strategy: Untied Embedding (varies by size)
  • Key Changes: remove QKV bias; introduce QK-Norm; ABF for context extension; MoE: 128 experts, 8 active
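The GQA mechanism in the table above can be made concrete with a minimal NumPy sketch. The head counts and dimensions below are toy values, not Qwen's actual configuration: each K/V head is shared by a group of query heads, so the KV cache shrinks by the group factor.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention sketch.

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    The KV cache stores only num_kv_heads heads, i.e. it is
    num_q_heads / num_kv_heads times smaller than with MHA.
    """
    num_q_heads, num_kv_heads = q.shape[0], k.shape[0]
    group = num_q_heads // num_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                              # query head h reads a shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)         # (seq, seq) scaled dot products
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kv]
    return out

# 8 query heads sharing 2 KV heads (group size 4)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
o = gqa(q, k, v)
```

With MHA the cache would hold 8 K/V heads per layer; here it holds 2, a 4x reduction at the same query-head count.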

MoE Architecture (Qwen3)

Qwen3 introduces MoE (Mixture-of-Experts) variants with significant architectural improvements:

  • Expert Configuration: 128 total experts with 8 experts activated per token
  • Fine-grained Expert Partitioning: Enhanced expressiveness through specialized experts
  • Global Batch Load-balancing Loss: Promotes expert specialization and balanced utilization
  • Key Difference from Qwen2.5-MoE: removes the shared-expert design for better efficiency
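The routing described above can be sketched in a few lines. Toy sizes and an illustrative gating/auxiliary-loss formulation (not Qwen3's exact implementation): each token's router logits select 8 of 128 experts, and a load-balancing term penalizes uneven expert utilization.

```python
import numpy as np

def route_topk(logits, k=8):
    """Pick the top-k experts per token; softmax-normalize their gate weights."""
    idx = np.argsort(logits, axis=-1)[:, -k:]           # (tokens, k) expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                       # gates sum to 1 per token
    return idx, w

def load_balance_loss(logits, idx, num_experts):
    """Auxiliary loss: mean router probability per expert times the fraction
    of tokens routed to it, summed over experts. It is minimized when load
    is uniform, which encourages balanced expert utilization."""
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    frac = np.bincount(idx.ravel(), minlength=num_experts) / idx.size
    return num_experts * float((p.mean(axis=0) * frac).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 128))    # 32 tokens, 128 experts
idx, gates = route_topk(logits, k=8)   # 8 active experts per token
aux = load_balance_loss(logits, idx, 128)
```

Qwen3 computes this balancing objective over the global batch rather than per device, which gives experts room to specialize while keeping aggregate load even.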

Context Length Extension Techniques

  • Qwen: Standard RoPE with 2K context
  • Qwen2/Qwen2.5: YARN (Yet Another RoPE extensioN) enables 32K-128K context
  • Qwen3: ABF (Adaptive Base Frequency) + YARN for optimized long-context handling up to 128K tokens
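The effect of raising RoPE's base frequency, which is the core of ABF, can be shown numerically. The base values below are illustrative, not Qwen3's published settings: a larger base lengthens each rotary dimension's wavelength, so positions far beyond the original training range still fall within one rotation period.

```python
import math

def rope_wavelengths(head_dim, base):
    """Per-dimension RoPE wavelength: the position offset after which that
    rotary pair completes a full rotation. A larger base stretches every
    wavelength, the slowest dimensions most of all."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# ABF in essence: raise the base (10_000 -> 1_000_000 here, illustrative values)
short = rope_wavelengths(128, 10_000)
long_ = rope_wavelengths(128, 1_000_000)
ratio = long_[-1] / short[-1]   # the slowest dimension is stretched the most
```

YARN then interpolates the remaining gap, scaling the frequencies non-uniformly so short-range dimensions keep their resolution while long-range ones cover the extended window.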

Tokenization

Qwen
  • Tokenizer: tiktoken (BBPE)
  • Base Vocabulary: cl100k_base
  • Vocabulary Size: ~152k
  • Special Features: multilingual augmentation (primarily Chinese); single-digit split

Qwen2
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen
  • Vocabulary Size: 151,646 (151,643 regular + 3 control)
  • Special Features: same as Qwen

Qwen2.5
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen2
  • Vocabulary Size: 151,646 (151,624 regular + 22 control)
  • Special Features: control tokens expanded from 3 to 22 (2 tool-related; 20 for other model capabilities)

Qwen3
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen2.5
  • Vocabulary Size: 151,646 (151,624 regular + 22 control)
  • Special Features: same as Qwen2.5
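The single-digit split noted above can be sketched as a pre-tokenization step that breaks digit runs into individual digits before BPE merges are applied, which keeps numeric strings compositional. A simplified regex illustration, not the actual tokenizer code:

```python
import re

# Split runs of digits into single-digit tokens ahead of byte-level BPE,
# so "2024" pre-tokenizes as "2", "0", "2", "4" rather than one chunk.
PRETOKEN = re.compile(r"\d|[^\d\s]+|\s+")

def pretokenize(text):
    """Return pre-token pieces: single digits, word runs, and whitespace."""
    return PRETOKEN.findall(text)

pieces = pretokenize("Qwen released in 2023")
```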

Pre-training Data

Qwen
  • Training Corpus: up to 3T tokens
  • Data Sources: public web documents; encyclopedias; books; code
  • Data Processing: deduplication; hybrid quality scrubbing; selective up-sampling; instruction-augmented pretraining; data decontamination
  • Language Support: primarily English and Chinese

Qwen2
  • Training Corpus: 7T tokens
  • Data Sources: web-scale data; high-quality code; mathematics data; multilingual data
  • Data Processing: quality enhancement (heuristic and model-based methods); Qwen models used for data synthesis; distribution improvement (optimized data mixing)
  • Language Support: ~30 languages

Qwen2.5
  • Training Corpus: 18T tokens
  • Data Sources: web-scale data; high-quality domain-specific datasets (math, code); synthetic data (Qwen2-72B-Instruct, Qwen2-Math-72B-Instruct)
  • Data Processing: better data filtering (Qwen2-Instruct model); better synthetic data (filtered with reward models); better data mixture (down-sample overrepresented domains, up-sample high-value domains)
  • Language Support: -

Qwen3
  • Training Corpus: 36T tokens (30T + 5T + 1T)
  • Data Sources: web; PDF-like documents (extracted with Qwen2.5-VL, refined with Qwen2.5); synthetic data (Qwen2.5-Math, Qwen2.5-Coder)
  • Data Processing: Stage 1: 30T tokens, 4K context; Stage 2: 5T tokens of knowledge-intensive data; Stage 3: context extended to 32K
  • Language Support: 119 languages and dialects

Pre-training

Qwen
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: batch size varies by model size; weight decay 0.1; gradient clipping 1.0

Qwen2
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: batch size varies by model size; weight decay 0.1; gradient clipping 1.0

Qwen2.5
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: same as Qwen2

Qwen3
  • Training Stages: Stage 1: 30T tokens, 4K context; Stage 2: 5T tokens, knowledge-intensive data; Stage 3: context extended to 32K
  • Training Objective: next-token prediction
  • Context Length: 4K → 32K (multi-stage)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: multi-stage training; context length extension; ABF + YARN techniques

Supervised Fine-Tuning (SFT)

Qwen
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: weight decay 0.1; dropout 0.1; gradient clipping 1.0; context length 2048

Qwen2
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: weight decay 0.1; dropout 0.1; gradient clipping 1.0; context length 2048

Qwen2.5
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: same as Qwen2

Qwen3
  • Training Objective: next-token prediction
  • Data Format: long CoT data + instruction-tuning data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup schedule
  • Batch Size: 128
  • Training Steps: varies
  • Key Hyperparameters: long CoT data focus; thinking mode fusion; extended context support
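Chat-style SFT still optimizes next-token prediction, but the loss is typically computed only on the response tokens. A minimal sketch of that label masking, assuming the common -100 ignore-index convention (an illustrative choice, not a documented Qwen detail):

```python
# Next-token prediction on chat data: prompt positions are masked out of the
# loss by setting their labels to the ignore index (-100 by convention in
# common training frameworks -- an illustrative choice here).
IGNORE = -100

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt + answer; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Toy token ids: a 3-token prompt followed by a 3-token answer
inp, lab = build_labels([101, 7, 8], [42, 43, 102])
```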

Reinforcement Learning from Human Feedback (RLHF)

Qwen
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: standard RLHF pipeline; reward model training; PPO optimization

Qwen2
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: enhanced reward modeling; improved safety training; better alignment

Qwen2.5
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: same as Qwen2

Qwen3
  • RL Method: multi-stage RL
  • Reward Model: rule-based + general reward model
  • Training Data: Stage 2: reasoning-based RL; Stage 4: general RL (20+ tasks)
  • Key Features: reasoning-based RL (Stage 2); rule-based rewards; general RL across 20+ tasks; enhanced exploration
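The rule-based rewards in Qwen3's reasoning RL stage can be illustrated with a verifiable-answer check; the exact rules are not public, so the answer format and matching below are a generic sketch:

```python
import re

def rule_based_reward(completion, gold_answer):
    """Toy verifiable reward: 1.0 if the model's final 'Answer:' value matches
    the reference exactly, else 0.0. The format and matching rule are
    illustrative, not Qwen3's actual reward specification."""
    m = re.search(r"Answer:\s*([^\n]+)", completion)
    if m is None:
        return 0.0                      # no parseable final answer
    return 1.0 if m.group(1).strip() == gold_answer else 0.0

r_good = rule_based_reward("... reasoning ...\nAnswer: 42", "42")
r_bad = rule_based_reward("Answer: 41", "42")
```

Rewards like this are automatically checkable, which is what lets reasoning RL scale without per-sample human labels; the later general-RL stage falls back to a learned reward model for open-ended tasks.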

Model Variants & Sizes

Qwen
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: -
  • Specialized Variants: Code-Qwen (coding-specialized); Code-Qwen-Chat (coding-specialized chat); Math-Qwen-Chat (mathematics-focused)
  • Context Length Support: 2K

Qwen2
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: MoE variants
  • Specialized Variants: -
  • Context Length Support: 32K-128K (with YARN)

Qwen2.5
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: MoE variants (with shared experts)
  • Specialized Variants: -
  • Context Length Support: 32K-128K (with YARN)

Qwen3
  • Dense Models: 0.6B to 32B parameters; base and post-trained models
  • MoE Models: Qwen3-30B-A3B (30B total, 3B active); Qwen3-235B-A22B (235B total, 22B active)
  • Specialized Variants: -
  • Context Length Support: 32K-128K (ABF + YARN)

Key Innovations by Version

Architecture
  • Qwen: Untied Embedding; RoPE (FP32 precision); QKV bias; Flash Attention
  • Qwen2: GQA (Grouped Query Attention); Dual Chunk Attention (DCA); YARN extension; MoE architecture
  • Qwen2.5: same as Qwen2
  • Qwen3: QK-Norm (replaces QKV bias); advanced MoE (128 experts, 8 active, no shared experts); ABF + YARN

Training
  • Qwen: standard pre-training
  • Qwen2: enhanced data quality
  • Qwen2.5: better data filtering; better synthetic data; optimized data mixture
  • Qwen3: multi-stage pre-training (3 stages); multi-stage RLHF (4 stages)

Tokenization
  • Qwen: tiktoken (BBPE); ~152k vocabulary
  • Qwen2: BBPE; 151,646 tokens (3 control)
  • Qwen2.5: expanded control tokens (3 → 22); tool-related tokens
  • Qwen3: same as Qwen2.5

Context Length
  • Qwen: 2K
  • Qwen2: 32K-128K (YARN)
  • Qwen2.5: 32K-128K (YARN)
  • Qwen3: 32K-128K (ABF + YARN)

Special Features
  • Qwen: Code-Qwen variants; Math-Qwen variants
  • Qwen2: -
  • Qwen2.5: model-based data synthesis; reward-model filtering
  • Qwen3: hybrid thinking modes; enhanced agent capabilities; MCP support