Qwen Series: Technical Summary

Model Architecture

Qwen
  • Attention Mechanism: Multi-head Attention (MHA) with Flash Attention
  • Positional Embedding: RoPE (FP32 precision)
  • Activation: SwiGLU (FFN: 8/3 × hidden size)
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 2K
  • Embedding Strategy: Untied Embedding
  • Key Changes: QKV bias; Flash Attention; Untied Embedding

Qwen2
  • Attention Mechanism: GQA (Grouped Query Attention); DCA with YARN
  • Positional Embedding: RoPE with YARN extension
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (with YARN)
  • Embedding Strategy: Untied Embedding
  • Key Changes: GQA for efficient KV cache; Dual Chunk Attention (DCA); YARN for long-context extension; QKV bias retained

Qwen2.5
  • Attention Mechanism: GQA
  • Positional Embedding: RoPE with YARN extension
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (with YARN)
  • Embedding Strategy: Untied Embedding
  • Key Changes: same as Qwen2

Qwen3
  • Attention Mechanism: GQA with QK-Norm
  • Positional Embedding: RoPE with ABF + YARN
  • Activation: SwiGLU
  • Normalization: Pre-Norm & RMSNorm
  • Context Length: 32K-128K (ABF + YARN)
  • Embedding Strategy: Untied Embedding (varies by size)
  • Key Changes: remove QKV bias; introduce QK-Norm; ABF for context extension; MoE: 128 experts, 8 active
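The GQA mechanism in the table above can be made concrete with a minimal NumPy sketch. The head counts and dimensions below are toy values, not Qwen's actual configuration: each K/V head is shared by a group of query heads, so the KV cache shrinks by the group factor.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention sketch.

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    The KV cache stores only num_kv_heads heads, i.e. it is
    num_q_heads / num_kv_heads times smaller than with MHA.
    """
    num_q_heads, num_kv_heads = q.shape[0], k.shape[0]
    group = num_q_heads // num_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                              # query head h reads a shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)         # (seq, seq) scaled dot products
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kv]
    return out

# 8 query heads sharing 2 KV heads (group size 4)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
o = gqa(q, k, v)
```

With MHA the cache would hold 8 K/V heads per layer; here it holds 2, a 4x reduction at the same query-head count.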

MoE Architecture (Qwen3)

Qwen3 introduces MoE (Mixture-of-Experts) variants with significant architectural improvements:

  • Expert Configuration: 128 total experts with 8 experts activated per token
  • Fine-grained Expert Partitioning: Enhanced expressiveness through specialized experts
  • Global Batch Load-balancing Loss: Promotes expert specialization and balanced utilization
  • Key Difference from Qwen2.5-MoE: removes the shared-expert design for better efficiency
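The routing described above can be sketched in a few lines. Toy sizes and an illustrative gating/auxiliary-loss formulation (not Qwen3's exact implementation): each token's router logits select 8 of 128 experts, and a load-balancing term penalizes uneven expert utilization.

```python
import numpy as np

def route_topk(logits, k=8):
    """Pick the top-k experts per token; softmax-normalize their gate weights."""
    idx = np.argsort(logits, axis=-1)[:, -k:]           # (tokens, k) expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                       # gates sum to 1 per token
    return idx, w

def load_balance_loss(logits, idx, num_experts):
    """Auxiliary loss: mean router probability per expert times the fraction
    of tokens routed to it, summed over experts. It is minimized when load
    is uniform, which encourages balanced expert utilization."""
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    frac = np.bincount(idx.ravel(), minlength=num_experts) / idx.size
    return num_experts * float((p.mean(axis=0) * frac).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 128))    # 32 tokens, 128 experts
idx, gates = route_topk(logits, k=8)   # 8 active experts per token
aux = load_balance_loss(logits, idx, 128)
```

Qwen3 computes this balancing objective over the global batch rather than per device, which gives experts room to specialize while keeping aggregate load even.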

Context Length Extension Techniques

  • Qwen: Standard RoPE with 2K context
  • Qwen2/Qwen2.5: YARN (Yet Another RoPE extensioN) enables 32K-128K context
  • Qwen3: ABF (Adaptive Base Frequency) + YARN for optimized long-context handling up to 128K tokens
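The effect of raising RoPE's base frequency, which is the core of ABF, can be shown numerically. The base values below are illustrative, not Qwen3's published settings: a larger base lengthens each rotary dimension's wavelength, so positions far beyond the original training range still fall within one rotation period.

```python
import math

def rope_wavelengths(head_dim, base):
    """Per-dimension RoPE wavelength: the position offset after which that
    rotary pair completes a full rotation. A larger base stretches every
    wavelength, the slowest dimensions most of all."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# ABF in essence: raise the base (10_000 -> 1_000_000 here, illustrative values)
short = rope_wavelengths(128, 10_000)
long_ = rope_wavelengths(128, 1_000_000)
ratio = long_[-1] / short[-1]   # the slowest dimension is stretched the most
```

YARN then interpolates the remaining gap, scaling the frequencies non-uniformly so short-range dimensions keep their resolution while long-range ones cover the extended window.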

Tokenization

Qwen
  • Tokenizer: tiktoken (BBPE)
  • Base Vocabulary: cl100k_base
  • Vocabulary Size: ~152k
  • Special Features: multilingual augmentation (primarily Chinese); single-digit split

Qwen2
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen
  • Vocabulary Size: 151,646 (151,643 regular + 3 control)
  • Special Features: same as Qwen

Qwen2.5
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen2
  • Vocabulary Size: 151,646 (151,624 regular + 22 control)
  • Special Features: control tokens expanded from 3 to 22 (2 tool-related; 20 for other model capabilities)

Qwen3
  • Tokenizer: BBPE
  • Base Vocabulary: Qwen2.5
  • Vocabulary Size: 151,646 (151,624 regular + 22 control)
  • Special Features: same as Qwen2.5
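The single-digit split noted above can be sketched as a pre-tokenization step that breaks digit runs into individual digits before BPE merges are applied, which keeps numeric strings compositional. A simplified regex illustration, not the actual tokenizer code:

```python
import re

# Split runs of digits into single-digit tokens ahead of byte-level BPE,
# so "2024" pre-tokenizes as "2", "0", "2", "4" rather than one chunk.
PRETOKEN = re.compile(r"\d|[^\d\s]+|\s+")

def pretokenize(text):
    """Return pre-token pieces: single digits, word runs, and whitespace."""
    return PRETOKEN.findall(text)

pieces = pretokenize("Qwen released in 2023")
```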

Pre-training Data

Qwen
  • Training Corpus: up to 3T tokens
  • Data Sources: public web documents; encyclopedias; books; code
  • Data Processing: deduplication; hybrid quality scrubbing; selective up-sampling; instruction-augmented pretraining; data decontamination
  • Language Support: primarily English and Chinese

Qwen2
  • Training Corpus: 7T tokens
  • Data Sources: web-scale data; high-quality code; mathematics data; multilingual data
  • Data Processing: quality enhancement (heuristic and model-based methods); Qwen models used for data synthesis; distribution improvement (optimized data mixing)
  • Language Support: ~30 languages

Qwen2.5
  • Training Corpus: 18T tokens
  • Data Sources: web-scale data; high-quality domain-specific datasets (math, code); synthetic data (Qwen2-72B-Instruct, Qwen2-Math-72B-Instruct)
  • Data Processing: better data filtering (Qwen2-Instruct model); better synthetic data (filtered with reward models); better data mixture (down-sample overrepresented domains, up-sample high-value domains)
  • Language Support: -

Qwen3
  • Training Corpus: 36T tokens (30T + 5T + 1T)
  • Data Sources: web; PDF-like documents (extracted with Qwen2.5-VL, refined with Qwen2.5); synthetic data (Qwen2.5-Math, Qwen2.5-Coder)
  • Data Processing: Stage 1: 30T tokens, 4K context; Stage 2: 5T tokens of knowledge-intensive data; Stage 3: context extended to 32K
  • Language Support: 119 languages and dialects

Pre-training

Qwen
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: batch size varies by model size; weight decay 0.1; gradient clipping 1.0

Qwen2
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: batch size varies by model size; weight decay 0.1; gradient clipping 1.0

Qwen2.5
  • Training Stages: single stage
  • Training Objective: next-token prediction
  • Context Length: 2K
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: same as Qwen2

Qwen3
  • Training Stages: Stage 1: 30T tokens, 4K context; Stage 2: 5T tokens, knowledge-intensive data; Stage 3: context extended to 32K
  • Training Objective: next-token prediction
  • Context Length: 4K → 32K (multi-stage)
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: cosine schedule (peak → 10% of peak)
  • Precision: BFloat16 mixed precision
  • Key Hyperparameters: multi-stage training; context length extension; ABF + YARN techniques

Supervised Fine-Tuning (SFT)

Qwen
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: weight decay 0.1; dropout 0.1; gradient clipping 1.0; context length 2048

Qwen2
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: weight decay 0.1; dropout 0.1; gradient clipping 1.0; context length 2048

Qwen2.5
  • Training Objective: next-token prediction
  • Data Format: chat-style data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup to 2×10⁻⁶ (1430 steps)
  • Batch Size: 128
  • Training Steps: 4000
  • Key Hyperparameters: same as Qwen2

Qwen3
  • Training Objective: next-token prediction
  • Data Format: long CoT data + instruction-tuning data
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸)
  • Learning Rate: warmup schedule
  • Batch Size: 128
  • Training Steps: varies
  • Key Hyperparameters: long CoT data focus; thinking mode fusion; extended context support
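Chat-style SFT still optimizes next-token prediction, but the loss is typically computed only on the response tokens. A minimal sketch of that label masking, assuming the common -100 ignore-index convention (an illustrative choice, not a documented Qwen detail):

```python
# Next-token prediction on chat data: prompt positions are masked out of the
# loss by setting their labels to the ignore index (-100 by convention in
# common training frameworks -- an illustrative choice here).
IGNORE = -100

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt + answer; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Toy token ids: a 3-token prompt followed by a 3-token answer
inp, lab = build_labels([101, 7, 8], [42, 43, 102])
```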

Reinforcement Learning from Human Feedback (RLHF)

Qwen
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: standard RLHF pipeline; reward model training; PPO optimization

Qwen2
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: enhanced reward modeling; improved safety training; better alignment

Qwen2.5
  • RL Method: PPO
  • Reward Model: human preference-based
  • Training Data: human feedback data
  • Key Features: same as Qwen2

Qwen3
  • RL Method: multi-stage RL
  • Reward Model: rule-based + general reward model
  • Training Data: Stage 2: reasoning-based RL; Stage 4: general RL (20+ tasks)
  • Key Features: reasoning-based RL (Stage 2); rule-based rewards; general RL across 20+ tasks; enhanced exploration
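The rule-based rewards in Qwen3's reasoning RL stage can be illustrated with a verifiable-answer check; the exact rules are not public, so the answer format and matching below are a generic sketch:

```python
import re

def rule_based_reward(completion, gold_answer):
    """Toy verifiable reward: 1.0 if the model's final 'Answer:' value matches
    the reference exactly, else 0.0. The format and matching rule are
    illustrative, not Qwen3's actual reward specification."""
    m = re.search(r"Answer:\s*([^\n]+)", completion)
    if m is None:
        return 0.0                      # no parseable final answer
    return 1.0 if m.group(1).strip() == gold_answer else 0.0

r_good = rule_based_reward("... reasoning ...\nAnswer: 42", "42")
r_bad = rule_based_reward("Answer: 41", "42")
```

Rewards like this are automatically checkable, which is what lets reasoning RL scale without per-sample human labels; the later general-RL stage falls back to a learned reward model for open-ended tasks.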

Model Variants & Sizes

Qwen
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: -
  • Specialized Variants: Code-Qwen (coding-specialized); Code-Qwen-Chat (coding-specialized chat); Math-Qwen-Chat (mathematics-focused)
  • Context Length Support: 2K

Qwen2
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: MoE variants
  • Specialized Variants: -
  • Context Length Support: 32K-128K (with YARN)

Qwen2.5
  • Dense Models: base models; chat models (RLHF)
  • MoE Models: MoE variants (with shared experts)
  • Specialized Variants: -
  • Context Length Support: 32K-128K (with YARN)

Qwen3
  • Dense Models: 0.6B to 32B parameters; base and post-trained models
  • MoE Models: Qwen3-30B-A3B (30B total, 3B active); Qwen3-235B-A22B (235B total, 22B active)
  • Specialized Variants: -
  • Context Length Support: 32K-128K (ABF + YARN)

Key Innovations by Version

Architecture
  • Qwen: Untied Embedding; RoPE (FP32 precision); QKV bias; Flash Attention
  • Qwen2: GQA (Grouped Query Attention); Dual Chunk Attention (DCA); YARN extension; MoE architecture
  • Qwen2.5: same as Qwen2
  • Qwen3: QK-Norm (replaces QKV bias); advanced MoE (128 experts, 8 active, no shared experts); ABF + YARN

Training
  • Qwen: standard pre-training
  • Qwen2: enhanced data quality
  • Qwen2.5: better data filtering; better synthetic data; optimized data mixture
  • Qwen3: multi-stage pre-training (3 stages); multi-stage RLHF (4 stages)

Tokenization
  • Qwen: tiktoken (BBPE); ~152k vocabulary
  • Qwen2: BBPE; 151,646 tokens (3 control)
  • Qwen2.5: expanded control tokens (3 → 22); tool-related tokens
  • Qwen3: same as Qwen2.5

Context Length
  • Qwen: 2K
  • Qwen2: 32K-128K (YARN)
  • Qwen2.5: 32K-128K (YARN)
  • Qwen3: 32K-128K (ABF + YARN)

Special Features
  • Qwen: Code-Qwen variants; Math-Qwen variants
  • Qwen2: -
  • Qwen2.5: model-based data synthesis; reward-model filtering
  • Qwen3: hybrid thinking modes; enhanced agent capabilities; MCP support