Qwen Series: Technical Summary
Model Architecture
| Model | Attention Mechanism | Positional Embedding | Activation | Normalization | Context Length | Embedding Strategy | Key Changes |
|---|---|---|---|---|---|---|---|
| Qwen | Multi-head Attention (MHA) with Flash Attention | RoPE (FP32 precision) | SwiGLU (FFN: 8/3 × hidden size) | Pre-Norm & RMSNorm | 2K | Untied Embedding | - |
| Qwen2 | GQA (Grouped Query Attention); DCA with YARN | RoPE with YARN extension | SwiGLU | Pre-Norm & RMSNorm | 32K-128K (with YARN) | Untied Embedding | - |
| Qwen2.5 | GQA | RoPE with YARN extension | SwiGLU | Pre-Norm & RMSNorm | 32K-128K (with YARN) | Untied Embedding | Same as Qwen2 |
| Qwen3 | GQA with QK-Norm | RoPE (ABF + YARN) | SwiGLU | Pre-Norm & RMSNorm | 32K-128K (ABF + YARN) | Untied Embedding (varies by size) | - |
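In GQA, several query heads share one key/value head, shrinking the KV cache roughly by the grouping factor. A minimal NumPy sketch of causal GQA (head counts and dimensions are illustrative, not Qwen's actual configuration):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where n_q query heads share n_kv key/value heads.
    q: (n_q, seq, d); k, v: (n_kv, seq, d) with n_q a multiple of n_kv."""
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    rep = n_q // n_kv
    # Each KV head serves `rep` query heads: expand K/V along the head axis.
    k = np.repeat(k, rep, axis=0)                    # (n_q, seq, d)
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_q, seq, seq)
    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)          # mask future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # softmax over positions
    return w @ v                                     # (n_q, seq, d)
```

With 8 query heads and 2 KV heads the cache shrinks 4×; Flash Attention (used since the first Qwen) changes how this is computed, not what it computes.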
MoE Architecture (Qwen3)
Qwen3 introduces MoE (Mixture-of-Experts) variants with significant architectural improvements:
- Expert Configuration: 128 total experts with 8 experts activated per token
- Fine-grained Expert Partitioning: Enhanced expressiveness through specialized experts
- Global Batch Load-balancing Loss: Promotes expert specialization and balanced utilization
- Key Difference from Qwen2.5-MoE: Removed shared experts design for better efficiency
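The routing described above (8 of 128 experts per token, plus a batch-level load-balancing signal) can be sketched as follows. The softmax-over-top-k router and the auxiliary-loss form are a common MoE recipe, assumed here rather than taken from Qwen3's implementation:

```python
import numpy as np

N_EXPERTS, TOP_K = 128, 8   # Qwen3 MoE: 8 experts activated out of 128

def route(router_logits):
    """Pick the top-k experts per token and softmax-normalize their weights.
    router_logits: (tokens, N_EXPERTS)."""
    idx = np.argpartition(router_logits, -TOP_K, axis=-1)[:, -TOP_K:]
    top = np.take_along_axis(router_logits, idx, axis=-1)
    w = np.exp(top - top.max(-1, keepdims=True))
    return idx, w / w.sum(-1, keepdims=True)

def load_balance_loss(router_logits, idx):
    """Auxiliary loss encouraging even expert utilization: penalize the
    correlation between each expert's routed-token fraction and its mean
    router probability, computed over the (global) batch."""
    p = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    p = (p / p.sum(-1, keepdims=True)).mean(axis=0)   # mean prob per expert
    f = np.bincount(idx.ravel(), minlength=N_EXPERTS) / idx.size
    return N_EXPERTS * float(f @ p)
```

Computing the loss over the global batch, rather than per micro-batch, is what lets experts specialize by domain while still being used evenly in aggregate.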
Context Length Extension Techniques
- Qwen: Standard RoPE with 2K context
- Qwen2/Qwen2.5: YARN (Yet Another RoPE extensioN) enables 32K-128K context
- Qwen3: ABF (Adjusted Base Frequency) + YARN for optimized long-context handling up to 128K tokens
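ABF raises RoPE's base so each rotation completes more slowly and far-apart positions stay distinguishable; YARN then interpolates the frequencies further. A sketch of the base-frequency effect (the base values 10,000 and 1,000,000 are illustrative, not Qwen's published settings):

```python
import math

def rope_inv_freq(dim, base):
    """Per-pair rotation frequencies for RoPE: base**(-2i/dim).
    Raising `base` (the ABF idea) shrinks the highest-index frequencies,
    stretching the longest wavelength the embedding can represent."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

default = rope_inv_freq(128, 10_000)      # classic RoPE base
raised  = rope_inv_freq(128, 1_000_000)   # ABF-style raised base (illustrative)

def longest_wavelength(freqs):
    # wavelength of the slowest rotating pair = 2*pi / smallest frequency
    return 2 * math.pi / freqs[-1]
```

The raised base yields a far longer maximum wavelength, which is why base adjustment precedes YARN-style interpolation when extending context to 128K.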
Tokenization
| Model | Tokenizer | Base Vocabulary | Vocabulary Size | Special Features |
|---|---|---|---|---|
| Qwen | tiktoken (BBPE) | cl100k_base | ~152k | - |
| Qwen2 | BBPE | Qwen | 151,646 (151,643 regular + 3 control) | Same as Qwen |
| Qwen2.5 | BBPE | Qwen2 | 151,646 (151,624 regular + 22 control) | Expanded control-token set (3 → 22) |
| Qwen3 | BBPE | Qwen2.5 | 151,646 (151,624 regular + 22 control) | Same as Qwen2.5 |
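Byte-level BPE (BBPE) starts from the 256 raw byte values and learns merges, so any UTF-8 string is tokenizable with no out-of-vocabulary symbols. A toy trainer showing the mechanism (not the production tiktoken/Qwen merge tables):

```python
from collections import Counter

def bbpe_train(text, num_merges):
    """Toy byte-level BPE: start from raw bytes, repeatedly merge the most
    frequent adjacent pair into a new token id (256, 257, ...)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        merges[pair] = new_id
        out, i = [], 0
        while i < len(ids):                     # apply the merge left-to-right
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids, merges
```

A real tokenizer adds pre-tokenization rules and the control tokens counted in the table on top of the learned merges.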
Pre-training Data
| Model | Training Corpus | Language Support |
|---|---|---|
| Qwen | Up to 3T tokens | Primarily English and Chinese |
| Qwen2 | 7T tokens | ~30 languages |
| Qwen2.5 | 18T tokens | - |
| Qwen3 | 36T tokens (30T + 5T + 1T) | 119 languages and dialects |
Pre-training
| Model | Training Stages | Training Objective | Context Length | Optimizer | Learning Rate | Precision | Key Hyperparameters |
|---|---|---|---|---|---|---|---|
| Qwen | Single stage | Next-token prediction | 2K | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Cosine schedule (peak → 10% of peak) | BFloat16 mixed precision | - |
| Qwen2 | Single stage | Next-token prediction | 2K | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Cosine schedule (peak → 10% of peak) | BFloat16 mixed precision | - |
| Qwen2.5 | Single stage | Next-token prediction | 2K | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Cosine schedule (peak → 10% of peak) | BFloat16 mixed precision | Same as Qwen2 |
| Qwen3 | Multi-stage | Next-token prediction | 4K → 32K (multi-stage) | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Cosine schedule (peak → 10% of peak) | BFloat16 mixed precision | - |
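The learning-rate column describes a cosine decay from the peak down to 10% of the peak. A minimal sketch (the peak value and step count below are placeholders; warmup is omitted):

```python
import math

def cosine_lr(step, total_steps, peak_lr, floor_ratio=0.1):
    """Cosine decay from peak_lr at step 0 to floor_ratio * peak_lr
    at total_steps -- the "peak -> 10% of peak" schedule in the table."""
    floor = peak_lr * floor_ratio
    progress = min(step / total_steps, 1.0)
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

Decaying to a nonzero floor rather than to zero keeps late-training updates meaningful on very large corpora.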
Supervised Fine-Tuning (SFT)
| Model | Training Objective | Data Format | Optimizer | Learning Rate | Batch Size | Training Steps | Key Hyperparameters |
|---|---|---|---|---|---|---|---|
| Qwen | Next-token prediction | Chat-style data | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Warmup to 2×10⁻⁶ (1430 steps) | 128 | 4000 | - |
| Qwen2 | Next-token prediction | Chat-style data | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Warmup to 2×10⁻⁶ (1430 steps) | 128 | 4000 | - |
| Qwen2.5 | Next-token prediction | Chat-style data | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Warmup to 2×10⁻⁶ (1430 steps) | 128 | 4000 | Same as Qwen2 |
| Qwen3 | Next-token prediction | Long CoT data + instruction-tuning data | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | Warmup schedule | 128 | Varies | - |
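Chat-style SFT still optimizes next-token prediction, but loss is usually taken only on assistant-response tokens. A sketch of the standard label-masking convention (the -100 ignore value follows PyTorch's cross_entropy default; that Qwen masks exactly this way is an assumption):

```python
IGNORE_INDEX = -100  # label value common loss implementations skip
                     # (PyTorch cross_entropy's default ignore_index)

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response token ids; supervise only the
    response so no gradient comes from user/system text."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

For multi-turn data the same masking is applied per turn, ignoring every non-assistant span.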
Reinforcement Learning from Human Feedback (RLHF)
| Model | RL Method | Reward Model | Training Data | Key Features |
|---|---|---|---|---|
| Qwen | PPO | Human preference-based | Human feedback data | - |
| Qwen2 | PPO | Human preference-based | Human feedback data | - |
| Qwen2.5 | PPO | Human preference-based | Human feedback data | Same as Qwen2 |
| Qwen3 | Multi-stage RL | Rule-based + general reward model | - | - |
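A rule-based reward, unlike a learned reward model, scores completions with verifiable checks. A toy example for math-style answers (the `\boxed{}` answer convention and binary scoring are illustrative, not Qwen3's actual rule set):

```python
import re

def rule_based_reward(completion: str, reference: str) -> float:
    """Binary reward for verifiable tasks: 1.0 if the last \\boxed{...}
    answer in the completion matches the reference exactly, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0
```

Rule-based signals like this cover tasks with checkable answers (math, code), while a general reward model scores open-ended responses; multi-stage RL can combine both.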
Model Variants & Sizes
| Model | Dense Models | MoE Models | Specialized Variants | Context Length Support |
|---|---|---|---|---|
| Qwen | Base models; Chat models (RLHF) | - | - | 2K |
| Qwen2 | Base models; Chat models (RLHF) | MoE variants | - | 32K-128K (with YARN) |
| Qwen2.5 | Base models; Chat models (RLHF) | MoE variants (with shared experts) | - | 32K-128K (with YARN) |
| Qwen3 | 0.6B to 32B parameters; base and post-trained models | MoE variants (128 experts, 8 activated; no shared experts) | - | 32K-128K (ABF + YARN) |
Key Innovations by Version
| Category | Qwen | Qwen2 | Qwen2.5 | Qwen3 |
|---|---|---|---|---|
| Architecture | MHA with Flash Attention | GQA; DCA with YARN | Same as Qwen2 | QK-Norm; MoE without shared experts |
| Training | Single-stage pre-training, up to 3T tokens | 7T tokens | 18T tokens | 36T tokens, multi-stage pre-training; multi-stage RL |
| Tokenization | tiktoken BBPE (~152k vocabulary) | BBPE on Qwen base vocabulary (151,646) | Expanded to 22 control tokens | Same as Qwen2.5 |
| Context Length | 2K | 32K-128K (YARN) | 32K-128K (YARN) | 32K-128K (ABF + YARN) |
| Special Features | Primarily English and Chinese | ~30 languages | - | 119 languages and dialects |