Qwen2 Technical Report
Qwen Team, Alibaba Group
arXiv 2407.10671
QwenLM/Qwen2
Qwen/qwen2
TL;DR
Motivation
Key Innovations
Approach
Tokenization
Identical to Qwen, the tokenizer uses byte-level byte-pair encoding (BBPE) with a total vocabulary size of 151,646: 151,643 regular tokens and 3 control tokens.
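The key property of byte-level BPE is that any UTF-8 string decomposes into the 256 base byte tokens, so no input is ever out-of-vocabulary. A toy illustration of that fallback level (my own sketch, not the actual Qwen2 merge table):

```python
def byte_fallback_ids(text: str) -> list[int]:
    """Map text to base byte tokens (ids 0-255), the BBPE fallback level.

    A real BBPE tokenizer first applies learned merges; characters that
    no merge covers always decompose into these base bytes.
    """
    return list(text.encode("utf-8"))

ids = byte_fallback_ids("Qwen2 你好")
# 6 ASCII characters take 1 byte each; each CJK character takes 3 bytes
print(len(ids))  # 12
```

Because every byte id is below 256, the 3 control tokens and learned merges sit on top of a base vocabulary that already covers all possible input.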
Model Architecture
| Configuration | 0.5B | 1.5B | 7B | 72B | 57B-A14B |
|---|---|---|---|---|---|
| Hidden Size | 896 | 1,536 | 3,584 | 8,192 | 3,584 |
| # Layers | 24 | 28 | 28 | 80 | 28 |
| # Query Heads | 14 | 12 | 28 | 64 | 28 |
| # KV Heads | 2 | 2 | 4 | 8 | 4 |
| Head Size | 64 | 128 | 128 | 128 | 128 |
| Intermediate Size | 4,864 | 8,960 | 18,944 | 29,568 | 2,560 |
| # Routed Experts | - | - | - | - | 64 |
| # Activated Experts | - | - | - | - | 8 |
| # Shared Experts | - | - | - | - | 8 |
| Embedding Tying | True | True | False | False | False |
| Vocabulary Size | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |
| # Trained Tokens | 12T | 7T | 7T | 7T | 4.5T |
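As a sanity check, the table's configuration roughly reproduces the headline parameter counts. A back-of-envelope estimator (my own sketch; it assumes GQA with QKV bias, a SwiGLU MLP with gate/up/down projections, and RMSNorm weights, and ignores minor terms):

```python
def approx_dense_params(hidden, layers, q_heads, kv_heads, head_dim,
                        intermediate, vocab, tied_embeddings):
    """Rough parameter count for a Qwen2-style dense decoder."""
    q = hidden * q_heads * head_dim + q_heads * head_dim            # Wq + bias
    kv = 2 * (hidden * kv_heads * head_dim + kv_heads * head_dim)   # Wk, Wv + biases
    o = q_heads * head_dim * hidden                                 # Wo (no bias)
    mlp = 3 * hidden * intermediate                                 # gate, up, down
    norms = 2 * hidden                                              # two RMSNorms per layer
    per_layer = q + kv + o + mlp + norms
    embed = vocab * hidden * (1 if tied_embeddings else 2)          # input (+ untied output) embeddings
    return layers * per_layer + embed + hidden                      # + final RMSNorm

# Qwen2-7B configuration from the table above: ~7.6B parameters
print(approx_dense_params(3584, 28, 28, 4, 128, 18944, 151646, False) / 1e9)
```

The same function applied to the 0.5B column (with embedding tying) lands near 0.49B, which is why that model can afford the tied embeddings.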
Dense Model
- Grouped Query Attention (GQA): Qwen2 adopts GQA in place of conventional multi-head attention (MHA). GQA reduces KV cache usage during inference, significantly improving throughput.
- Dual Chunk Attention (DCA) with YaRN: to extend the context window, DCA segments long sequences into chunks of manageable length, while YaRN rescales the attention weights for better length extrapolation.
- Moreover, following Qwen, the models use SwiGLU (Dauphin et al., 2017) as the activation, Rotary Positional Embeddings (RoPE; Su et al., 2024) for position encoding, QKV bias (Su, 2023) in attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.
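The KV-cache saving from GQA is easy to quantify: per token, the cache stores one key and one value vector per KV head per layer, so shrinking the number of KV heads shrinks the cache proportionally. A quick calculation for Qwen2-7B in bfloat16, using the head counts from the table above (the MHA variant is hypothetical, for comparison):

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_param=2):
    # 2 tensors (K and V) per layer, each kv_heads * head_dim values
    return 2 * layers * kv_heads * head_dim * bytes_per_param

gqa = kv_cache_bytes_per_token(28, kv_heads=4, head_dim=128)   # Qwen2-7B: 57,344 B/token
mha = kv_cache_bytes_per_token(28, kv_heads=28, head_dim=128)  # hypothetical MHA variant
print(mha // gqa)  # 7x smaller cache with GQA
```

At a 32,768-token context that difference is roughly 1.8 GB vs 12.5 GB of cache per sequence, which is where the throughput gain comes from.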
Mixture-of-Experts (MoE) Model
| Stages | Pre-training | SFT | Reinforcement Learning |
|---|---|---|---|
| **Hyperparameters** | | | |
| Purpose | Language foundations & world knowledge | Chat-style alignment & instruction following | Human preference alignment |
| Training Objective | Next-token prediction | Next-token prediction | Direct Preference Optimization (DPO) |
| Vocabulary Size | 151,643 regular tokens + 3 control tokens | - | - |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | - | - |
| Learning Rate | Cosine schedule (peak → 10% of peak) | 7×10⁻⁶ → 7×10⁻⁷ (linear decay) | - |
| Precision | BFloat16 mixed precision | - | - |
| Batch Size | 128 | - | - |
| Training Epochs | - | 2 | - |
| Weight Decay | - | 0.1 | - |
| Gradient Clipping | - | 1.0 | - |
| Context Length | 2048 | 32,768 | - |
| **Data** | | | |
| Training Corpus | 7T tokens | 500,000+ instruction examples (instruction following, coding, mathematics, logical reasoning, role-playing, multilingualism, safety) | - |
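The pre-training learning-rate schedule (cosine, decaying to 10% of the peak) can be sketched as follows; the peak value and step count below are hypothetical placeholders, since the report does not state them here:

```python
import math

def cosine_lr(step, total_steps, peak, floor_ratio=0.1):
    """Cosine decay from `peak` down to `floor_ratio * peak`."""
    floor = floor_ratio * peak
    progress = step / total_steps
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

peak = 3e-4  # hypothetical peak; not stated in the notes above
print(cosine_lr(0, 10_000, peak))       # starts at the peak
print(cosine_lr(10_000, 10_000, peak))  # ends at 0.1 * peak
```

The SFT schedule in the table is simpler: a linear decay from 7×10⁻⁶ to 7×10⁻⁷ over the two epochs.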
Pre-training
Pre-training Data
- 7T tokens
- An attempt to further relax the quality threshold yielded a 12 trillion token dataset.

All Qwen2 dense models, excluding Qwen2-0.5B, were pre-trained on this large-scale dataset of over 7 trillion tokens. Qwen2-0.5B was pre-trained on the 12 trillion token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. As with previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
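"Upcycling" here means initializing the MoE model from a trained dense checkpoint: each expert starts from a copy of the dense FFN weights, and routing plus continued training differentiates the experts. A minimal sketch of that initialization (the replication scheme and toy weight format are my assumptions, not the report's recipe):

```python
import copy

def upcycle_ffn(dense_ffn_weights, num_experts):
    """Initialize MoE experts by replicating a dense FFN's weights.

    dense_ffn_weights: dict of parameter name -> nested list (a toy
    stand-in for real tensors). Every expert begins identical; further
    pre-training (the extra 4.5T tokens) specializes them.
    """
    return [copy.deepcopy(dense_ffn_weights) for _ in range(num_experts)]

dense = {"gate": [[0.1, 0.2]], "up": [[0.3]], "down": [[0.4]]}
experts = upcycle_ffn(dense, num_experts=4)
print(len(experts), experts[0] == dense)  # 4 True
```

In practice fine-grained MoE designs like Qwen2-57B-A14B also shuffle or slice the dense weights along the intermediate dimension when forming smaller experts; the sketch above shows only the plain-replication idea.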
- Quality Enhancement
  - the filtering pipeline is refined with additional heuristic and model-based methods.
- Qwen models are utilized to synthesize high-quality pre-training data.
- Data Expansion
- larger volume of high-quality code, mathematics, and multilingual data.
- supports approximately 30 languages.
- Distribution Improvement
  - experiments on scaled-down models are used to optimize the mixing of data from various sources and domains.