Qwen2 Technical Report
Qwen Team, Alibaba Group
arXiv 2407.10671
QwenLM/Qwen2
Qwen/qwen2
TL;DR
Motivation
Key Innovations
Approach
Tokenization
Identical to Qwen, the tokenizer uses byte-level byte-pair encoding (BBPE) with a total vocabulary size of 151,646: 151,643 regular tokens and 3 control tokens.
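The key property of byte-level BPE is that any UTF-8 string decomposes into the 256 base byte tokens, so no input is ever out-of-vocabulary. A toy illustration of that fallback level (my own sketch, not the actual Qwen2 merge table):

```python
def byte_fallback_ids(text: str) -> list[int]:
    """Map text to base byte tokens (ids 0-255), the BBPE fallback level.

    A real BBPE tokenizer first applies learned merges; characters that
    no merge covers always decompose into these base bytes.
    """
    return list(text.encode("utf-8"))

ids = byte_fallback_ids("Qwen2 你好")
# 6 ASCII characters take 1 byte each; each CJK character takes 3 bytes
print(len(ids))  # 12
```

Because every byte id is below 256, the 3 control tokens and learned merges sit on top of a base vocabulary that already covers all possible input.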
Model Architecture
| Configuration | 0.5B | 1.5B | 7B | 72B | 57B-A14B |
|---|---|---|---|---|---|
| Hidden Size | 896 | 1,536 | 3,584 | 8,192 | 3,584 |
| # Layers | 24 | 28 | 28 | 80 | 28 |
| # Query Heads | 14 | 12 | 28 | 64 | 28 |
| # KV Heads | 2 | 2 | 4 | 8 | 4 |
| Head Size | 64 | 128 | 128 | 128 | 128 |
| Intermediate Size | 4,864 | 8,960 | 18,944 | 29,568 | 2,560 |
| # Routed Experts | - | - | - | - | 64 |
| # Activated Experts | - | - | - | - | 8 |
| # Shared Experts | - | - | - | - | 8 |
| Embedding Tying | True | True | False | False | False |
| Vocabulary Size | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |
| # Trained Tokens | 12T | 7T | 7T | 7T | 4.5T |
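As a sanity check, the table's configuration roughly reproduces the headline parameter counts. A back-of-envelope estimator (my own sketch; it assumes GQA with QKV bias, a SwiGLU MLP with gate/up/down projections, and RMSNorm weights, and ignores minor terms):

```python
def approx_dense_params(hidden, layers, q_heads, kv_heads, head_dim,
                        intermediate, vocab, tied_embeddings):
    """Rough parameter count for a Qwen2-style dense decoder."""
    q = hidden * q_heads * head_dim + q_heads * head_dim            # Wq + bias
    kv = 2 * (hidden * kv_heads * head_dim + kv_heads * head_dim)   # Wk, Wv + biases
    o = q_heads * head_dim * hidden                                 # Wo (no bias)
    mlp = 3 * hidden * intermediate                                 # gate, up, down
    norms = 2 * hidden                                              # two RMSNorms per layer
    per_layer = q + kv + o + mlp + norms
    embed = vocab * hidden * (1 if tied_embeddings else 2)          # input (+ untied output) embeddings
    return layers * per_layer + embed + hidden                      # + final RMSNorm

# Qwen2-7B configuration from the table above: ~7.6B parameters
print(approx_dense_params(3584, 28, 28, 4, 128, 18944, 151646, False) / 1e9)
```

The same function applied to the 0.5B column (with embedding tying) lands near 0.49B, which is why that model can afford the tied embeddings.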
Dense Model
- Grouped Query Attention (GQA): Qwen2 adopts GQA in place of conventional multi-head attention (MHA). GQA reduces KV cache usage during inference, significantly improving throughput.
- Dual Chunk Attention (DCA) with YaRN: to extend the context window, DCA segments long sequences into chunks of manageable length, while YaRN rescales the attention weights for better length extrapolation.
- Moreover, following Qwen, the models use SwiGLU (Dauphin et al., 2017) as the activation, Rotary Positional Embeddings (RoPE; Su et al., 2024) for position encoding, QKV bias (Su, 2023) in attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.
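The KV-cache saving from GQA is easy to quantify: per token, the cache stores one key and one value vector per KV head per layer, so shrinking the number of KV heads shrinks the cache proportionally. A quick calculation for Qwen2-7B in bfloat16, using the head counts from the table above (the MHA variant is hypothetical, for comparison):

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_param=2):
    # 2 tensors (K and V) per layer, each kv_heads * head_dim values
    return 2 * layers * kv_heads * head_dim * bytes_per_param

gqa = kv_cache_bytes_per_token(28, kv_heads=4, head_dim=128)   # Qwen2-7B: 57,344 B/token
mha = kv_cache_bytes_per_token(28, kv_heads=28, head_dim=128)  # hypothetical MHA variant
print(mha // gqa)  # 7x smaller cache with GQA
```

At a 32,768-token context that difference is roughly 1.8 GB vs 12.5 GB of cache per sequence, which is where the throughput gain comes from.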
Mixture-of-Experts (MoE) Model
| Stages | Pre-training | SFT | Reinforcement Learning |
|---|---|---|---|
| **Hyperparameters** | | | |
| Purpose | Language foundations & world knowledge | Chat-style alignment & instruction following | Human preference alignment |
| Training Objective | Next-token prediction | Next-token prediction | Direct Preference Optimization (DPO) |
| Vocabulary Size | 151,643 regular tokens + 3 control tokens | - | - |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | - | - |
| Learning Rate | Cosine schedule (peak → 10% of peak) | 7×10⁻⁶ → 7×10⁻⁷ (linear decay) | - |
| Precision | BFloat16 mixed precision | - | - |
| Batch Size | 128 | - | - |
| Training Epochs | - | 2 | - |
| Weight Decay | - | 0.1 | - |
| Gradient Clipping | - | 1.0 | - |
| Context Length | 2048 | 32,768 | - |
| **Data** | | | |
| Training Corpus | 7T tokens | 500,000+ instruction examples (instruction following, coding, mathematics, logical reasoning, role-playing, multilingualism, safety) | - |
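The pre-training learning-rate schedule (cosine, decaying to 10% of the peak) can be sketched as follows; the peak value and step count below are hypothetical placeholders, since the report does not state them here:

```python
import math

def cosine_lr(step, total_steps, peak, floor_ratio=0.1):
    """Cosine decay from `peak` down to `floor_ratio * peak`."""
    floor = floor_ratio * peak
    progress = step / total_steps
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

peak = 3e-4  # hypothetical peak; not stated in the notes above
print(cosine_lr(0, 10_000, peak))       # starts at the peak
print(cosine_lr(10_000, 10_000, peak))  # ends at 0.1 * peak
```

The SFT schedule in the table is simpler: a linear decay from 7×10⁻⁶ to 7×10⁻⁷ over the two epochs.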
Pre-training
Pre-training Data
- 7T tokens
- An attempt to further relax the quality threshold yielded a 12 trillion token dataset.

All Qwen2 dense models, excluding Qwen2-0.5B, were pre-trained on this large-scale dataset of over 7 trillion tokens. Qwen2-0.5B was pre-trained on the 12 trillion token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. As with previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
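"Upcycling" here means initializing the MoE model from a trained dense checkpoint: each expert starts from a copy of the dense FFN weights, and routing plus continued training differentiates the experts. A minimal sketch of that initialization (the replication scheme and toy weight format are my assumptions, not the report's recipe):

```python
import copy

def upcycle_ffn(dense_ffn_weights, num_experts):
    """Initialize MoE experts by replicating a dense FFN's weights.

    dense_ffn_weights: dict of parameter name -> nested list (a toy
    stand-in for real tensors). Every expert begins identical; further
    pre-training (the extra 4.5T tokens) specializes them.
    """
    return [copy.deepcopy(dense_ffn_weights) for _ in range(num_experts)]

dense = {"gate": [[0.1, 0.2]], "up": [[0.3]], "down": [[0.4]]}
experts = upcycle_ffn(dense, num_experts=4)
print(len(experts), experts[0] == dense)  # 4 True
```

In practice fine-grained MoE designs like Qwen2-57B-A14B also shuffle or slice the dense weights along the intermediate dimension when forming smaller experts; the sketch above shows only the plain-replication idea.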
- Quality Enhancement
  - the filtering pipeline is refined with additional heuristic and model-based methods.
- Qwen models are utilized to synthesize high-quality pre-training data.
- Data Expansion
- larger volume of high-quality code, mathematics, and multilingual data.
- supports approximately 30 languages.
- Distribution Improvement
  - experiments on scaled-down models are used to optimize the mixing of data from various sources and domains.