Qwen2.5 Technical Report
Qwen Team, Alibaba Group
arXiv 2412.15115
TL;DR
Motivation
Key Innovations
Approach
Tokenization
Qwen2.5 keeps the Qwen byte-level BPE (BBPE) tokenizer. The vocabulary has 151,665 tokens: 151,643 regular and 22 control. The control tokens are expanded from 3 to 22, including two tool-related tokens and 20 for other model capabilities.
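To make "byte-level BPE" concrete, here is a minimal sketch of the merge-learning loop, not the actual Qwen tokenizer. It assumes the usual BBPE setup: the base vocabulary is the 256 byte values, and each learned merge gets the next free ID. The function names (`train_bbpe`, `merge_pair`, `most_frequent_pair`) are hypothetical, and control tokens are not modeled.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most frequent adjacent pair of token IDs, or None."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bbpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # assumption: merged tokens are numbered after the bytes
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


ids, merges = train_bbpe("aaabdaaabac", num_merges=1)
```

Because the starting alphabet is bytes rather than characters, any UTF-8 string can be encoded without an unknown-token fallback; a real tokenizer adds pre-tokenization and a far larger merge table on top of this loop.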
Model Architecture
Dense Model
- GQA for efficient KV cache utilization
- SwiGLU for activation
- RoPE for positional embedding
- QKV bias for attention
- RMSNorm and pre-normalization for training stability
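A rough sketch of three of these components in pure Python may help: the KV-cache saving from GQA, RMSNorm, and a SwiGLU feed-forward layer wired in pre-normalization style. This is an illustration under stated assumptions, not the Qwen2.5 implementation: all function names are hypothetical, the layer and head counts in the usage lines are illustrative values, vectors are plain lists, and RoPE and QKV bias are left out.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def silu(a):
    """SiLU (swish) activation: a * sigmoid(a)."""
    return a / (1.0 + math.exp(-a))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sublayer input, then add the residual."""
    y = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]


# Illustrative configuration (assumed values): 80 layers, head_dim 128,
# 64 query heads. Full multi-head attention would cache 64 KV heads;
# GQA with 8 KV heads caches 8.
mha_bytes = kv_cache_bytes(80, 64, 128, seq_len=1024)
gqa_bytes = kv_cache_bytes(80, 8, 128, seq_len=1024)
```

With these assumed values the GQA cache is 8 times smaller than the multi-head one, since cache size scales linearly with the number of KV heads while the number of query heads is unchanged.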
Mixture-of-Experts (MoE) Model
Pre-training
Pre-training Data
- Better data filtering: use Qwen2-Instruct models as quality filters to evaluate and score training samples.
- Better math and code data: incorporate high-quality domain-specific datasets (math, code) during pre-training.
- Better synthetic data:
  - leverage both Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct to generate high-quality synthetic data, particularly in the mathematics, code, and knowledge domains.
  - further enhance the quality of the synthesized data through rigorous filtering with a proprietary general reward model and the specialized Qwen2-Math-RM-72B model.
- Better data mixture:
  - Domains like e-commerce, social media, and entertainment are significantly overrepresented in web-scale data, often containing repetitive, template-based, or machine-generated content.
  - Domains such as technology, science, and academic research, while containing higher-quality information, are traditionally underrepresented.
  - Down-sample overrepresented domains and up-sample high-value domains.
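The filtering and re-mixing steps above can be sketched as follows. This is a toy illustration, not the pipeline used for Qwen2.5: `filter_by_score` and `rebalance_domains` are hypothetical helpers, the scoring function stands in for a reward model or quality filter, and the thresholds and per-domain multipliers are assumed values that the report does not specify.

```python
import random
from collections import defaultdict


def filter_by_score(samples, score_fn, threshold):
    """Keep samples whose quality score reaches the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


def rebalance_domains(docs, multipliers, seed=0):
    """Down-sample or up-sample each domain by its multiplier.

    A multiplier below 1 keeps a random subset of the domain; a multiplier
    above 1 repeats documents. Domains without an entry are left unchanged.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    out = []
    for domain, items in by_domain.items():
        target = round(len(items) * multipliers.get(domain, 1.0))
        if target <= len(items):
            # Down-sampling: random subset without replacement.
            out.extend(rng.sample(items, target))
        else:
            # Up-sampling: whole copies, plus a random partial copy.
            full, rest = divmod(target, len(items))
            out.extend(items * full)
            out.extend(rng.sample(items, rest))
    rng.shuffle(out)
    return out


# Toy corpus: 10 e-commerce documents and 4 science documents.
corpus = [{"domain": "e-commerce", "text": f"ad {i}"} for i in range(10)]
corpus += [{"domain": "science", "text": f"paper {i}"} for i in range(4)]
mixed = rebalance_domains(corpus, {"e-commerce": 0.5, "science": 2.0})
```

Repeating documents is the simplest form of up-sampling; a production pipeline would more likely adjust sampling probabilities per domain at data-loading time rather than materialize copies.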