Qwen2 Technical Report

Qwen Team, Alibaba Group · arXiv: 2407.10671 · GitHub: QwenLM/Qwen2 · Hugging Face: Qwen/qwen2

TL;DR

Motivation

Key Innovations

Approach

Tokenization

As in Qwen, the tokenizer uses byte-level byte-pair encoding (BBPE) with a total vocabulary size of 151,646: 151,643 regular tokens and 3 control tokens.
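As a toy illustration of why byte-level BPE never produces out-of-vocabulary words (this is not Qwen's actual merge table, just the mechanism): every string decomposes into bytes, which form the 256 base tokens, and learned merges build larger tokens on top.

```python
# Toy byte-level BPE sketch: base vocabulary = 256 byte tokens, plus merges.

def byte_tokens(text: str) -> list[int]:
    # Any Unicode string maps to bytes, so there is no out-of-vocabulary input
    return list(text.encode("utf-8"))

def bpe_merge(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    # Replace every occurrence of `pair` with the merged token `new_id`
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = byte_tokens("aha")                  # [97, 104, 97]
merged = bpe_merge(tokens, (97, 104), 256)   # pretend merge "ah" -> token 256
print(merged)                                # [256, 97]
```

Multi-byte characters (CJK text, emoji) simply expand to several byte tokens before merges apply, which is why the vocabulary can stay fixed while covering any input.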

Model Architecture

| Configuration | 0.5B | 1.5B | 7B | 72B | 57B-A14B |
|---|---|---|---|---|---|
| Hidden Size | 896 | 1,536 | 3,584 | 8,192 | 3,584 |
| # Layers | 24 | 28 | 28 | 80 | 28 |
| # Query Heads | 14 | 12 | 28 | 64 | 28 |
| # KV Heads | 2 | 2 | 4 | 8 | 4 |
| Head Size | 64 | 128 | 128 | 128 | 128 |
| Intermediate Size | 4,864 | 8,960 | 18,944 | 29,568 | 2,560 |
| # Routed Experts | - | - | - | - | 64 |
| # Activated Experts | - | - | - | - | 8 |
| # Shared Experts | - | - | - | - | 8 |
| Embedding Tying | True | True | False | False | False |
| Vocabulary Size | 151,646 | 151,646 | 151,646 | 151,646 | 151,646 |
| # Trained Tokens | 12T | 7T | 7T | 7T | 4.5T |
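The KV-head counts in the table translate directly into inference-time KV-cache savings. A small back-of-the-envelope calculation using the Qwen2-7B row (the helper function name is ours, and BF16 storage is an assumption):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_size: int,
                             bytes_per_value: int = 2) -> int:
    # Per token and per layer, K and V each store n_kv_heads * head_size values;
    # bytes_per_value = 2 assumes BF16 storage.
    return 2 * n_layers * n_kv_heads * head_size * bytes_per_value

# Qwen2-7B from the table: 28 layers, 28 query heads, 4 KV heads, head size 128
gqa = kv_cache_bytes_per_token(28, 4, 128)    # GQA cache: 57,344 bytes/token
mha = kv_cache_bytes_per_token(28, 28, 128)   # hypothetical full-MHA cache
print(gqa, mha, mha // gqa)                   # 57344 401408 7
```

With 4 KV heads instead of 28 query heads, the cache shrinks 7x, which is the throughput lever the GQA bullet below refers to.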

Dense Model

  • Grouped Query Attention (GQA): Qwen2 adopts GQA in place of conventional multi-head attention (MHA). GQA reduces KV cache usage during inference, significantly improving throughput.
  • Dual Chunk Attention (DCA) with YARN: to extend the context window, DCA splits long sequences into chunks of manageable length; when the input fits within a single chunk, it is equivalent to standard attention. YARN is used to rescale the attention weights for better length extrapolation.
  • Moreover, following Qwen, we use SwiGLU (Dauphin et al., 2017) for activation, Rotary Positional Embeddings (RoPE; Su et al., 2024) for positional embedding, QKV bias (Su, 2023) for attention, and RMSNorm (Jiang et al., 2023b) with pre-normalization for training stability.
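The pre-norm + SwiGLU combination in the last bullet can be sketched in a few lines of numpy. This is a minimal illustration with toy sizes and random weights, not the actual model code:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by root-mean-square only (no mean-centering, unlike LayerNorm)
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * weight

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: gate silu(x W_gate) elementwise-multiplies x W_up, then down-project
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32             # toy sizes; Qwen2-7B uses 3,584 and 18,944
x = rng.standard_normal((2, d_model))
g = np.ones(d_model)              # learnable RMSNorm scale, initialized to 1
w_gate, w_up = rng.standard_normal((2, d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

# pre-normalization: normalize the input to the sublayer, then add the residual
out = x + swiglu_ffn(rms_norm(x, g), w_gate, w_up, w_down)
print(out.shape)  # (2, 8)
```

Note the gated structure is why SwiGLU layers carry three weight matrices instead of the usual two, which is also why the intermediate size in the table is chosen smaller than the classic 4x hidden size.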

Mixture-of-Experts (MoE) Model
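The 57B-A14B column of the configuration table lists 64 routed experts, 8 of which are activated per token, plus 8 always-on shared experts. A minimal numpy sketch of that routing pattern (toy dimensions, and plain linear maps standing in for the real FFN experts):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_routed, top_k, n_shared = 16, 64, 8, 8   # expert counts from the table
x = rng.standard_normal(d)

# toy experts: random linear maps instead of full SwiGLU FFNs
routed = rng.standard_normal((n_routed, d, d))
shared = rng.standard_normal((n_shared, d, d))
w_gate = rng.standard_normal((d, n_routed))

logits = x @ w_gate
top = np.argsort(logits)[-top_k:]              # indices of the 8 activated experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the top-k

# output = gate-weighted sum of activated routed experts + all shared experts
y = sum(w * (x @ routed[i]) for w, i in zip(weights, top))
y += sum(x @ shared[j] for j in range(n_shared))
print(y.shape, len(top))  # (16,) 8
```

Only the 8 selected routed experts (plus the shared ones) run per token, which is how a 57B-parameter model activates roughly 14B parameters per forward pass.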

| Stages | Pre-training | SFT | Reinforcement Learning |
|---|---|---|---|
| **Hyperparameters** | | | |
| Purpose | Language Foundations & World Knowledge | Chat-style Alignment & Instruction Following | Human Preference Alignment |
| Training Objective | Next-token prediction | Next-token prediction | Reward Maximization (PPO) |
| Vocabulary Size | 151,643 regular tokens and 3 control tokens | 151,643 regular tokens and 3 control tokens | - |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | - |
| Learning Rate | Cosine schedule (peak → 10% peak) | 7×10⁻⁶ → 7×10⁻⁷ (linear decay) | - |
| Precision | BFloat16 mixed precision | BFloat16 mixed precision | - |
| Batch Size | 128 | 128 | - |
| Training Epochs | 2 | 2 | - |
| Weight Decay | 0.1 | 0.1 | - |
| Gradient Clipping | 1.0 | 1.0 | - |
| Context Length | 2,048 | 32,768 | - |
| **Data** | | | |
| Training Corpus | 7T tokens | 500,000+ instruction examples (instruction following, coding, mathematics, logical reasoning, role-playing, multilingualism, safety) | - |
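The two learning-rate schedules in the table are easy to make concrete. A small sketch (function names and the 1e-4 peak are ours for illustration; the decay shapes and the 7×10⁻⁶ → 7×10⁻⁷ SFT range come from the table):

```python
import math

def cosine_lr(step: int, total_steps: int, peak: float, floor_frac: float = 0.1) -> float:
    # Cosine decay from `peak` down to floor_frac * peak, as in pre-training
    floor = floor_frac * peak
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * step / total_steps))

def linear_lr(step: int, total_steps: int, start: float = 7e-6, end: float = 7e-7) -> float:
    # Linear decay from 7e-6 to 7e-7, as in SFT
    return start + (end - start) * step / total_steps

print(cosine_lr(0, 1000, 1e-4))     # peak at the start
print(cosine_lr(1000, 1000, 1e-4))  # decays to 10% of peak
print(linear_lr(1000, 1000))        # decays to 7e-7
```

The cosine floor at 10% of peak means the model never trains at a vanishing learning rate late in pre-training, while SFT's linear decay over a fixed range keeps the fine-tuning step sizes small throughout.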

Pre-training

Pre-training Data

  • 7T tokens of curated data; an attempt to further relax the quality threshold resulted in a 12 trillion token dataset.

All Qwen2 dense models except Qwen2-0.5B were pre-trained on the 7-trillion-token dataset, while Qwen2-0.5B was pre-trained on the 12-trillion-token dataset. The MoE model received an additional 4.5 trillion tokens of pre-training, in line with the principle of upcycling. As in previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
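Upcycling here means initializing the MoE experts from an already-trained dense model's FFN weights rather than from scratch. One common recipe is to shuffle the dense up-projection along its intermediate dimension and carve out a slice per expert; the sketch below shows only that slicing idea with toy sizes (the actual initialization has more detail):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 32          # toy dense FFN sizes
d_expert, n_experts = 8, 4      # each expert receives a d_expert-wide slice

w_up_dense = rng.standard_normal((d_model, d_ffn))

# shuffle columns of the dense up-projection, then slice one block per expert
perm = rng.permutation(d_ffn)
shuffled = w_up_dense[:, perm]
experts = [shuffled[:, i * d_expert:(i + 1) * d_expert] for i in range(n_experts)]
print(len(experts), experts[0].shape)  # 4 (8, 8)
```

Because every expert starts from trained dense weights, the MoE model begins its extra 4.5T tokens of pre-training from a strong initialization instead of random noise.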

  • Quality Enhancement
    • Filtering is refined with additional heuristic and model-based methods.
    • Qwen models are used to synthesize high-quality pre-training data.
  • Data Expansion
    • A larger volume of high-quality code, mathematics, and multilingual data is collected.
    • The corpus supports approximately 30 languages.
  • Distribution Improvement
    • Experiments on scaled-down models are used to optimize the mixing of data from various sources and domains.

Post-training

Experiments

References