Contents

Qwen Technical Report

Qwen Team, Alibaba Group · arXiv: 2309.16609 · GitHub: QwenLM/Qwen · Hugging Face: Qwen/qwen

TL;DR

Motivation

Key Innovations

  • Qwen: the base pretrained language models
  • Qwen-Chat: chat models finetuned with human alignment techniques (RLHF)
  • Code-Qwen: coding-specialized models
  • Code-Qwen-Chat: coding-specialized chat models
  • Math-Qwen-Chat: mathematics-focused chat models

./images/index-20260121145208.webp

Approach

Tokenization

  • Tokenizer: tiktoken (BBPE)
  • Base Vocabulary: cl100k_base
  • Augmentation: multilingual vocabulary augmentation (primarily Chinese)
  • Special Handling: numbers are split into single digits
  • Vocabulary Size: approximately 152k
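The digit-splitting rule above can be sketched as a pre-tokenization step: every run of digits is broken into single-digit pieces before BPE merges run. This is an illustrative regex only, not Qwen's actual tiktoken pattern.

```python
import re

def split_digits(text: str) -> list[str]:
    # Alternation tries the single-digit branch first, so each digit
    # becomes its own piece; non-digit runs stay whole.
    return re.findall(r"\d|[^\d]+", text)

pieces = split_digits("price: 1234 yuan")
# -> ['price: ', '1', '2', '3', '4', ' yuan']
```

Keeping digits separate prevents the vocabulary from memorizing arbitrary multi-digit chunks, which tends to help arithmetic-style generalization.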

Encoding Compression Rate: Qwen achieves higher compression efficiency than its competitors in most languages.

./images/index-20260122174424.webp

Despite the increase in vocabulary size, Qwen maintains its performance levels in downstream evaluations.
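Compression efficiency is commonly measured as UTF-8 bytes of input text per produced token (higher is better). A minimal sketch, using a trivial whitespace tokenizer as a stand-in for Qwen's tiktoken-based BPE:

```python
def bytes_per_token(text: str, tokenize) -> float:
    # Compression rate: raw UTF-8 bytes divided by token count.
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / len(tokens)

whitespace = str.split  # stand-in tokenizer (assumption)
rate = bytes_per_token("Qwen compresses multilingual text well", whitespace)
# 38 bytes / 5 tokens -> 7.6 bytes per token
```

For languages like Chinese, a vocabulary augmented with in-language merges raises this ratio sharply, which is exactly the effect the figure compares across tokenizers.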

Model Architecture

./images/index-20260121145952.webp

Qwen is derived from the open-source LLaMA, incorporating the following modifications:

  • Untied Embeddings: separate weights for the input embedding and the output projection, trading higher memory cost for better performance.
  • RoPE Positional Embedding: use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
  • Bias: For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
  • Pre-Norm & RMSNorm:
    • improve training stability compared to post-norm.
    • RMSNorm maintains equivalent performance while improving efficiency.
  • Activation function:
    • use SwiGLU, a combination of Swish and Gated Linear Unit.
    • As is common practice in previous research, the dimension of the feed-forward network (FFN) is reduced from 4 times the hidden size to 8/3 of the hidden size.
  • Flash Attention: employ Flash Attention in the attention modules to improve computational efficiency and reduce memory usage.
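The SwiGLU activation above combines Swish (SiLU) with a multiplicative gate. A minimal elementwise sketch (in the real FFN, `gate` and `up` are two linear projections of the hidden state; scalars here for clarity):

```python
import math

def swish(x: float) -> float:
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(gate: float, up: float) -> float:
    # The "up" value passes through, scaled by swish(gate):
    # gate near 0 closes the unit, large positive gate opens it.
    return swish(gate) * up

swiglu(0.0, 5.0)  # -> 0.0 (gate fully closed)
```

Because SwiGLU uses two input projections instead of one, shrinking the FFN width to 8/3 of the hidden size keeps the parameter count roughly equal to a standard 4× FFN.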

Model Parameters

Qwen provides multiple model scales with the following architecture parameters:

  • Parameters: 0.5B, 1.8B, 7B, 14B, 72B
  • Hidden Size:
    • 0.5B: 1024
    • 1.8B: 2048
    • 7B: 4096
    • 14B: 5120
    • 72B: 8192
  • Attention Heads:
    • 0.5B: 16
    • 1.8B: 16
    • 7B: 32
    • 14B: 40
    • 72B: 64
  • Layers:
    • 0.5B: 24
    • 1.8B: 24
    • 7B: 32
    • 14B: 40
    • 72B: 80
  • FFN Dimension: 8/3 × hidden size
    • 0.5B: 8/3 × 1024 ≈ 2731
    • 1.8B: 8/3 × 2048 ≈ 5461
    • 7B: 8/3 × 4096 ≈ 10923
    • 14B: 8/3 × 5120 ≈ 13653
    • 72B: 8/3 × 8192 ≈ 21845
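The FFN arithmetic above can be reproduced directly; note that released checkpoints typically round the result further to a hardware-friendly multiple (an assumption here, not stated in the notes):

```python
# 8/3 of the hidden size, rounded to the nearest integer.
hidden_sizes = {"0.5B": 1024, "1.8B": 2048, "7B": 4096, "14B": 5120, "72B": 8192}
ffn_dims = {name: round(8 / 3 * h) for name, h in hidden_sizes.items()}
# -> {'0.5B': 2731, '1.8B': 5461, '7B': 10923, '14B': 13653, '72B': 21845}
```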

Context Length Extension

todo

Pre-training

Pre-training Data

  • Multi-source Heterogeneous Data, including public web documents, encyclopedia, books, codes, etc.
  • Multilingual Corpus: primarily English and Chinese
  • Data Preprocessing:
    • Deduplication
    • Hybrid Quality Scrubbing: rule-based and machine-learning-based methods
    • Selective Up-sampling
    • Instruction-Augmented Pretraining
    • Data Decontamination
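The deduplication step can be sketched as an exact-match pass: drop any document whose normalized text hashes to a value already seen. Real pipelines add fuzzy matching (e.g. MinHash) on top; this is only the exact step and not Qwen's actual pipeline.

```python
import hashlib

def deduplicate(docs: list[str]) -> list[str]:
    # Normalize (lowercase, collapse whitespace), hash, keep first copy.
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

deduplicate(["Hello  World", "hello world", "bye"])
# -> ['Hello  World', 'bye']
```

Hashing keeps memory proportional to the number of unique documents rather than their total size, which matters at trillion-token scale.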

Alignment

Supervised Finetuning (SFT)

  • training objective: next-token prediction
  • apply loss masks to the system and user inputs, so that loss is computed only on the assistant's responses.

In short, SFT finetunes a pretrained LLM on chat-style data.
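The loss-masking bullet above can be sketched as follows: each token position gets mask 1 only if it belongs to an assistant turn, so system and user tokens contribute no loss. Token ids are placeholders for illustration.

```python
def build_loss_mask(turns: list[tuple[str, list[int]]]) -> list[int]:
    # One mask entry per token: 1 = supervised (assistant), 0 = masked.
    mask = []
    for role, token_ids in turns:
        keep = 1 if role == "assistant" else 0
        mask.extend([keep] * len(token_ids))
    return mask

mask = build_loss_mask([
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
])
# -> [0, 0, 0, 0, 0, 1, 1]
```

During training the per-token cross-entropy is multiplied by this mask, so the model still attends over the full conversation but is only optimized to produce the responses.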

Reinforcement Learning from Human Feedback (RLHF)

Reward Model
Reinforcement Learning
| Stages | Pre-training | SFT | Reinforcement Learning |
| --- | --- | --- | --- |
| Purpose | Language Foundations & World Knowledge | Chat-style Alignment & Instruction Following | Human Preference Alignment |
| **Hyperparameters** | | | |
| Training Objective | Next-token prediction | Next-token prediction | Reward Maximization (PPO) |
| Vocabulary Size | 152k | 152k | - |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | - |
| Learning Rate | Cosine schedule (decay to 10% of peak) | Warmup to 2×10⁻⁶ over 1430 steps | - |
| Precision | BFloat16 mixed precision | BFloat16 mixed precision | - |
| Batch Size | - | 128 | - |
| Training Steps | - | 4000 | - |
| Weight Decay | - | 0.1 | - |
| Dropout | - | 0.1 | - |
| Gradient Clipping | - | 1.0 | - |
| Context Length | 2048 | 2048 | - |
| **Data** | | | |
| Training Corpus | Up to 3 trillion tokens | - | - |
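The pre-training learning-rate schedule can be sketched as cosine decay from the peak down to 10% of the peak over the training horizon (warmup omitted for brevity; the peak value here is a placeholder, not from the paper):

```python
import math

def cosine_lr(step: int, total: int, peak: float) -> float:
    # Cosine decay from `peak` at step 0 to 0.1 * peak at step `total`.
    floor = 0.1 * peak
    progress = min(step / total, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

cosine_lr(0, 1000, 3e-4)     # -> 0.0003 (peak)
cosine_lr(1000, 1000, 3e-4)  # -> 3e-05 (10% floor)
```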

Code-Qwen: Specialized Model for Coding

  • Code-Qwen: continual pre-training
  • Code-Qwen-Chat: supervised finetuned model

Math-Qwen: Specialized Model for Mathematics Reasoning

Experiments

References