Qwen Technical Report
Qwen Team, Alibaba Group
arXiv 2309.16609
QwenLM/Qwen
Qwen/qwen
TL;DR
Motivation
Key Innovations
- Qwen: the base pretrained language models
- Qwen-Chat: the chat models fine-tuned with human alignment techniques (including RLHF)
- Code-Qwen: coding-specialized base models
- Code-Qwen-Chat: coding-specialized chat models
- Math-Qwen-Chat: mathematics-focused chat models

Approach
Tokenization
- Tokenizer: tiktoken (BBPE)
- Base Vocabulary: cl100k_base
- Augmentation: multilingual augmentation (primarily Chinese)
- Special Handling: numbers are split into single digits
- Vocabulary Size: approximately 152k
Encoding Compression Rate: Qwen achieves higher compression efficiency than its competitors in most languages.

Despite the increase in vocabulary size, Qwen maintains its performance levels in downstream evaluation.
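The single-digit split mentioned above can be sketched as a pre-tokenization pass that runs before BPE merging. This is a hypothetical illustration (the function name and regex are mine, not from the report):

```python
import re

# Hypothetical sketch of Qwen-style digit handling: numbers are split into
# single digits before BPE, so every digit maps to its own token and no
# multi-digit merges are learned.
DIGIT_SPLIT = re.compile(r"\d|\D+")

def pre_tokenize(text: str) -> list[str]:
    """Split a string so that each digit becomes a separate piece."""
    return DIGIT_SPLIT.findall(text)

print(pre_tokenize("price: 1234"))  # ['price: ', '1', '2', '3', '4']
```

Splitting digits keeps numeric tokens compositional, which is often credited with helping arithmetic-style generalization.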
Model Architecture

Qwen is derived from the open-source LLaMA, incorporating the following modifications:
- Untied Embedding: the input embedding and output projection weights are untied rather than shared, accepting a higher memory cost.
- RoPE Positional Embedding: use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
- Bias: For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
- Pre-Norm & RMSNorm:
  - Pre-norm improves training stability compared to post-norm.
  - RMSNorm maintains equivalent performance to LayerNorm while improving efficiency.
- Activation Function:
  - SwiGLU, a combination of Swish and the Gated Linear Unit.
  - Following common practice in previous research, the dimension of the feed-forward network (FFN) is reduced from 4 times the hidden size to 8/3 of the hidden size.
- Flash Attention: employ Flash Attention in the attention modules to improve computational efficiency and reduce memory usage.
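The RoPE precision point above can be sketched as follows: the inverse-frequency table is computed and kept in FP32 rather than BF16/FP16. A minimal NumPy sketch, assuming the standard RoPE base of 10000 (the base value is not stated in this section):

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    # Keep the inverse-frequency table in float32: computing it in
    # bf16/fp16 loses precision, which degrades positional accuracy.
    exponents = np.arange(0, head_dim, 2, dtype=np.float32) / head_dim
    return (base ** -exponents).astype(np.float32)

inv_freq = rope_inv_freq(128)
print(inv_freq.dtype, inv_freq.shape)  # float32 (64,)
```

Each pair of channels in a head rotates at one of these frequencies; index 0 carries the highest frequency (1.0).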
Model Parameters
Qwen provides multiple model scales with the following architecture parameters:
- Parameters: 0.5B, 1.8B, 7B, 14B, 72B
- Hidden Size:
- 0.5B: 1024
- 1.8B: 2048
- 7B: 4096
- 14B: 5120
- 72B: 8192
- Attention Heads:
- 0.5B: 16
- 1.8B: 16
- 7B: 32
- 14B: 40
- 72B: 64
- Layers:
- 0.5B: 24
- 1.8B: 24
- 7B: 32
- 14B: 40
- 72B: 80
- FFN Dimension: 8/3 × hidden size
- 0.5B: 8/3 × 1024 ≈ 2731
- 1.8B: 8/3 × 2048 ≈ 5461
- 7B: 8/3 × 4096 ≈ 10923
- 14B: 8/3 × 5120 ≈ 13653
- 72B: 8/3 × 8192 ≈ 21845
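The FFN dimensions above follow directly from the 8/3 rule; a quick sketch to reproduce them (note that released checkpoints may round these to hardware-friendly multiples, which this sketch does not attempt):

```python
# Derive the approximate FFN inner dimension (8/3 x hidden size)
# for each listed model scale.
hidden_sizes = {"0.5B": 1024, "1.8B": 2048, "7B": 4096, "14B": 5120, "72B": 8192}

ffn_dims = {name: round(8 * h / 3) for name, h in hidden_sizes.items()}
print(ffn_dims)
# {'0.5B': 2731, '1.8B': 5461, '7B': 10923, '14B': 13653, '72B': 21845}
```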
Context Length Extension
todo
Pre-training
Pre-training Data
- Multi-source Heterogeneous Data, including public web documents, encyclopedias, books, code, etc.
- Multilingual Corpus: primarily English and Chinese
- Data Preprocessing:
- Deduplication
- Hybrid Quality Scrubbing: rule-based and machine-learning-based methods
- Selective Up-sampling
- Instruction-Augmented Pretraining
- Data Decontamination
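The deduplication step above is the simplest piece of this pipeline to illustrate. A minimal sketch of exact (byte-identical) deduplication; the actual pipeline is more elaborate (e.g. fuzzy matching), and this function is my illustration, not the report's implementation:

```python
import hashlib

def deduplicate(docs: list[str]) -> list[str]:
    """Drop byte-identical duplicate documents, keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        # Hash instead of storing full documents to bound memory usage.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["a", "b", "a"]))  # ['a', 'b']
```

Exact dedup only catches identical copies; near-duplicate detection (e.g. MinHash-style sketches) handles reformatted or lightly edited copies.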
Alignment
Supervised Fine-tuning (SFT)
SFT fine-tunes a pretrained LLM on chat-style data.
- Training objective: next-token prediction
- Loss masks are applied to the system and user inputs, so the loss is computed only on the assistant's responses.
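The loss masking can be sketched as label construction: tokens from system and user turns get an "ignore" label so they do not contribute to the loss. The token ids and roles below are made up for illustration; `-100` is the conventional ignore index in common cross-entropy implementations:

```python
IGNORE = -100  # conventional "ignore" label in cross-entropy implementations

def build_labels(token_ids: list[int], roles: list[str]) -> list[int]:
    """roles[i] is 'system', 'user', or 'assistant' for token i.

    Only assistant tokens keep their id as the target; everything else
    is masked out of the next-token-prediction loss.
    """
    return [tid if role == "assistant" else IGNORE
            for tid, role in zip(token_ids, roles)]

tokens = [101, 102, 103, 104, 105]
roles = ["system", "user", "user", "assistant", "assistant"]
print(build_labels(tokens, roles))  # [-100, -100, -100, 104, 105]
```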
Reinforcement Learning from Human Feedback (RLHF)
Reward Model
Reinforcement Learning
| Stages | Pre-training | SFT | Reinforcement Learning |
|---|---|---|---|
| Hyperparameters | | | |
| Purpose | Language foundations & world knowledge | Chat-style alignment & instruction following | Human preference alignment |
| Training Objective | Next-token prediction | Next-token prediction | Reward maximization (PPO) |
| Vocabulary Size | 152k | 152k | 152k |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=10⁻⁸) | - | - |
| Learning Rate | Cosine schedule (peak → 10% of peak) | Warmup to 2×10⁻⁶ (1430 steps) | - |
| Precision | BFloat16 mixed precision | - | - |
| Batch Size | 128 | - | - |
| Training Steps | 4000 | - | - |
| Weight Decay | 0.1 | - | - |
| Dropout | 0.1 | - | - |
| Gradient Clipping | 1.0 | - | - |
| Context Length | 2048 | - | - |
| Data | | | |
| Training Corpus | Up to 3 trillion tokens | - | - |
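The reward-model stage is commonly trained with a pairwise ranking loss over chosen/rejected response pairs (Bradley-Terry style). The exact formula is not given in these notes, so treat the following as an assumption rather than the report's definition:

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the reward model scores the chosen response
    higher than the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A large positive margin yields a small loss; a negative margin is penalized.
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0)
```

The trained reward model then supplies the scalar signal that PPO maximizes in the reinforcement learning stage.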
Code-Qwen: Specialized Model for Coding
- Code-Qwen: obtained by continual pre-training on code data
- Code-Qwen-Chat: its supervised fine-tuned chat variant