LLaMA 2: Open Foundation and Fine-Tuned Chat Models

Published on Jul 18, 2023 Updated on Jan 15, 2026 LLMs LLMs, LLaMAs 2 minutes

Contents

Meta AI arXiv 2307.09288 facebookresearch/llama Hugging Face meta-llama/Llama-2-7b-hf LLaMA 2

TL;DR

LLaMA 2 is the next generation of LLaMA models, featuring improved performance, longer context length (4K tokens), and fine-tuned chat models trained with Reinforcement Learning from Human Feedback (RLHF). The models are available in 7B, 13B, and 70B parameter sizes.

Motivation

LLaMA 2 aims to build upon the success of LLaMA by introducing:

Improved pre-training with more data and longer context
Fine-tuned chat models optimized for dialogue
Enhanced safety through RLHF and safety training
Open access for research and commercial use

Key Innovations

RLHF Training: First open-source model trained with Reinforcement Learning from Human Feedback
Extended Context: 4,096 tokens (doubled from LLaMA’s 2,048)
Safety Improvements: Comprehensive safety training and red-teaming
Chat Models: Specialized models fine-tuned for conversational AI
Commercial License: Available for commercial use with certain restrictions

Approach

Model Architecture

LLaMA 2 maintains the same architecture as LLaMA with improvements:

Pre-normalization: RMSNorm for training stability
SwiGLU Activation: Swish-Gated Linear Unit
Rotary Position Embeddings (RoPE): Rotary embeddings for positional encoding
Grouped Query Attention (GQA): Introduced in 70B model for efficiency
Architecture Variants:
- 7B: 32 layers, 32 attention heads, 4096 hidden dimension
- 13B: 40 layers, 40 attention heads, 5120 hidden dimension
- 70B: 80 layers, 64 attention heads, 8192 hidden dimension (with GQA)

Tokenization

Tokenizer: SentencePiece with BPE algorithm
Vocabulary Size: 32,000 tokens
Encoding: Byte-level BPE for multilingual support

Pre-training

Pre-training Data

Total Training Data: Approximately 2 trillion tokens (40% more than LLaMA)
Data Sources:
- Publicly available online data
- Enhanced data quality filtering
- Improved data diversity
Context Length: 4,096 tokens (doubled from LLaMA)

Training Details

Optimizer: AdamW with β₁=0.9, β₂=0.95
Learning Rate: Cosine schedule with warmup
Batch Size: 4 million tokens per batch
Training Duration: Varies by model size

Post-training

Supervised Fine-Tuning (SFT)

Data: Over 100,000 human-annotated examples
Format: Multi-turn conversations
Quality: High-quality, helpful, and safe responses

Reinforcement Learning from Human Feedback (RLHF)

Reward Model: Trained on human preference data
Methods:
- Rejection Sampling
- Proximal Policy Optimization (PPO)
Safety: Integrated safety considerations into reward model

Safety Training

Red Teaming: Extensive adversarial testing
Safety Tuning: Additional fine-tuning for safety
Contextual Distillation: Knowledge distillation for safety

Experiments

LLaMA 2 models show significant improvements over LLaMA:

MMLU: Improved performance on multi-task understanding
GSM8K: Better mathematical reasoning
HumanEval: Enhanced code generation
Safety Benchmarks: Strong performance on safety evaluations
Helpfulness: Improved helpfulness in conversational tasks

The 70B model is competitive with closed-source models like GPT-3.5 and PaLM-2.

References

Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.