LLaMA 2: Open Foundation and Fine-Tuned Chat Models
Meta AI
arXiv 2307.09288
facebookresearch/llama
meta-llama/Llama-2-7b-hf
LLaMA 2
TL;DR
LLaMA 2 is the next generation of LLaMA models, featuring improved performance, longer context length (4K tokens), and fine-tuned chat models trained with Reinforcement Learning from Human Feedback (RLHF). The models are available in 7B, 13B, and 70B parameter sizes.
Motivation
LLaMA 2 aims to build upon the success of LLaMA by introducing:
- Improved pre-training with more data and longer context
- Fine-tuned chat models optimized for dialogue
- Enhanced safety through RLHF and safety training
- Open access for research and commercial use
Key Innovations
- RLHF Training: Among the first openly released models fine-tuned with Reinforcement Learning from Human Feedback
- Extended Context: 4,096 tokens (doubled from LLaMA’s 2,048)
- Safety Improvements: Comprehensive safety training and red-teaming
- Chat Models: Specialized models fine-tuned for conversational AI
- Commercial License: Available for commercial use with certain restrictions
Approach
Model Architecture
LLaMA 2 maintains the same architecture as LLaMA with improvements:
- Pre-normalization: RMSNorm for training stability
- SwiGLU Activation: Swish-Gated Linear Unit
- Rotary Position Embeddings (RoPE): positional information injected by rotating query/key vectors
- Grouped Query Attention (GQA): used in the 70B model to shrink the key/value cache and speed up inference
- Architecture Variants:
- 7B: 32 layers, 32 attention heads, 4096 hidden dimension
- 13B: 40 layers, 40 attention heads, 5120 hidden dimension
- 70B: 80 layers, 64 attention heads, 8192 hidden dimension (with GQA)
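As a rough sketch, the variants above can be captured in a small config object. Field names such as `n_kv_heads` are illustrative, not the official implementation's; the 8 key/value heads for the 70B model reflect its use of GQA.

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    # Hyperparameters for the three released LLaMA 2 sizes
    # (illustrative field names, not the reference implementation's).
    n_layers: int
    n_heads: int
    dim: int
    n_kv_heads: int        # equals n_heads unless GQA is used
    vocab_size: int = 32000
    max_seq_len: int = 4096

LLAMA2_7B = LlamaConfig(n_layers=32, n_heads=32, dim=4096, n_kv_heads=32)
LLAMA2_13B = LlamaConfig(n_layers=40, n_heads=40, dim=5120, n_kv_heads=40)
# The 70B model uses grouped-query attention with 8 key/value heads.
LLAMA2_70B = LlamaConfig(n_layers=80, n_heads=64, dim=8192, n_kv_heads=8)
```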
Tokenization
- Tokenizer: SentencePiece with BPE algorithm
- Vocabulary Size: 32,000 tokens
- Encoding: unknown UTF-8 characters are decomposed into raw bytes (byte fallback), and numbers are split into individual digits
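Two tokenizer behaviours worth noting are digit splitting and byte fallback. The toy helpers below mimic those two rules only; they are a sketch of the preprocessing, not the real SentencePiece BPE pipeline.

```python
def split_digits(text: str) -> list[str]:
    # LLaMA's tokenizer splits runs of digits into individual digit tokens;
    # this toy pre-tokenizer mimics that single rule (no BPE merges).
    pieces: list[str] = []
    for ch in text:
        if ch.isdigit():
            pieces.append(ch)
        elif pieces and not pieces[-1][-1].isdigit():
            pieces[-1] += ch
        else:
            pieces.append(ch)
    return pieces

def byte_fallback(ch: str) -> list[str]:
    # Characters with no vocabulary entry are decomposed into raw bytes,
    # rendered here in SentencePiece's <0xNN> style.
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

For example, `split_digits("year 2023")` yields one piece per digit, and `byte_fallback("€")` yields the three UTF-8 bytes of the euro sign.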
Pre-training
Pre-training Data
- Total Training Data: Approximately 2 trillion tokens (40% more than LLaMA)
- Data Sources:
- A new mix of publicly available online data (no data from Meta's products or services)
- Filtering of sites known to contain high volumes of personal information
- Up-sampling of the most factual sources
- Context Length: 4,096 tokens (doubled from LLaMA)
Training Details
- Optimizer: AdamW with β₁=0.9, β₂=0.95
- Learning Rate: Cosine schedule with warmup
- Batch Size: 4 million tokens per batch
- Compute: roughly 3.3 million A100-80GB GPU-hours in total across the three model sizes
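The paper's schedule warms up the learning rate and then decays it cosine-style to 10% of its peak (warmup of 2,000 steps; the peak value depends on model size). A minimal sketch:

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float,
               warmup_steps: int = 2000, min_ratio: float = 0.1) -> float:
    # Linear warmup followed by cosine decay to 10% of the peak LR,
    # as described for LLaMA 2 pre-training.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

At the end of training (`step == max_steps`) this returns `0.1 * peak_lr`, matching the paper's "decay to 10% of peak" description.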
Post-training
Supervised Fine-Tuning (SFT)
- Data: 27,540 high-quality human annotations (quality was found to matter more than quantity)
- Format: Multi-turn conversations
- Quality: High-quality, helpful, and safe responses
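The released chat models expect a specific prompt template built from `[INST]` and `<<SYS>>` markers. A minimal single-turn prompt builder, following the template in the released reference code (the default system prompt here is only a placeholder):

```python
# Markers from the LLaMA 2 chat prompt template.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(user_msg: str,
                 system_msg: str = "You are a helpful assistant.") -> str:
    # Wrap a single user turn with its system prompt; multi-turn dialogues
    # repeat the [INST] ... [/INST] pattern with prior answers in between.
    return f"{B_INST} {B_SYS}{system_msg}{E_SYS}{user_msg} {E_INST}"
```

For example, `build_prompt("What is RLHF?")` produces `[INST] <<SYS>>\n...\n<</SYS>>\n\nWhat is RLHF? [/INST]`, which is what the fine-tuned chat checkpoints were trained to complete.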
Reinforcement Learning from Human Feedback (RLHF)
- Reward Models: two separate models, one for helpfulness and one for safety, trained on human preference data
- Methods:
- Rejection Sampling
- Proximal Policy Optimization (PPO)
- Safety: a dedicated safety reward signal steers RLHF away from unsafe responses
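Rejection sampling reduces to best-of-K selection under the reward model: sample K candidate responses, score each, and keep the best for further fine-tuning. A sketch with stand-in `generate` and `reward` callables (the real versions are the policy and reward models):

```python
from typing import Callable

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     reward: Callable[[str], float],
                     k: int = 4) -> str:
    # Best-of-K rejection sampling: draw K candidates from the policy,
    # score them with the reward model, return the highest-scoring one.
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=reward)
```

In the paper's pipeline the selected samples form a new fine-tuning set, and PPO is applied on top of the rejection-sampled checkpoint.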
Safety Training
- Red Teaming: Extensive adversarial testing
- Safety Tuning: Additional fine-tuning for safety
- Context Distillation: responses generated with a safety preprompt are used as fine-tuning targets without the preprompt
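Context distillation can be sketched as generating a response with a safety preprompt prepended, then pairing that response with the plain prompt for fine-tuning, so the model internalises the safer behaviour without needing the preprompt at inference time. The preprompt wording below is illustrative, not the paper's:

```python
# Illustrative safety preprompt (not the paper's exact wording).
SAFETY_PREPROMPT = "You are a responsible and safe assistant. "

def make_distillation_example(prompt: str, generate) -> dict:
    # Generate a response *with* the safety preprompt, but store it
    # against the *plain* prompt as the fine-tuning target.
    safe_response = generate(SAFETY_PREPROMPT + prompt)
    return {"prompt": prompt, "response": safe_response}
```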
Experiments
LLaMA 2 models show significant improvements over LLaMA:
- MMLU: Improved performance on multi-task understanding
- GSM8K: Better mathematical reasoning
- HumanEval: Enhanced code generation
- Safety Benchmarks: Strong performance on safety evaluations
- Helpfulness: Improved helpfulness in conversational tasks
On most academic benchmarks, the 70B model is close to GPT-3.5 and on par with or better than PaLM (540B), though it still trails GPT-4 and PaLM-2-L.
References
- Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.