LLaMA: Open and Efficient Foundation Language Models
Meta AI
arXiv 2302.13971
facebookresearch/llama
meta-llama/Llama-2-7b-hf
TL;DR
LLaMA (Large Language Model Meta AI) is a collection of foundation language models ranging from 7B to 65B parameters. The models demonstrate that state-of-the-art performance can be achieved using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
Motivation
The goal of LLaMA is to show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. By training on more tokens, smaller models can achieve better performance than larger models trained on fewer tokens.
Key Innovations
- Efficient Training: Smaller models trained on more tokens outperform larger models trained on fewer tokens
- Open Foundation: Trained exclusively on publicly available datasets
- Scalable Architecture: Transformer-based architecture with optimizations for efficiency
- Multiple Sizes: Four model sizes (7B, 13B, 33B, 65B) for different use cases
Approach
Model Architecture
LLaMA is based on the Transformer architecture with the following modifications:
- Pre-normalization: Using RMSNorm for improved training stability
- SwiGLU Activation: Swish-Gated Linear Unit activation function
- Rotary Position Embeddings (RoPE): Replacing absolute positional embeddings with rotary embeddings
- Architecture Variants:
- 7B: 32 layers, 32 attention heads, 4096 hidden dimension
- 13B: 40 layers, 40 attention heads, 5120 hidden dimension
- 33B: 60 layers, 52 attention heads, 6656 hidden dimension
- 65B: 80 layers, 64 attention heads, 8192 hidden dimension
Tokenization
- Tokenizer: SentencePiece with BPE algorithm
- Vocabulary Size: 32,000 tokens
- Encoding: Byte-level BPE for better handling of multilingual text
Pre-training
Pre-training Data
The training data consists of a mixture of several large-scale datasets:
- CommonCrawl: Web crawl data (67% of training data)
- C4: Cleaned CommonCrawl (15% of training data)
- Wikipedia: English Wikipedia (4.5% of training data)
- Gutenberg and Books3: Books dataset (4.5% of training data)
- ArXiv: Scientific papers (2.5% of training data)
- Stack Exchange: Q&A data (2% of training data)
Total Training Data: Approximately 1.4 trillion tokens
Training Details
- Context Length: 2,048 tokens
- Optimizer: AdamW with β₁=0.9, β₂=0.95
- Learning Rate: Cosine schedule with warmup
- Batch Size: 4 million tokens per batch
- Training Duration: Varies by model size (7B: ~82K steps, 65B: ~15K steps)
Experiments
LLaMA models achieve competitive performance across various benchmarks:
- MMLU: Strong performance on multi-task language understanding
- GSM8K: Competitive results on mathematical reasoning
- HumanEval: Good performance on code generation
- TriviaQA: Strong performance on question answering
The 13B parameter model outperforms GPT-3 (175B) on most benchmarks, while the 65B model is competitive with PaLM (540B).
References
- Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.