LLaMA 3: The Most Capable Openly Available LLM to Date
Meta AI
ai.meta.com/blog/meta-llama-3
meta-llama/llama3
meta-llama/Meta-Llama-3-8B
TL;DR
LLaMA 3 represents a significant advancement in openly available language models, featuring improved reasoning capabilities, an extended context length of 8,192 tokens, and a new tokenizer with a 128K-token vocabulary. The initial release includes 8B and 70B parameter models, with larger models (over 400B parameters) still in training at announcement time.
Motivation
LLaMA 3 aims to push the boundaries of open-source language models by:
- Improving reasoning and instruction-following capabilities
- Enhancing multilingual performance
- Extending context length for better long-context understanding
- Providing a more efficient tokenizer for better compression
Key Innovations
- Improved Tokenizer: new tokenizer with a 128K-token vocabulary (4× the 32K vocabulary of LLaMA 2)
- Extended Context: 8,192 tokens, double LLaMA 2's 4,096
- Better Reasoning: driven by a far larger, higher-quality pre-training corpus and refined post-training
- Multilingual Support: over 5% of pre-training data is non-English, covering 30+ languages
- Instruction Tuning: stronger instruction-following via SFT, rejection sampling, PPO, and DPO
Approach
Model Architecture
LLaMA 3 retains the decoder-only Transformer architecture of its predecessors, with several refinements:
- Pre-normalization: RMSNorm applied before each sub-layer for training stability
- SwiGLU Activation: gated feed-forward layers in place of a standard ReLU/GELU MLP
- Rotary Position Embeddings (RoPE): position information injected by rotating query and key vectors
- Grouped Query Attention (GQA): adopted in both the 8B and 70B models to shrink the KV cache and speed up inference (a sketch follows this list)
- Architecture Variants:
- 8B: 32 layers, 4,096 hidden dimension, 32 query heads sharing 8 KV heads
- 70B: 80 layers, 8,192 hidden dimension, 64 query heads sharing 8 KV heads
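The sketch below is a minimal grouped query attention layer, assuming PyTorch 2.x. The head counts mirror the published 8B configuration, but the class and variable names are illustrative, and RoPE is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: many query heads share a smaller set of KV heads."""
    def __init__(self, dim=4096, n_heads=32, n_kv_heads=8):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # Fewer K/V projections than Q projections is what shrinks the KV cache.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        # Each group of n_heads // n_kv_heads query heads shares one KV head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=2)
        v = v.repeat_interleave(group, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, hd)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(bsz, seqlen, -1))

x = torch.randn(1, 16, 4096)
print(GroupedQueryAttention()(x).shape)  # torch.Size([1, 16, 4096])
```

With 32 query heads sharing 8 KV heads, the KV cache is 4× smaller than under full multi-head attention, which is what makes batched long-context inference cheaper.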
Tokenization
- Tokenizer: BPE tokenizer built on tiktoken, replacing the SentencePiece tokenizer of LLaMA 2
- Vocabulary Size: 128,000 tokens (4× increase from LLaMA 2's 32,000)
- Benefits (compared concretely in the sketch after this list):
- Better compression: Meta reports up to ~15% fewer tokens than LLaMA 2 on comparable text
- Improved handling of code and technical content
- Better multilingual support
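A quick way to see the compression gain is to tokenize the same text with both generations of tokenizer through Hugging Face transformers. Both repos are gated, so this assumes you have accepted the model licenses and are authenticated with huggingface_hub; the sample string is arbitrary.

```python
from transformers import AutoTokenizer

tok_v2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok_v3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "def rotate(p, theta):\n    return complex(*p) * cmath.exp(1j * theta)"
print(len(tok_v2.encode(text)), "tokens with the 32K-vocab LLaMA 2 tokenizer")
print(len(tok_v3.encode(text)), "tokens with the 128K-vocab LLaMA 3 tokenizer")
```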
Pre-training
Pre-training Data
- Total Training Data: over 15 trillion tokens, more than seven times the 2 trillion used for LLaMA 2
- Data Quality: heuristic filters, deduplication, and model-based quality classifiers (LLaMA 2 was used to help generate training data for these classifiers)
- Data Diversity: over 5% non-English data covering 30+ languages
- Code: roughly four times more code data than LLaMA 2
- Context Length: 8,192 tokens during training (see the packing sketch below)
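Meta has not published its data pipeline in detail, but a common way to realize a fixed 8,192-token training context is to concatenate tokenized documents with an end-of-text separator and slice the stream into context-length chunks. The sketch below is hypothetical, not Meta's code.

```python
from typing import Iterable, Iterator

CONTEXT_LEN = 8192  # LLaMA 3's pre-training sequence length

def pack(docs: Iterable[list[int]], eos_id: int) -> Iterator[list[int]]:
    """Pack tokenized documents into fixed-length training sequences."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)           # separator between documents
        while len(buffer) >= CONTEXT_LEN:
            yield buffer[:CONTEXT_LEN]  # one full training sequence
            buffer = buffer[CONTEXT_LEN:]
```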
Training Details
- Optimizer: AdamW; exact hyperparameters were not disclosed with the initial release (a hedged configuration sketch follows this list)
- Scaling: data mix and compute allocation guided by scaling-law experiments
- Infrastructure: two custom-built 24K-GPU clusters, combining data, model, and pipeline parallelism
- Efficiency: Meta reports sustained compute utilization of over 400 TFLOPS per GPU
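As a concrete but assumed configuration, the sketch below pairs AdamW with linear warmup and cosine decay in PyTorch. The hyperparameter values follow the published LLaMA 2 recipe, since LLaMA 3's were not disclosed at release.

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the full transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000  # assumed values

def lr_lambda(step):
    if step < warmup_steps:                              # linear warmup
        return step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * t))      # cosine decay to 10% of peak

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```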
Post-training
Supervised Fine-Tuning (SFT)
- Instruction Tuning: large-scale supervised fine-tuning on instruction-response pairs
- Quality: carefully curated responses with multiple rounds of quality assurance
- Diversity: broad coverage of tasks and domains (a loss-masking sketch follows this list)
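Mechanically, SFT usually computes cross-entropy only on response tokens, masking out the prompt. Whether Meta masks prompts exactly this way is an assumption, and the tensor names below are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int):
    """logits: (seq, vocab); input_ids: (seq,). Loss only on response tokens."""
    targets = input_ids[1:].clone()       # next-token prediction targets
    targets[: prompt_len - 1] = -100      # mask positions that predict prompt tokens
    return F.cross_entropy(logits[:-1], targets, ignore_index=-100)
```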
Reinforcement Learning from Human Feedback (RLHF)
- Pipeline: rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO), in addition to SFT
- Preference Data: learning from preference rankings notably improved reasoning and coding performance
- Safety: safety evaluations plus tooling such as Llama Guard 2 for input/output filtering (a minimal DPO sketch follows this list)
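Since DPO is confirmed in the post-training mix, a minimal version of its objective (Rafailov et al., 2023) is sketched below. Inputs are summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each argument: (batch,) summed response log-probs. Returns scalar loss."""
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model's preference.
    margins = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margins).mean()
```

The beta temperature controls how far the policy may drift from the reference model; 0.1 is a common default, not a published LLaMA 3 value.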
Experiments
LLaMA 3 models set a new state of the art among openly available models at their scale. Meta's reported results for the instruction-tuned models include:
- MMLU: 68.4 (8B) and 82.0 (70B) on multi-task language understanding
- GSM8K: strong grade-school math reasoning, with the 70B model scoring above 90
- HumanEval: competitive code generation
- AGIEval: strong results on this suite of human-exam-derived benchmarks (reported for the pre-trained base models)
- Multilingual: improved performance across the 30+ languages represented in pre-training
On Meta's reported benchmarks, the instruction-tuned 70B model is competitive with leading closed models such as Gemini Pro 1.5 and Claude 3 Sonnet.
References
- Meta AI. (2024). Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/
- Meta AI. (2024). Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.