Contents

InstructGPT

NeurIPS 2022 · OpenAI · arXiv:2203.02155

TL;DR

InstructGPT aligns language models with human intent by combining supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF), improving instruction following while reducing harmful outputs.

Motivations & Innovations

Approach

Model

GPT-3 pretrained language models.

Training Recipe

assets/images/2026-01-15-15-02-29.webp

Supervised Fine-tuning (SFT)
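In the SFT stage, GPT-3 is fine-tuned on labeler-written demonstrations with the standard next-token cross-entropy objective. A minimal sketch of the per-example loss (the function name and inputs are illustrative, not from the paper):

```python
import math

def sft_loss(token_logprobs):
    """Mean negative log-likelihood of the demonstration tokens.

    token_logprobs: log-probabilities the model assigns to each target
    token of one demonstration (higher is better).
    """
    return -sum(token_logprobs) / len(token_logprobs)
```

Minimizing this loss pushes the model to reproduce the labelers' demonstrated responses token by token.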

Reward Modeling (RM)
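The reward model is trained on labeler rankings of K responses per prompt, using a pairwise loss over all K-choose-2 (winner, loser) pairs: minimize −log σ(r(x, y_w) − r(x, y_l)). A minimal sketch with plain Python (scalar scores stand in for reward-model outputs):

```python
import math
from itertools import combinations

def rm_pairwise_loss(r_chosen, r_rejected):
    # -log sigmoid(r_w - r_l): small when the chosen response outscores the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rm_loss_for_prompt(rewards_ranked):
    # rewards_ranked: RM scores for one prompt's K responses, ordered best -> worst
    pairs = list(combinations(rewards_ranked, 2))  # every (winner, loser) pair
    return sum(rm_pairwise_loss(w, l) for w, l in pairs) / len(pairs)
```

In the paper all pairs from one prompt are processed in a single batch element, which avoids overfitting from treating correlated pairs as independent examples.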

Reinforcement Learning (RL)
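In the RL stage, the SFT model is optimized with PPO against the reward model, with a per-token KL penalty toward the SFT policy so the optimized policy does not drift into reward-model exploits (the PPO-ptx variant also mixes in pretraining gradients). A minimal sketch of the KL-shaped reward; the beta value here is an illustrative default, not the paper's setting:

```python
def ppo_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """RM score minus a KL penalty keeping the policy near the SFT model.

    rm_score:  reward-model score for the sampled response
    logp_rl:   log-prob of the response under the current RL policy
    logp_sft:  log-prob of the same response under the frozen SFT policy
    beta:      KL coefficient (illustrative value)
    """
    return rm_score - beta * (logp_rl - logp_sft)
```

When the RL policy assigns much higher probability to a response than the SFT policy does, the penalty grows, discounting reward gained purely by moving away from the SFT distribution.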

Data Recipe

Step 1: Collect demonstration data, and train a supervised policy.

Step 2: Collect comparison data, and train a reward model.

Step 3: Optimize a policy against the reward model using PPO.

Experiments