InstructGPT
NeurIPS 2022 · OpenAI · arXiv:2203.02155
TL;DR
InstructGPT aligns language models with human intent by combining supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF), improving instruction following and reducing harmful outputs.
Motivations & Innovations
Approach
Model
GPT-3 pretrained language models (1.3B, 6B, and 175B parameters).
Training Recipe

Supervised Fine-tuning (SFT)
Reward Modeling (RM)
Reinforcement Learning (RL)
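The reward model in the middle stage is trained on human comparisons with a pairwise ranking loss: for each pair, it should score the preferred response above the rejected one. A minimal sketch, assuming the scores are plain floats standing in for the reward model's scalar outputs:

```python
import math

def pairwise_rm_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss for reward-model training.

    For each comparison the loss is -log(sigmoid(r_w - r_l)), where r_w is
    the score of the human-preferred response and r_l the score of the
    rejected one.  Scores are illustrative floats, not real model outputs.
    """
    assert len(chosen_scores) == len(rejected_scores)
    total = 0.0
    for r_w, r_l in zip(chosen_scores, rejected_scores):
        # sigmoid of the score margin; loss shrinks as the margin grows
        total += -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))
    return total / len(chosen_scores)
```

When the two scores are equal the loss is log 2 per pair; it decreases monotonically as the preferred response's score pulls ahead.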
Data Recipe
Step 1: Collect demonstration data, and train a supervised policy.
Step 2: Collect comparison data, and train a reward model.
Step 3: Optimize a policy against the reward model using PPO.
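In the final PPO step, the training signal is the reward-model score minus a KL penalty that keeps the RL policy from drifting too far from the SFT policy. A minimal sketch of that reward, assuming per-token log-probabilities are supplied as plain lists (the beta value is illustrative):

```python
def rlhf_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Reward for the PPO step: reward-model score minus a KL penalty.

    rm_score        -- scalar reward-model score for the full response
    logprob_policy  -- per-token log-probs under the current RL policy
    logprob_sft     -- per-token log-probs under the frozen SFT policy
    beta            -- KL penalty coefficient (value here is illustrative)
    """
    # Sample-based KL estimate: sum of log-prob differences over tokens
    kl = sum(lp - ls for lp, ls in zip(logprob_policy, logprob_sft))
    return rm_score - beta * kl
```

If the policy matches the SFT model exactly, the penalty vanishes and the reward equals the reward-model score; the more the policy diverges, the larger the deduction.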