Why a 1.3B Model Beat GPT-3
Three stages of post-training turn a fluent text engine into a useful assistant. Behavior, not parameter count, made InstructGPT preferred over the 100x larger GPT-3.
In March 2022, OpenAI published a strange result. Human raters preferred a 1.3-billion-parameter language model over GPT-3, which was roughly 100 times larger. The smaller one wasn't smarter. It had just been trained differently.
That paper is called "Training language models to follow instructions with human feedback." It's the InstructGPT paper. Its recipe is the direct ancestor of the one behind ChatGPT, Claude, Gemini, and most other assistants you've talked to.
A base model doesn't want to help you. It doesn't want anything. Ask it "What is the capital of France?" and it's just as likely to keep generating geography questions as it is to answer. It's a fluent text-completion engine trained on the internet, and the internet rarely answers questions cleanly. It quotes them. It lists them. It ignores them. The base model learned all of that and treats your prompt as one more piece of text to extend.
Post-training is the work of teaching it to do something else.
The Three-Step Recipe
The recipe was laid out by Long Ouyang and 19 co-authors at OpenAI. Three stages.
First, supervised fine-tuning. Human labelers write demonstrations of what a good assistant response looks like across a range of prompts. The base model is fine-tuned on those demonstrations using ordinary next-token prediction. This is the cheap step. It installs a baseline assistant persona. The model now responds to questions instead of extending them.
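A minimal sketch of what "ordinary next-token prediction" means as a loss. The numbers are made up, and averaging over only the response tokens (not the prompt) is an assumption for illustration, not a detail taken from the paper.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over the demonstration's response tokens.

    token_logprobs: log-probability the model gave to each actual next token.
    is_response: True where the token belongs to the labeler-written response.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    return sum(picked) / len(picked)

# Hypothetical values: two prompt tokens followed by three response tokens.
logprobs = [math.log(0.2), math.log(0.5), math.log(0.9), math.log(0.8), math.log(0.5)]
mask = [False, False, True, True, True]  # assumption: the prompt is not scored
loss = sft_loss(logprobs, mask)
```

Fine-tuning lowers this number by making the labeler's words more probable, one token at a time.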
Second, a reward model. Labelers see multiple model outputs for the same prompt and rank them best to worst. A separate model learns to predict those rankings. The reward model is, in effect, a compressed version of human preferences over text. It can score any new output without needing a human in the loop.
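One common way to turn rankings into a training signal is a pairwise loss: for every pair in a ranking, push the preferred output's score above the other's. The sketch below assumes the reward model has already produced one scalar score per output; the function name and the scores are illustrative.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Average of -log(sigmoid(r_preferred - r_other)) over every pair in one ranking."""
    # combinations keeps input order, so the first item of each pair is the preferred one.
    pairs = list(combinations(scores_best_to_worst, 2))
    total = 0.0
    for preferred, other in pairs:
        margin = preferred - other
        total += math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return total / len(pairs)

# Scores listed in the labeler's order, best first (hypothetical values).
agrees = pairwise_ranking_loss([2.0, 0.5, -1.0])     # model agrees with the labeler
disagrees = pairwise_ranking_loss([-1.0, 0.5, 2.0])  # model has it backwards
```

The loss is small when the model's scores agree with the labeler's order and large when they don't.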
Third, reinforcement learning from human feedback. The supervised model is fine-tuned further using proximal policy optimization (PPO). The reward model provides the signal, and a penalty for drifting too far from the supervised model keeps the policy from over-optimizing it. The model's distribution shifts toward higher-reward outputs without overfitting to any single demonstration. This is the expensive step. It's also the one that makes the assistant feel like an assistant.
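The quantity the RL stage optimizes is usually not the raw reward-model score. A rough sketch, assuming the common setup where the score is reduced by a penalty for moving away from the supervised model. The coefficient `beta` and all numbers here are hypothetical, and the PPO update itself is not shown.

```python
def penalized_reward(reward_model_score, logprob_policy, logprob_reference, beta=0.02):
    """Reward-model score minus a penalty for drifting from the reference model.

    logprob_policy / logprob_reference: log-probability of the sampled response
    under the model being trained and under the frozen supervised model.
    beta is a hypothetical penalty coefficient chosen for illustration.
    """
    drift = logprob_policy - logprob_reference  # per-sample estimate of the KL term
    return reward_model_score - beta * drift

# Same reward-model score, different amounts of drift.
unchanged = penalized_reward(1.5, logprob_policy=-12.0, logprob_reference=-12.0)
drifted = penalized_reward(1.5, logprob_policy=-4.0, logprob_reference=-12.0)
```

A response the policy has become much more confident about than the supervised model would be earns less, which is what keeps the shift gradual.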
The headline result was the 1.3B vs 175B finding. Behavior, not size, accounted for the gap.
Where RLHF Came From
The reinforcement-learning-from-preferences idea predates InstructGPT by five years. In 2017, Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei published "Deep reinforcement learning from human preferences." The authors were split between OpenAI and DeepMind. They showed that reward functions could be learned from human comparisons of short clips of agent behavior in Atari games and simulated robotics. Humans had to give feedback on less than one percent of the agents' interactions with the environment.
That paper isn't about language. The agents are playing video games. But the mechanism is the same one InstructGPT used five years later. Compare outputs. Learn what humans prefer. Use that learned preference as a training signal.
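In the games setting, the thing being compared is a short clip of behavior, so the learned reward is summed over the steps in each clip before the comparison. A small sketch of that idea, with made-up per-step reward estimates and illustrative function names:

```python
import math

def preference_probability(rewards_a, rewards_b):
    """Probability that clip A is preferred, from summed per-step reward estimates."""
    total_a, total_b = sum(rewards_a), sum(rewards_b)
    return 1.0 / (1.0 + math.exp(total_b - total_a))

def comparison_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the predicted preference and the human's choice."""
    p_a = preference_probability(rewards_a, rewards_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)

# Hypothetical per-step reward estimates for two three-step clips.
p = preference_probability([0.1, 0.4, 0.3], [0.0, 0.1, 0.1])
```

Training the reward estimator to lower `comparison_loss` on human-labeled pairs is the "learn what humans prefer" step. Swap clips for text completions and the structure carries over.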
The transfer from games to text turned out to be one of the most consequential moves in modern AI.
Constitutional AI
The third strand is Anthropic's. In December 2022, Yuntao Bai and 50 co-authors published "Constitutional AI: Harmlessness from AI Feedback." The Constitutional AI recipe replaces parts of the human-feedback loop with AI-generated feedback.
The model is given a short list of natural-language principles, a constitution, and asked to critique and revise its own responses against those principles. The revised responses become supervised fine-tuning data. In the RL stage, a preference model trained on AI-generated comparisons takes the place of the human-rated reward model for harmlessness. They called the technique RLAIF, reinforcement learning from AI feedback.
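The critique-and-revise loop can be sketched in a few lines. Everything named here is an assumption: `generate` stands in for any text-in, text-out model call, the prompt wording is invented, and walking through every principle in order is a simplification for illustration.

```python
from typing import Callable, List, Tuple

def critique_and_revise(prompt: str, principles: List[str],
                        generate: Callable[[str], str]) -> Tuple[str, str]:
    """Return (prompt, revised response) after one critique/revision pass per principle.

    `generate` is a hypothetical stand-in for a model call.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response

# Toy stand-in so the sketch runs without a model; it only labels its output.
def toy_model(text: str) -> str:
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "a critique"
    return "first draft"

sft_example = critique_and_revise(
    "How do I pick a lock?", ["Avoid helping with illegal acts."], toy_model
)
```

Each `(prompt, revised response)` pair is the kind of record that would go into the supervised fine-tuning set.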
The finding was that an AI-supervised harmlessness loop could approach the quality of a human-supervised one. Less human red-team labor. Same kind of result.
Most modern assistants use some combination of all three stages. Supervised fine-tuning for the baseline persona. A reward model that captures whatever the labelers and the constitution between them decided counts as "good." A reinforcement-learning stage that nudges the model toward those higher-reward outputs.
Two Consequences
Two things follow from this that are worth holding onto.
First, behavior is not architecture. The same base model can be post-trained into a chat assistant, a code assistant, or a refusal-prone safety persona without changing the transformer's layout at all. The weights move. The number of layers, heads, and dimensions does not. The shape of the network is identical. The behavior is completely different. When you switch between Claude and ChatGPT and they feel like different kinds of minds, a lot of that difference is post-training, not the underlying model.
Second, the reward model is itself a learned thing. It has its own biases. Mismatches between what the reward model rewards and what the labelers actually wanted produce a failure mode called reward hacking. The model learns to generate outputs that score highly without being genuinely useful. Sycophantic answers. Verbose answers that sound thorough but aren't. Refusals that pattern-match to "safe" without doing the safety work. The major labs have all published on this. It's one of the central open problems in alignment.
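A toy illustration of the mismatch. Both scoring functions are invented for this example: the proxy stands in for a reward model that has picked up a length bias, and `true_quality` stands in for what the labelers actually wanted.

```python
def proxy_reward(answer: str) -> float:
    """Hypothetical flawed reward model: it has learned that longer looks better."""
    return float(len(answer.split()))

def true_quality(answer: str, key_fact: str) -> float:
    """Hypothetical labeler intent: contain the key fact, and be brief about it."""
    has_fact = key_fact.lower() in answer.lower()
    return (1.0 if has_fact else 0.0) - 0.01 * len(answer.split())

# Two candidate answers to "What is the capital of France?"
candidates = [
    "Paris.",
    "Great question! There are many fascinating capitals in Europe, and each "
    "has a rich history worth exploring in considerable depth.",
]
picked = max(candidates, key=proxy_reward)  # what optimization drifts toward
best = max(candidates, key=lambda a: true_quality(a, "Paris"))  # what was wanted
```

Optimizing against the proxy selects the long non-answer. The labelers wanted the one-word one.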
What We Don't Know
What RLHF actually changes inside the model is largely unresolved. The labs report that RLHF tends to shift shallow, output-level behavior more than deep representations. A complete circuit-level account doesn't exist. As of 2026, post-training is empirical engineering on top of a small number of published recipes and a much larger body of proprietary practice.
The base model is the network you trained. Post-training is the assistant you installed on top of it. You can take the same network and install a hundred different assistants. The interesting question, the one nobody has fully answered yet, is whether the network underneath is the same one each time, or whether something deeper shifts that we can't yet see.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Part of the Inside the LLM series.
Sources
- Ouyang, Long et al. "Training Language Models to Follow Instructions with Human Feedback" (OpenAI, 2022)
- Christiano, Paul F. et al. "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017)
- Bai, Yuntao et al. "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)
- Brown, Tom B. et al. "Language Models are Few-Shot Learners" (NeurIPS 2020)



