Pretraining Builds Fluency, Not Knowledge

Pretraining is one thing, repeated trillions of times.

Show the model a sequence of text. Ask it to guess the next token. Compare the guess to the actual next token. Nudge the weights. Repeat.

Nothing else is happening.

The model is not being told what is true. It is not being told what is helpful. It is being shown sequences and asked, at every position, to assign a probability distribution over the next token. The training signal is the difference between that distribution and the actual next token in the data. That is the entire job.
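The loop above can be sketched for a single position. This is a minimal illustration, not any lab's training code: the four-token vocabulary, the scores, and the learning rate are all made up, and the update is applied to the scores directly as a stand-in for nudging the weights that produced them.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy at one position: -log p(actual next token)."""
    return -math.log(softmax(logits)[target_id])

# Hypothetical vocabulary and made-up model scores for one position.
vocab = ["Paris", "London", "the", "banana"]
logits = [2.0, 1.0, 0.5, -1.0]
target_id = vocab.index("Paris")  # the token that actually came next

loss = next_token_loss(logits, target_id)

# For softmax + cross-entropy, the gradient with respect to the scores
# is (predicted probabilities - one-hot vector of the actual token).
probs = softmax(logits)
grad = [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]

# One gradient-descent step on the scores (illustrative learning rate).
lr = 0.5
new_logits = [x - lr * g for x, g in zip(logits, grad)]
new_loss = next_token_loss(new_logits, target_id)

print(f"loss before: {loss:.3f}, after one step: {new_loss:.3f}")
```

Nothing in the loss mentions truth or helpfulness. It only measures how much probability the model put on the token that actually followed.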

The scale is what makes it work.

GPT-3 went through this loop on about 300 billion tokens. Tom Brown and 30 co-authors at OpenAI described the setup in their 2020 paper (arXiv:2005.14165). 175 billion parameters. A filtered slice of Common Crawl, plus WebText2, Books1, Books2, and English Wikipedia. The Common Crawl portion was filtered with a classifier trained to favor documents that resemble high-quality reference corpora, and near-duplicates were removed. The full mixture is in Table 2.2 of the paper.

Many open models draw on C4, the Colossal Clean Crawled Corpus. Colin Raffel and colleagues at Google released it with the T5 paper (arXiv:1910.10683), first posted in 2019 and published in 2020. It is a scrubbed Common Crawl snapshot of about 750 GB. The Pile, RefinedWeb, Wikipedia, large GitHub crawls for code, and scientific papers for math fill out the rest of the modern corpus diet.

How Much Data, How Much Compute

Two papers govern how to spend the budget.

Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models" in 2020 (arXiv:2001.08361). Test loss falls as a smooth power law in three quantities: parameter count, dataset size, and compute, with some trends spanning more than seven orders of magnitude. Their conclusion at the time was that, given a fixed compute budget, the right move was a very large model trained on a relatively modest amount of data.
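What "smooth power law" means can be shown in a few lines. The sketch below uses the general form L(N) = (N_c / N)^alpha with constants picked only to show the shape; they are hypothetical and are not the paper's fitted values.

```python
def power_law_loss(n, n_c, alpha):
    """Loss of the form L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha

# Hypothetical constants, chosen only to illustrate the curve.
N_C = 1e14
ALPHA = 0.08

sizes = [10 ** k for k in range(6, 12)]  # 1e6 to 1e11 parameters
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]

# Under a power law, every 10x increase in N multiplies the loss
# by the same constant factor, 10 ** -ALPHA.
ratios = [losses[i + 1] / losses[i] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {loss:.3f}")
print("loss ratio per 10x:", [round(r, 4) for r in ratios])
```

The constant ratio is the point. On a log-log plot this is a straight line, which is what lets a lab extrapolate from small runs to large ones.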

Two years later, DeepMind disagreed. Jordan Hoffmann and colleagues trained over 400 models, from 70 million to over 16 billion parameters, on 5 to 500 billion tokens (arXiv:2203.15556). They reached a different conclusion. For compute-optimal training, parameters and training tokens should grow at roughly the same rate. Their 70-billion-parameter Chinchilla, trained on 1.4 trillion tokens, beat the 280-billion-parameter Gopher, which had been trained on only 300 billion tokens. The lesson reshaped how labs spent their budgets afterward.
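The Chinchilla and Gopher numbers are easy to put side by side. The sketch below uses the parameter and token counts quoted above. The compute estimate C ≈ 6 × N × D is a common rule of thumb and an assumption here, not something this article or the arithmetic itself establishes.

```python
def tokens_per_param(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params

def approx_train_flops(params, tokens):
    """Rule-of-thumb training compute, C ~ 6 * N * D (an assumption)."""
    return 6 * params * tokens

# (parameters, training tokens) as quoted in the text above.
models = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),
}

for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    flops = approx_train_flops(n, d)
    print(f"{name}: {ratio:.1f} tokens per parameter, ~{flops:.2e} FLOPs")
```

Under that approximation the two runs land within about 20 percent of each other in compute. Chinchilla spent a similar budget on a model a quarter the size and roughly twenty tokens per parameter instead of about one.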

Meta's LLaMA paper in February 2023 (arXiv:2302.13971) applied this directly. Models from 7 to 65 billion parameters, trained on 1 to 1.4 trillion tokens of public data, with the 13-billion-parameter model outperforming GPT-3 on most benchmarks at less than a tenth of the parameter count.

Fluency Comes Before Knowledge

The mechanics are simple.

Next-token prediction over diverse text rewards the model for internalizing whatever is dense in the data. Grammar. Idiom. Register. The shape of question and answer pairs. The structure of code blocks. The conventions of citations. The gradient signal for these patterns is strong because they show up everywhere.

Facts are different.

"Paris is the capital of France" appears thousands of times in slightly different sentences. The publication year of an obscure paper might appear once. Or it might contradict itself across copies. The model learns the first reliably. The second, not so much.

That gap is the whole reason hallucinations exist. The model is not lying. It is doing exactly what it was trained to do: continue the sequence with whatever string is most plausible given the context. When the real fact is dense in the data, the most plausible continuation also happens to be correct. When it isn't, the most plausible continuation is just fluent.
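A toy version of that gap, assuming the crudest possible "model": count what followed a prompt in the corpus and pick the most frequent continuation. The corpus, the paper, and its publication years are all invented for illustration, and a real language model generalizes far beyond lookup, but the failure mode has the same shape.

```python
from collections import Counter

def continuation_distribution(corpus, prompt):
    """Estimate p(next word | prompt) by counting what followed the prompt."""
    counts = Counter()
    for line in corpus:
        if line.startswith(prompt):
            rest = line[len(prompt):].split()
            if rest:
                counts[rest[0]] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def greedy(dist):
    """Pick the single most probable continuation."""
    return max(dist, key=dist.get)

# Invented corpus: one dense fact, one sparse and contradictory one.
corpus = (
    ["The capital of France is Paris"] * 1000
    + ["The capital of France is Lyon"] * 3    # a little noise
    + ["The paper was published in 2017"]      # hypothetical true year, seen once
    + ["The paper was published in 2018"] * 2  # wrong copies
)

fact_dist = continuation_distribution(corpus, "The capital of France is ")
rare_dist = continuation_distribution(corpus, "The paper was published in ")

print(greedy(fact_dist), fact_dist)
print(greedy(rare_dist), rare_dist)
```

Both answers come out equally fluent. Only the first is backed by enough data to also be right.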

This is why a base model can write convincingly on almost any topic and still invent the specifics inside that confident prose. The fluency is real. The accuracy is a side effect of how often the truth showed up.

The Emergence Argument

There is an ongoing fight about what else pretraining produces.

Jason Wei and colleagues at Google claimed in 2022 (arXiv:2206.07682) that certain abilities, like multistep arithmetic and instruction following, appear sharply at specific scales rather than gradually. The metaphor that stuck was "emergent."

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo at Stanford pushed back at NeurIPS 2023 (arXiv:2304.15004). They showed the apparent sharpness often disappears when you switch to a continuous evaluation metric instead of a strict pass-fail one. The capability was scaling smoothly the whole time. The bar for credit was just discontinuous.
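The metric argument can be reproduced with arithmetic. The sketch assumes each token of an answer is right independently with the same probability, which is a simplification, and the accuracy values and answer length are made up rather than taken from either paper.

```python
def exact_match(per_token_acc, length):
    """Chance of getting all `length` tokens right, assuming independence."""
    return per_token_acc ** length

LENGTH = 10  # hypothetical answer length in tokens

# Made-up per-token accuracies, improving smoothly with scale.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
strict = [exact_match(p, LENGTH) for p in per_token]

for p, s in zip(per_token, strict):
    print(f"per-token accuracy {p:.2f} -> exact match {s:.3f}")
```

The per-token column climbs steadily. The exact-match column sits near zero for most of the range and then shoots up, even though nothing discontinuous happened underneath.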

The debate is not settled. What is settled is that pretraining alone produces something specific. It is called a base model.

What Pretraining Hands You

A base model is not the helpful assistant most users have met.

It is a fluent next-token predictor with no particular orientation toward your intent. Ask a base model "What is the capital of France?" and it might tell you Paris. It might also continue with four more questions in the same format, because that is what a list of capital-city quiz questions looks like in the data it learned from.

It will not refuse anything. It will not "try to help." It does not know what helpfulness is. It has only learned what comes next.

Turning that artifact into something that follows instructions, declines requests, and behaves like an assistant is a separate phase. That is the next article in this series.

This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.

Part of the Inside the LLM series.
