Pretraining Builds Fluency, Not Knowledge
Pretraining does one thing trillions of times: predict the next token. The result is a fluent predictor with patchy facts, and the gap between fluency and truth is where hallucinations live.
Pretraining is one thing, repeated trillions of times.
Show the model a sequence of text. Ask it to guess the next token. Compare the guess to the actual next token. Nudge the weights. Repeat.
Nothing else is happening.
The model is not being told what is true. It is not being told what is helpful. It is being shown sequences and asked, at every position, to assign a probability distribution over the next token. The training signal is the difference between that distribution and the actual next token in the data. That is the entire job.
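The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how any production model is trained: the "model" is a single table of logits indexed by the previous token (a bigram model), the corpus is one made-up sentence, and names such as `train_bigram` are hypothetical. What it shares with real pretraining is the objective: predict a distribution, score it against the actual next token with cross-entropy, and nudge the weights.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, epochs=50, lr=0.5):
    """Toy next-token predictor: one row of logits per previous token."""
    weights = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for prev, nxt in zip(tokens, tokens[1:]):
            probs = softmax(weights[prev])      # guess a distribution over the next token
            total += -math.log(probs[nxt])      # cross-entropy against the actual next token
            for j in range(vocab_size):
                # Gradient of cross-entropy w.r.t. logit j is p_j - 1[j == actual].
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                weights[prev][j] -= lr * grad   # nudge the weights
        losses.append(total / (len(tokens) - 1))
    return weights, losses

# Hypothetical toy corpus, tokenized by whitespace.
text = "the cat sat on the mat the cat sat on the mat".split()
vocab = sorted(set(text))
ids = [vocab.index(w) for w in text]

weights, losses = train_bigram(ids, len(vocab))
after_cat = softmax(weights[vocab.index("cat")])
predicted = vocab[after_cat.index(max(after_cat))]
```

Nothing in the loop checks whether a continuation is true. After "the", this corpus continues with "cat" and "mat" equally often, so the best the model can do is split its probability between them.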
The scale is what makes it work.
GPT-3 went through this loop on about 300 billion tokens. Tom Brown and 30 co-authors at OpenAI described the setup in their 2020 paper (arXiv:2005.14165). 175 billion parameters. A filtered slice of Common Crawl, plus WebText2, Books1, Books2, and English Wikipedia. The Common Crawl portion was run through a quality classifier to favor reference content, and near-duplicates were removed. The full mixture is in Table 2.2 of the paper.
Many open models draw on C4, the Colossal Clean Crawled Corpus. Colin Raffel and colleagues at Google released it with the T5 paper in 2020 (arXiv:1910.10683). It is a scrubbed Common Crawl snapshot of about 750 GB. The Pile, RefinedWeb, Wikipedia, large GitHub crawls for code, and scientific papers for math fill out the rest of the modern corpus diet.
How Much Data, How Much Compute
Two papers govern how to spend the budget.
Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models" in 2020 (arXiv:2001.08361). Test loss falls as a smooth power law across more than seven orders of magnitude in three quantities: parameter count, dataset size, and compute. Their conclusion at the time was that, given a fixed compute budget, the right move was a very large model trained on a relatively modest amount of data.
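A small sketch shows what "smooth power law" means in practice. The functional form loss = (n_c / n) ** alpha is the general shape of a power law. The constants `n_c` and `alpha` below are illustrative placeholders, not the fitted values from the paper, and `power_law_loss` is a hypothetical helper.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Power-law loss curve; n_c and alpha are illustrative, not fitted values."""
    return (n_c / n_params) ** alpha

# Model sizes from one million to one hundred billion parameters, one decade apart.
sizes = [10 ** k for k in range(6, 12)]
losses = [power_law_loss(n) for n in sizes]

# Under a power law, every 10x increase in size multiplies the loss by the
# same factor, 10 ** -alpha. On a log-log plot that is a straight line.
ratios = [b / a for a, b in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"{n:>15,d} params -> loss {loss:.3f}")
```

The constant ratio is the point: under this form there is no size at which improvement suddenly kicks in or stops, which is what makes the curve usable for budgeting.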
Two years later, DeepMind disagreed. Jordan Hoffmann and colleagues trained over 400 models from 70 million to over 16 billion parameters on 5 to 500 billion tokens (arXiv:2203.15556). They reached a different conclusion. For compute-optimal training, parameters and training tokens should grow at roughly the same rate. Their 70-billion-parameter Chinchilla, trained on 1.4 trillion tokens, beat the 280-billion-parameter Gopher that had been trained on only 300 billion. The lesson reshaped how labs budgeted training runs afterward.
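The shift is easiest to see as tokens per parameter. The sketch below uses only the parameter and token counts quoted in this article. The compute estimate relies on the common approximation that training costs about 6 FLOPs per parameter per token, which is an assumption brought in for illustration rather than something stated above.

```python
# (parameters, training tokens) as quoted in this article.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

def tokens_per_param(params, tokens):
    return tokens / params

def approx_train_flops(params, tokens):
    # Assumption: the widely used rule of thumb C ~ 6 * N * D.
    return 6 * params * tokens

ratios = {name: tokens_per_param(p, t) for name, (p, t) in models.items()}
flops = {name: approx_train_flops(p, t) for name, (p, t) in models.items()}

for name in models:
    print(f"{name}: {ratios[name]:.1f} tokens/param, ~{flops[name]:.2e} FLOPs")
```

Under that approximation, Gopher and Chinchilla land within about 20 percent of each other in training compute. The difference is in how the budget was split: roughly one token per parameter for Gopher, twenty for Chinchilla.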
Meta's LLaMA paper in February 2023 (arXiv:2302.13971) applied this directly. Models from 7 to 65 billion parameters, trained on 1.0 to 1.4 trillion tokens of public data, with the 13-billion-parameter model reported to beat GPT-3 on most benchmarks at a fraction of the parameter count.
Fluency Comes Before Knowledge
The mechanics are simple.
Next-token prediction over diverse text rewards the model for internalizing whatever is dense in the data. Grammar. Idiom. Register. The shape of question and answer pairs. The structure of code blocks. The conventions of citations. The gradient signal for these patterns is strong because they show up everywhere.
Facts are different.
"Paris is the capital of France" appears thousands of times in slightly different sentences. The publication year of an obscure paper might appear once. Or it might contradict itself across copies. The model learns the first reliably. The second, not so much.
That gap is the whole reason hallucinations exist. The model is not lying. It is doing exactly what it was trained to do: continue the sequence with whatever string is most plausible given the context. When the real fact is dense in the data, the most plausible continuation also happens to be correct. When it isn't, the most plausible continuation is just fluent.
This is why a base model can write convincingly on almost any topic and still invent the specifics inside that confident prose. The fluency is real. The accuracy is a side effect of how often the truth showed up.
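A toy count-based predictor makes the density argument concrete. Everything here is hypothetical: the corpus is invented, the "Zorblatt paper" does not exist, and the backoff rule (ignore any context seen fewer than `min_count` times) is a crude stand-in for a weak gradient signal, not a description of how a neural model works.

```python
from collections import Counter, defaultdict

def build_counts(sentences, max_context=5):
    """Count which word follows each context of 1..max_context words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                counts[tuple(words[i - n:i])][words[i]] += 1
    return counts

def most_plausible_next(counts, prompt, max_context=5, min_count=3):
    """Use the longest context with enough evidence; otherwise back off."""
    words = prompt.split()
    for n in range(max_context, 0, -1):
        if n > len(words):
            continue
        seen = counts.get(tuple(words[-n:]))
        if seen and sum(seen.values()) >= min_count:
            return seen.most_common(1)[0][0], n
    return None, 0

# Hypothetical corpus: one dense fact, one rare fact, one generic pattern.
corpus = (
    ["the capital of France is Paris"] * 50
    + ["the Zorblatt paper was published in 2011"]
    + ["the survey paper was published in 2020"] * 30
)
counts = build_counts(corpus)

dense_answer, dense_ctx = most_plausible_next(counts, "the capital of France is")
rare_answer, rare_ctx = most_plausible_next(counts, "the Zorblatt paper was published in")
```

The dense fact is answered from its full five-word context. The rare fact has too little evidence, so the predictor falls back to the generic "paper was published in" pattern and returns 2020: a well-formed year that is wrong for this paper.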
The Emergence Argument
There is an ongoing fight about what else pretraining produces.
Jason Wei and colleagues at Google claimed in 2022 (arXiv:2206.07682) that certain abilities, like multistep arithmetic and instruction following, appear sharply at specific scales rather than gradually. The label that stuck was "emergent."
Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo at Stanford pushed back at NeurIPS 2023 (arXiv:2304.15004). They showed the apparent sharpness often disappears when you switch to a continuous evaluation metric instead of a strict pass-fail one. The capability was scaling smoothly the whole time. The bar for credit was just discontinuous.
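The metric effect can be reproduced with arithmetic alone. The sketch below assumes a task whose answer is 10 tokens long, token errors that are independent, and a made-up series of per-token accuracies that rises steadily with scale. None of these numbers come from either paper.

```python
def exact_match_rate(per_token_acc, answer_len):
    """Chance that every token is right, assuming independent errors."""
    return per_token_acc ** answer_len

# Hypothetical per-token accuracy at six increasing model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
strict = [exact_match_rate(p, 10) for p in per_token]

for p, s in zip(per_token, strict):
    print(f"per-token {p:.2f} -> exact-match {s:.3f}")
```

Per-token accuracy climbs in even steps. Exact-match sits near zero for the first three scales and then shoots up, because the pass-fail metric only gives credit when all ten tokens are right.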
The debate is not settled. What is settled is that pretraining alone produces something specific. It is called a base model.
What Pretraining Hands You
A base model is not the helpful assistant most users have met.
It is a fluent next-token predictor with no particular orientation toward your intent. Ask a base model "What is the capital of France?" and it might tell you Paris. It might also continue with four more questions in the same format, because that is what a list of capital-city quiz questions looks like in the data it learned from.
It will not refuse anything. It will not "try to help." It does not know what helpfulness is. It has only learned what comes next.
Turning that artifact into something that follows instructions, declines requests, and behaves like an assistant is a separate phase. That is the next article in this series.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Sources
- Brown, Tom B. et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
- Raffel, Colin et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR 21 (2020).
- Kaplan, Jared et al. "Scaling Laws for Neural Language Models." OpenAI, 2020.
- Hoffmann, Jordan et al. "Training Compute-Optimal Large Language Models." DeepMind, 2022.
- Touvron, Hugo et al. "LLaMA: Open and Efficient Foundation Language Models." Meta AI, 2023.
- Wei, Jason et al. "Emergent Abilities of Large Language Models." TMLR, 2022.
- Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023.
Part of the Inside the LLM series.



