An LLM Is One Giant Function

Mechanically, a large language model is one thing: a function that takes a sequence of tokens and outputs a probability distribution over the next token.

That's it.

You feed it text. It guesses what comes next. You sample from the guess. You append. Feed it back. Repeat.
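That loop is short enough to write out. What follows is a minimal sketch, not how any real model is served: `TOY_MODEL` is a made-up lookup table standing in for the network, and it only looks at the last token, where a real LLM conditions on the whole sequence. The names `next_token_distribution` and `generate` are illustrative.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# A real LLM conditions on the entire sequence; this table is for illustration only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | tokens). The toy version uses only the last token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt, max_new_tokens=10, seed=0):
    """Sample one token at a time, appending each sample to the input."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)                 # it guesses what comes next
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]  # sample from the guess
        tokens.append(token)                                   # append
        if token == "</s>":                                    # stop at end-of-sequence
            break
    return tokens                                              # otherwise feed it back, repeat


print(generate(["<s>"]))
```

Swap the lookup table for a neural network with billions of parameters and the loop itself stays the same shape.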

The training objective is the same operation. Show the model text. Have it predict the next token. Score the prediction against the actual next token. Update the parameters. Do this across trillions of tokens.
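Here is one training step in miniature. This is a sketch under heavy assumptions: a three-word vocabulary, a single fixed context, and "parameters" that are just the three logits themselves. The scoring rule is cross-entropy, the standard loss for next-token prediction; the helper names are illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Score the prediction: negative log-probability of the actual next token."""
    return -math.log(probs[target_index])


def sgd_step(logits, target_index, lr=0.5):
    """Update the parameters. For softmax plus cross-entropy, the gradient
    with respect to the logits is (probs - one_hot(target))."""
    probs = softmax(logits)
    return [
        logit - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


vocab = ["cat", "dog", "sat"]
logits = [0.0, 0.0, 0.0]          # toy parameters: one score per vocabulary item
target = vocab.index("sat")       # the actual next token in the training text

loss_before = cross_entropy(softmax(logits), target)
logits = sgd_step(logits, target)
loss_after = cross_entropy(softmax(logits), target)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss drops after the update because the step moves probability toward the token that actually came next. Real training repeats that across every position in every document.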

Reasoning, code, dialogue, agents. All of it sits on top of that one move.

What A Token Is

A token isn't a word. It's a sub-word unit produced by a tokenizer, usually with byte-pair encoding (BPE) or a close variant. For English text, a token averages roughly four characters. Common words are often a single token. Rare words get split into several.

OpenAI's GPT-4 tokenizer turns "Hello, world" into three tokens: "Hello", the comma, and " world" with its leading space. My name, "Justin," is one or two depending on the model.
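The core of byte-pair encoding fits in a few lines. This is a toy sketch of the merge step only, assuming a tiny made-up word-frequency table; production tokenizers work on bytes, learn tens of thousands of merges, and handle whitespace and special tokens. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical corpus: each word starts as a tuple of characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 1, tuple("new"): 3}

for _ in range(2):  # learn two merges
    corpus = merge_pair(corpus, most_frequent_pair(corpus))

print(corpus)
```

After two merges the frequent word "low" is a single symbol, while "lower" and "lowest" are still split into pieces. That is the pattern described above: common words collapse to one token, rare ones stay fragmented.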

This matters because a model's context window, the maximum input it can attend to at once, is measured in tokens, not characters or words. A 100,000-token context window holds roughly 75,000 words of English, about the length of a typical novel.
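For quick budgeting, the rough averages above are enough. The sketch below assumes four characters per token and 0.75 words per token, both English-only rules of thumb taken from the figures in this section; for an exact count you would run the model's actual tokenizer. The function names and the `reserved_for_output` parameter are illustrative.

```python
import math

CHARS_PER_TOKEN = 4      # rough English average
WORDS_PER_TOKEN = 0.75   # implied by 100,000 tokens being roughly 75,000 words


def estimate_tokens(text):
    """Rough token count from character length. Not a substitute for a real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def words_that_fit(context_tokens):
    """Approximate number of English words a context window can hold."""
    return int(context_tokens * WORDS_PER_TOKEN)


def fits_in_context(text, context_tokens, reserved_for_output=0):
    """Check whether text likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= context_tokens - reserved_for_output


print(words_that_fit(100_000))
print(fits_in_context("word " * 1_000, context_tokens=2_000, reserved_for_output=500))
```

Treat the result as an estimate. Code, non-English text, and unusual formatting can tokenize far less efficiently than plain English prose.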

You don't pay for words. You pay for tokens.

Why "Next Token Prediction" Undersells It

Saying an LLM "predicts the next token" is true. It also makes the thing sound smaller than it is.

To predict the next token well, across a massive and varied corpus, the model has to encode a staggering amount of statistical structure. Syntax. Facts. Code patterns. Translation patterns. Conversational structure. The shape of arguments. The shape of jokes.

GPT-3, in 2020, had 175 billion parameters trained on roughly 300 billion tokens of text. That single objective, run at sufficient scale, produced a system that could pass standardized tests, write working code, translate between languages, and carry on multi-turn conversations.

Jurafsky and Martin, in the current draft of Speech and Language Processing (third edition, manuscript released January 2025), cover modern LLMs in chapters 10 through 12. Their treatment emphasizes that next-token prediction is the training objective, not a sufficient description of what the trained model actually represents.

The model is a compressed, lossy encoding of patterns across an enormous training corpus. That encoding turns out to be useful for tasks far beyond raw text continuation.

The Emergence Debate

Whether LLMs have "emergent abilities" that appear suddenly at scale is still contested.

The original claim came from Jason Wei and 15 co-authors at Google Research and collaborating institutions in a June 2022 paper called "Emergent Abilities of Large Language Models." They defined an emergent ability as one that isn't present in smaller models but appears in larger ones. They reported sharp performance jumps on tasks like arithmetic, multi-step reasoning, and word unscrambling once models passed certain scale thresholds.

In April 2023, Rylan Schaeffer, Brando Miranda, and Oluwasanmi Koyejo at Stanford published "Are Emergent Abilities of Large Language Models a Mirage?" which was accepted to NeurIPS 2023. They argued many of those jumps disappear if you use smooth, continuous evaluation metrics instead of all-or-nothing ones such as exact match. The "emergence," in their reanalysis, was an artifact of how performance was being measured.
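The measurement argument is easy to see with toy numbers. This sketch is an illustration of the idea, not a reproduction of their analysis: it assumes per-token accuracy rises smoothly with scale and that token errors are independent, so the chance of getting a whole multi-token answer exactly right is per-token accuracy raised to the answer length. The accuracy values are made up.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right,
    assuming independent token errors (a simplification)."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for a series of increasingly large models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10

rows = [(p, exact_match_rate(p, ANSWER_LENGTH)) for p in per_token]

print("per-token  exact-match")
for p, em in rows:
    print(f"{p:9.2f}  {em:11.3f}")
```

The left column climbs steadily. The right column sits near zero for most of the range (0.5 to the tenth power is about 0.001) and then shoots up (0.9 to the tenth is about 0.35). Same underlying improvement, two very different-looking curves.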

The debate isn't settled. As of 2026, the working view is roughly this: some capabilities improve smoothly with scale, some do show sharper transitions on certain benchmarks, and the right way to evaluate scale-dependent behavior is still being worked out.

Both papers are worth reading. The honest position is that the truth lives somewhere in between, and the people claiming clean answers in either direction are selling you something.

The Scale Has Changed Everything

The Stanford HAI 2025 AI Index Report is the best single document for tracking what's happened. A few numbers from it.

In 2024, nearly 90 percent of notable AI models came from industry rather than academia. Training compute for frontier models has been roughly doubling every five months. The inference cost for systems performing at the GPT-3.5 level dropped more than 280-fold between November 2022 and October 2024.
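Those rates are easier to feel once converted to per-year and per-month terms. The arithmetic below is a back-of-the-envelope sketch that assumes steady exponential change, which the report does not claim; the 23-month span is simply the count of months from November 2022 to October 2024. The function names are illustrative.

```python
import math


def annual_growth_factor(doubling_months):
    """How much a quantity multiplies in 12 months if it doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)


def halving_time_months(fold_drop, months):
    """Months per halving, if a `fold_drop`-fold decline is spread evenly over `months`."""
    return months / math.log2(fold_drop)


compute_growth = annual_growth_factor(5)          # training compute doubling every five months
cost_halving = halving_time_months(280, 23)       # 280-fold drop over 23 months

print(f"training compute: about {compute_growth:.1f}x per year")
print(f"inference cost: halves about every {cost_halving:.1f} months")
```

Under that assumption, training compute grows a bit more than fivefold per year, and GPT-3.5-level inference cost halves in under three months.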

What was a research curiosity in 2020 is now a deployed industrial technology in 2026. The price of using one of these models has collapsed. The price of training a frontier one has exploded.

Model vs System

An LLM, at the core, predicts the next token. Very well. Using a transformer-shaped function with billions of parameters, trained on a corpus larger than any human could read in a lifetime.

Everything else (reasoning, tool use, memory, agentic behavior) is either an emergent property of that training, depending on what you believe about emergence, or, more often, external scaffolding built around the model by other software.
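A toy version of that scaffolding makes the split concrete. Everything here is hypothetical: `fake_model` stands in for a real LLM call, and the `CALL`, `TOOL_RESULT`, and `FINAL` markers are an invented text protocol, not any vendor's API. The point is only that the tool runs in ordinary code outside the model.

```python
def fake_model(transcript):
    """Hypothetical stand-in for an LLM: text in, text out."""
    if "TOOL_RESULT" not in transcript:
        return "CALL add 2 3"                     # the "model" asks for a tool
    result = transcript.rsplit("TOOL_RESULT ", 1)[1].strip()
    return f"FINAL The answer is {result}."


def add_tool(a, b):
    """An ordinary function. The model never executes this; the harness does."""
    return str(int(a) + int(b))


TOOLS = {"add": add_tool}


def run_harness(question, model=fake_model, max_steps=5):
    """The system around the model: call it, run any requested tool, feed the result back."""
    transcript = f"USER {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += f"MODEL {reply}\n"
        if reply.startswith("FINAL "):
            return reply[len("FINAL "):]
        if reply.startswith("CALL "):
            name, *args = reply[len("CALL "):].split()
            transcript += f"TOOL_RESULT {TOOLS[name](*args)}\n"
    return None                                   # gave up after max_steps


print(run_harness("What is 2 + 3?"))
```

The model only ever produces text. The loop, the tool registry, and the growing transcript all live in the harness.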

That distinction between the model and the system around it is the bridge to the next two parts of this series. Part 2 opens the model up. Part 3 steps back and looks at the harness layer, the code and tools that wrap the model and turn it into the thing people actually use.

The Hugging Face LLM Course covers all of this at a practical level, including tokenization, transformer internals, fine-tuning, and evaluation. It's a widely used hands-on resource for engineers learning to work with LLMs. If you want to go deeper than this article goes, start there.

Part of the AI Foundations series.

