Why LLMs Can't Count the R's in Strawberry

Ask GPT-4 how many R's are in "strawberry." For years it would tell you two. The model has never seen the letters. It sees one token, then another, then maybe a third. By the time anything resembling reasoning kicks in, the word has already been chopped into pieces the model treats as atoms.

The step that turns what you typed into what the model actually processes is called tokenization. It runs before the model, turning text into tokens, and again after the model, turning tokens back into text. It is a separate program. Its own training data. Its own algorithm. Its own bugs. Andrej Karpathy spent a whole 2024 tutorial on it called "Let's build the GPT Tokenizer," and his main point was that a surprising amount of weird LLM behavior is the tokenizer's fault, not the model's.

You type a string. A separate program called the tokenizer chops it into pieces. Each piece gets looked up in a fixed vocabulary and turned into an integer. The model only ever sees integers. The transformer math, the attention heads, the matrix multiplications, all of it runs on a sequence of numbers like [785, 4128, 6981].

No letters. No words. The model's alphabet is a list of integer IDs.
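Here is a minimal sketch of that round trip. The vocabulary, the pieces, and the IDs are all made up for illustration, and the greedy longest-match lookup is a stand-in: real tokenizers apply learned merge rules instead.

```python
# Toy vocabulary: every piece and ID here is invented for illustration.
VOCAB = {"The": 785, " straw": 4128, "berry": 6981, " ": 220}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match lookup (an assumption, not how BPE encodes)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest vocabulary piece that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return ids

def decode(ids: list[int]) -> str:
    """Map IDs back to their pieces and glue the pieces together."""
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = encode("The strawberry")
print(ids)          # [785, 4128, 6981]
print(decode(ids))  # The strawberry
```

Everything between encode and decode is the model's job, and all it ever touches is the list of integers.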

The dominant chopping algorithm is byte pair encoding, or BPE. It started life as a data compression scheme described by Philip Gage in 1994. It reached language models through a 2016 paper by Rico Sennrich, Barry Haddow, and Alexandra Birch at the University of Edinburgh called "Neural Machine Translation of Rare Words with Subword Units."

The trick is mechanical. Start with raw characters. Repeatedly merge the most common adjacent pair into a new symbol. Run it long enough and you get a vocabulary of subword pieces. Common chunks like " the" and " and" stay whole. Rare words get split into fragments.
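The merge loop fits in a few lines. This is a minimal sketch, assuming a tiny hand-picked word list and character-level starting symbols; the function name train_bpe is hypothetical, and production tokenizers typically start from bytes and train on far more text.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules from a word list; returns merges in the order learned."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

words = ["low", "low", "lower", "lowest", "newest", "newest"]
print(train_bpe(words, num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

After two merges, "low" is already a single symbol. The rarer endings are still in pieces.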

The Sennrich paper reported BLEU score improvements of up to 1.1 and 1.3 over a back-off dictionary baseline on English to German and English to Russian translation. That helped make BPE the default for neural translation, and later for GPT.

The other widely used tool is SentencePiece, published by Taku Kudo and John Richardson at Google in 2018. It is a library rather than a single algorithm: it can train either BPE or unigram language model vocabularies. SentencePiece operates on raw Unicode text without splitting it into words first. It treats a space as an ordinary character by replacing it with a visible marker, "▁" (U+2581), which looks like an underscore. T5, the original LLaMA models, and many multilingual systems use SentencePiece because it doesn't assume anything about how a language puts whitespace between words.
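The whitespace idea is simple enough to sketch by hand. This is not the SentencePiece library, only the escaping step it is built around; the function names and the segmentation below are hypothetical.

```python
MARKER = "\u2581"  # the "▁" character, a visible stand-in for a space

def escape_whitespace(text: str) -> str:
    """Turn spaces into an ordinary symbol so they can live inside pieces."""
    return text.replace(" ", MARKER)

def unescape_whitespace(pieces: list[str]) -> str:
    """Glue pieces together and turn the marker back into spaces."""
    return "".join(pieces).replace(MARKER, " ")

print(escape_whitespace("Hello world"))  # Hello▁world

# A hypothetical segmentation: the space travels with the piece that follows it.
pieces = ["Hello", MARKER + "wor", "ld"]
print(unescape_whitespace(pieces))       # Hello world
```

Because the space is just another symbol, decoding is lossless without any rules about where words begin.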

OpenAI publishes the exact tokenizers they use in a library called tiktoken. You can run it yourself. GPT-3.5 and GPT-4 use a vocabulary called cl100k_base with 100,277 tokens. GPT-4o uses a bigger one called o200k_base with roughly 200,000 tokens. Bigger vocabulary, better compression on non-English text.

The vocabulary is weird when you look at it. "tokenization" splits into two tokens: "token" and "ization." "Antidisestablishmentarianism" turns into five separate tokens. The name " Justin" with a leading space is a single token because it's common in training data. A rarer name gets split into two or three. None of this is about meaning. It's about what byte sequences happened to be frequent when the tokenizer was trained.

This is why models miscount letters. The word "strawberry" arrives as somewhere between one and three tokens, depending on the tokenizer and on whether a space comes before it. The model has no access to the characters inside a token, any more than you have access to the individual atoms in your fingertip when you tap a screen.

To count R's, the model would have to learn, through training examples, which tokens contain which letters. It mostly hasn't.
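A small sketch makes the missing step concrete. The token IDs and the three-way split below are hypothetical, picked only to illustrate the point.

```python
# Hypothetical token IDs and pieces, chosen only for illustration.
ID_TO_PIECE = {101: "str", 102: "aw", 103: "berry"}

token_ids = [101, 102, 103]  # what the model would receive for "strawberry"

# With characters available, counting is a one-liner.
print("strawberry".count("r"))  # 3

# From IDs alone, the count is only recoverable through a table
# that maps each ID back to its spelling.
r_per_token = {tid: ID_TO_PIECE[tid].count("r") for tid in token_ids}
print(r_per_token)                # {101: 1, 102: 0, 103: 2}
print(sum(r_per_token.values()))  # 3
```

The tokenizer owns that table. The model only gets the IDs, so it has to absorb the spellings indirectly from training text.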

This is also why arithmetic on long numbers is hard. Numbers get tokenized inconsistently. The number 1234 might be one token, but 12345 might be split into "1234" and "5," or "123" and "45." Then the model has to do math on token IDs that have no numerical structure. The fact that it works at all is a small miracle.
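To see why the splits hurt, here is a sketch that assumes one particular scheme: digit runs cut left to right into groups of at most three. That rule and the name chunk_digits are assumptions for illustration; real tokenizers differ in the details.

```python
def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string left to right into groups of at most `size` digits."""
    return [number[i:i + size] for i in range(0, len(number), size)]

for n in ["1234", "12345", "123456"]:
    print(n, chunk_digits(n))
# 1234 ['123', '4']
# 12345 ['123', '45']
# 123456 ['123', '456']
```

The chunk "123" leads all three numbers, but it stands for 1,230 in the first, 12,300 in the second, and 123,000 in the third. The place value of a chunk depends on how many digits come after it, and nothing in the token ID says so.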

Tokenization also has a fairness problem. In 2023, Aleksandar Petrov and colleagues at the University of Oxford published a paper called "Language Model Tokenizers Introduce Unfairness Between Languages." They showed that some languages need more than ten times as many tokens as English to express the same content with GPT's tokenizer. A non-English user pays more per query, waits longer for the response, and runs out of context window faster. It's not malicious. It's an artifact of training the tokenizer mostly on English text. The bill is still real.
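One contributing factor is visible before any tokenizer is trained. The sketch below assumes a byte-level tokenizer, where the UTF-8 byte count is the sequence length before any merges apply. It does not reproduce the paper's measurements, which come from real tokenizers; the sample words and the helper name utf8_cost are chosen for illustration.

```python
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

def utf8_cost(text: str) -> tuple[int, int]:
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for language, word in samples.items():
    chars, nbytes = utf8_cost(word)
    print(f"{language}: {chars} chars, {nbytes} bytes")
# English: 5 chars, 5 bytes
# Russian: 6 chars, 12 bytes
# Japanese: 5 chars, 15 bytes
```

Merges learned mostly from English text shrink the English sequence further and leave the others closer to their raw byte length.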

So when someone says a model "understood the question," remember what actually happened. The string got chopped by a frozen program whose vocabulary was fixed before the model saw a single training example. The integers got fed into a neural network that has never read an English letter in its life. The output integers got mapped back to text by the same frozen program. Everything that looks like understanding happens between those two translations.

The model's relationship with words is closer to your relationship with street addresses. You see "1600 Pennsylvania Avenue" and know it points to a famous building. You don't experience the digits as individual numerals. You don't experience "Pennsylvania" as a sequence of letters. You experience the whole thing as a unit. That's the model, on every word, all the time.

The next layer up is where the real math starts. The tokens become vectors. The vectors flow through the transformer. A probability distribution comes out the other end.

We'll get to that.

But none of it makes sense until you accept that the model never reads what you wrote.

This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.

Part 1 of the Inside the LLM series.
