Plausible, Not True
Hallucinations are not a bug. They are a structural consequence of training a model to predict plausible tokens.
In 2023, Ziwei Ji and a team at the Hong Kong University of Science and Technology published a survey of hallucination in natural language generation in ACM Computing Surveys. They split the phenomenon into two categories. Intrinsic hallucinations contradict the source input. Extrinsic hallucinations add information that the source neither supports nor contradicts, so it cannot be verified against the input at all.
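To make the two categories concrete, here is a toy labeler. It assumes, hypothetically, that both the source and the model's claim have already been reduced to simple (key, value) facts. Real systems have to extract those claims from free text first, which is the hard part. The names `Verdict` and `classify_claim` are illustrative, not from the survey.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    INTRINSIC = "intrinsic"  # the claim contradicts the source
    EXTRINSIC = "extrinsic"  # the source neither supports nor contradicts the claim


def classify_claim(source: dict[str, str], key: str, value: str) -> Verdict:
    """Label one (key, value) claim against facts extracted from the source."""
    if key not in source:
        return Verdict.EXTRINSIC  # nothing in the source to check against
    if source[key] != value:
        return Verdict.INTRINSIC  # the source says something different
    return Verdict.SUPPORTED


source_facts = {"venue": "ACM Computing Surveys", "year": "2023"}
print(classify_claim(source_facts, "venue", "ACM Computing Surveys"))  # Verdict.SUPPORTED
print(classify_claim(source_facts, "year", "2021"))                    # Verdict.INTRINSIC
print(classify_claim(source_facts, "first_author", "Ji"))              # Verdict.EXTRINSIC
```

The last claim happens to be true in the world, but it is still extrinsic: the source cannot confirm it. Extrinsic is a statement about verifiability, not about falsity.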
That distinction matters more than the word "hallucination" suggests. It is not a single failure. It is several different failure modes that share a name because the surface looks the same: confident text, wrong facts.
The root cause is the training objective itself. A model trained to maximize the likelihood of the next token in its training text has no internal signal that distinguishes "this matches the world" from "this matches the distribution of training text about the world." Pretraining text is full of confident assertions, so the model learns to produce confident assertions. The prompt invites a specific answer, so the model produces one, whether or not it actually has evidence for it.
That is the bleak read. Calibration research complicates it.
In 2022, Saurav Kadavath and a team at Anthropic published a paper called "Language Models (Mostly) Know What They Know." On multiple-choice and true/false questions presented in the right format, they found that larger models are reasonably well calibrated: the probability a model assigns to its own answer tracks how often that answer is actually correct.
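One common way to put a number on calibration is expected calibration error (ECE): bucket answers by the model's stated confidence, then measure the gap between average confidence and actual accuracy in each bucket. The sketch below uses equal-width bins; it illustrates the general idea and is not necessarily the exact measurement used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error (ECE) with equal-width confidence bins.

    confidences: the model's probability for its chosen answer, each in [0, 1]
    correct: matching booleans, True if that answer was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # a confidence of exactly 1.0 goes in the last bin instead of overflowing
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # weight each bin's |confidence - accuracy| gap by its share of all answers
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Says 80%, right 4 times out of 5: about 0.0 (well calibrated)
print(expected_calibration_error([0.8] * 5, [True, True, True, True, False]))
# Says 90%, right 2 times out of 4: about 0.4 (overconfident)
print(expected_calibration_error([0.9] * 4, [True, True, False, False]))
```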
They went further, training models to predict, ahead of time and without reference to any particular answer, the probability that they would be able to answer a question correctly. They called this P(IK), for "the probability that I know." The models did meaningfully better than chance and partially generalized across tasks, but their P(IK) estimates were less well calibrated on new task distributions.
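A P(IK) predictor needs a training target. One plausible way to build it, assumed here rather than taken from the paper (whose exact labeling may differ), is to sample several answers to each question and record the fraction that are correct. A Brier score then measures how close the predicted P(IK) values come to those targets. `sample_answer`, `p_ik_target`, and `toy_model` are hypothetical names.

```python
import random


def p_ik_target(sample_answer, question, gold, n_samples=20, seed=0):
    """Hypothetical P(IK) label: the fraction of sampled answers that match the gold answer.

    sample_answer(question, rng) is an assumed callable that draws one answer from the model.
    """
    rng = random.Random(seed)
    hits = sum(sample_answer(question, rng) == gold for _ in range(n_samples))
    return hits / n_samples


def brier_score(predicted, targets):
    """Mean squared gap between predicted P(IK) values and observed targets (lower is better)."""
    if len(predicted) != len(targets) or not predicted:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)


# Toy stand-in for a model that answers correctly about 70% of the time.
def toy_model(question, rng):
    return "Paris" if rng.random() < 0.7 else "Lyon"


target = p_ik_target(toy_model, "What is the capital of France?", "Paris")
print(target, brier_score([0.9, 0.2], [1.0, 0.0]))
```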
The picture is nuanced. Models are not totally blind to their own uncertainty. But they do not expose a clean internal "I don't know" signal that can be read off without specific elicitation or training.
Lost in the Middle
There is a second failure mode that gets less attention. Nelson Liu and a team at Stanford released "Lost in the Middle: How Language Models Use Long Contexts" in 2023; it later appeared in the Transactions of the Association for Computational Linguistics. They tested a range of models, including ones marketed explicitly as long-context.
The results followed a consistent U shape. Performance on multi-document question answering and key-value retrieval was often highest when the relevant information sat at the beginning or the end of the context, and degraded significantly when it sat in the middle.
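The experimental idea is easy to reproduce in miniature: keep the question and the documents fixed, move only the document that contains the answer, and measure accuracy at each position. The sketch below assumes a hypothetical `answer_fn` wrapping whatever model is under test and scores an answer as correct if it contains the gold string. The prompt format is illustrative, not the paper's exact template.

```python
def build_position_sweep(question, gold_doc, distractors):
    """Yield (position, prompt) pairs with the gold document moved through every slot."""
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        context = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
        yield pos, f"{context}\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(answer_fn, items):
    """items: list of (question, gold_doc, distractors, gold_answer) tuples."""
    hits, totals = {}, {}
    for question, gold_doc, distractors, gold_answer in items:
        for pos, prompt in build_position_sweep(question, gold_doc, distractors):
            totals[pos] = totals.get(pos, 0) + 1
            # simple substring check: correct if the gold answer appears in the output
            if gold_answer.lower() in answer_fn(prompt).lower():
                hits[pos] = hits.get(pos, 0) + 1
    return {pos: hits.get(pos, 0) / totals[pos] for pos in sorted(totals)}


def ends_only_model(prompt):
    """Fake model that only 'reads' the first and last documents, to show the U shape."""
    docs = [block for block in prompt.split("\n\n") if block.startswith("Document")]
    visible = docs[0] + " " + docs[-1]
    return "Paris" if "Paris" in visible else "I am not sure"


items = [(
    "What is the capital of France?",
    "Paris is the capital of France.",
    ["Berlin is the capital of Germany.", "Madrid is the capital of Spain.", "Rome is the capital of Italy."],
    "Paris",
)]
print(accuracy_by_position(ends_only_model, items))  # {0: 1.0, 1: 0.0, 2: 0.0, 3: 1.0}
```

The fake model exaggerates the effect for clarity. Real models degrade gradually in the middle rather than failing outright.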
That changes how you should think about long context windows. Stuffing more text into a prompt is not a clean way to install knowledge in the model. Current models do not use every position in the context equally well. The model can technically see the middle of the prompt. It does not use it as reliably.
People dropping a 200-page PDF into a chatbot and trusting the answer should know this. If the same pattern holds at that scale, the middle of that PDF is functionally fuzzier than the ends.
Catching It Without an Answer Key
Detection does not always need a labeled answer key. In 2023, Potsawee Manakul, Adian Liusie, and Mark Gales at Cambridge published "SelfCheckGPT" at EMNLP. The core idea is simple. Sample the model's answer to the same prompt several times. On facts it has reliable representations of, the samples tend to be consistent. On facts it is confabulating, the samples diverge.
The consistency signal is not perfect. Prompts deliberately steered toward a single answer will collapse the variance whether or not the answer is correct. But the signal is usable, and sampling-based consistency checks of this kind have become a common building block in hallucination detection.
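A stripped-down version of the consistency idea, for short factual answers, is to count how often the samples agree with the most common one. This is a simplified stand-in using exact-match agreement after light normalization; SelfCheckGPT itself scores each sentence of a response against the sampled passages with several more sophisticated variants. The function names are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization so trivial formatting differences don't count as disagreement."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def consistency_score(samples: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


print(consistency_score(["Paris.", "paris", " Paris ", "Lyon"]))  # ('paris', 0.75)
print(consistency_score(["1912", "1915", "1908", "1921"]))        # ('1912', 0.25)
```

A low score flags a likely confabulation. A high score only means the samples agree: five identical wrong answers score 1.0, which is exactly the limitation described above.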
What Follows
Three things follow from this body of work.
First, hallucinations are not a bug that gets patched out in the next release. They are a structural consequence of training to predict plausible tokens rather than to know things. A model tuned to say "I don't know" whenever it is unsure would be more honest. It would also withhold some answers it would have gotten right. The objective rewards fluency. Fluency is what it gives you.
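The honesty/usefulness trade-off can be made concrete with selective answering: respond only when confidence clears a threshold, otherwise abstain. The sketch below uses made-up confidences and correctness labels to show that raising the threshold raises accuracy on the questions answered while withholding answers that would have been correct. `selective_answering` is a hypothetical helper.

```python
def selective_answering(confidences, correct, threshold):
    """Answer only when confidence >= threshold; abstain ("I don't know") otherwise.

    Returns (coverage, accuracy on answered questions, correct answers withheld).
    """
    answered = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    withheld_correct = sum(1 for c, ok in zip(confidences, correct) if c < threshold and ok)
    coverage = len(answered) / len(confidences)
    accuracy = sum(1 for _, ok in answered if ok) / len(answered) if answered else float("nan")
    return coverage, accuracy, withheld_correct


# Made-up data: five questions, the model's confidence, and whether it was right.
conf = [0.95, 0.9, 0.6, 0.55, 0.3]
right = [True, True, True, False, False]
print(selective_answering(conf, right, 0.5))  # (0.8, 0.75, 0)
print(selective_answering(conf, right, 0.8))  # (0.4, 1.0, 1)
```

At the stricter threshold every answer given is correct, but one correct answer is lost to "I don't know."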
Second, models have some internal information about their own uncertainty, but extracting it reliably is an open research problem. The calibration is partial. The generalization is shaky. There is no clean read of "the model is making this up right now" that works across tasks.
Third, claims about "the model remembered correctly" need to be tested. The model that confidently cites a paper may have generated the author list from common name patterns. That is a documented failure mode. Any system that depends on factual accuracy needs an external verification layer. Not optional. The model is not going to refuse on its own.
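A verification layer can be as simple as checking generated citations against a trusted bibliographic record before they reach a reader. In the sketch below, `Citation`, `verify_citation`, and the in-memory `trusted_index` are hypothetical; in practice the index would be a real bibliographic database. The example record is the SelfCheckGPT entry from this article's sources, and the "generated" citation drops a co-author, the kind of error a model can produce while sounding confident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    title: str
    authors: tuple[str, ...]
    year: int


def verify_citation(cited: Citation, trusted_index: dict[str, Citation]) -> list[str]:
    """Compare a model-generated citation with a trusted record; return the problems found."""
    record = trusted_index.get(cited.title.casefold())
    if record is None:
        return ["title not found in trusted index"]
    problems = []
    if [a.casefold() for a in cited.authors] != [a.casefold() for a in record.authors]:
        problems.append(f"author mismatch: expected {', '.join(record.authors)}")
    if cited.year != record.year:
        problems.append(f"year mismatch: expected {record.year}")
    return problems


TITLE = "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
index = {TITLE.casefold(): Citation(TITLE, ("Manakul", "Liusie", "Gales"), 2023)}
generated = Citation(TITLE, ("Manakul", "Gales"), 2023)
print(verify_citation(generated, index))  # ['author mismatch: expected Manakul, Liusie, Gales']
```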
People say LLMs lie. They don't lie. Lying requires a model of the truth and an intent to deviate from it. These systems do not have that. They produce a likely continuation of the prompt according to the training distribution. Sometimes that continuation matches reality. Sometimes it doesn't. The model does not know the difference unless you build the machinery to tell it.
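A single decoding step shows why "likely" is the operative word. The model turns scores (logits) into probabilities and then picks a token, either the most probable one (greedy) or a random draw weighted by probability (sampling). The three-word vocabulary and the logits below are invented for illustration; a real model scores its whole vocabulary at every step.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(vocab, logits, temperature=1.0, greedy=False, rng=None):
    """Pick the next token by probability alone; nothing here checks facts."""
    probs = softmax(logits, temperature)
    if greedy:
        return vocab[max(range(len(vocab)), key=probs.__getitem__)]
    rng = rng or random.Random()
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["Paris", "Lyon", "Marseille"]
logits = [2.0, 1.0, 0.1]  # invented scores
print(next_token(vocab, logits, greedy=True))  # Paris
print(next_token(vocab, logits, rng=random.Random(42)))
```

If the invented logits had favored "Lyon", the same code would emit "Lyon" just as fluently. The selection rule only sees probabilities.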
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Part of the Inside the LLM series.
Sources
- Ji, Ziwei et al. "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys 55(12), 2023.
- Kadavath, Saurav et al. "Language Models (Mostly) Know What They Know." Anthropic, 2022.
- Liu, Nelson F. et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024.
- Manakul, Potsawee, Adian Liusie, and Mark J. F. Gales. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023.
- Ouyang, Long et al. "Training language models to follow instructions with human feedback." OpenAI, 2022.
- Brown, Tom B. et al. "Language Models are Few-Shot Learners." NeurIPS 2020.



