Your Chatbot Doesn't Remember You
When ChatGPT or Claude appears to recall what you said three weeks ago, the language model itself is doing none of the remembering. Three different systems get confused for one.
You open ChatGPT. It references something you mentioned three weeks ago. Your dog's name. A project. The fact that you're vegetarian.
It feels like the model remembers you.
It doesn't.
Three different systems produce that experience, and only one of them is the language model itself. The other two are software wrapped around the model. Untangling them is the most useful conceptual move you can make if you want to understand what these systems actually are.
What Counts as "Remembering"
When a chatbot appears to recall something, one of three things is happening.
One: the information was in the model's training data, and the model is pattern-matching from weights that were fixed when training ended, months or years ago. The model "knows" Paris is in France because that fact appeared many times in the text it was trained on. Generic world knowledge.
Two: the information is in the current conversation, sitting inside the context window. The model can read it because it's right there in the input. Session ends. Gone.
Three: the information was in a previous session, and some piece of software outside the model stored it and reinjected it into the prompt at the start of this one.
Training, in-context, retrieval. Three different mechanisms. People collapse them into "the AI remembers" and lose the ability to predict what these systems will and won't do.
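The in-context case can be sketched in a few lines. This is a toy, not any vendor's implementation: `fake_model` is a hypothetical stand-in for a language model, written as a pure function of its prompt, and `Session` is a hypothetical application-side wrapper that resends the whole transcript on every turn.

```python
def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model: output depends only on the prompt."""
    # Toy behavior: the name is "recalled" only if it is literally in the input text.
    if "my dog is called Biscuit" in prompt:
        return "Your dog is Biscuit."
    return "I don't know your dog's name."


class Session:
    """The application, not the model, keeps the transcript and resends it each turn."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def ask(self, user_text: str) -> str:
        self.history.append(f"User: {user_text}")
        prompt = "\n".join(self.history)  # everything the model gets to read
        reply = fake_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply


first = Session()
first.ask("Hi, my dog is called Biscuit.")
same_session = first.ask("What is my dog's name?")

second = Session()  # a new session starts with an empty context window
new_session = second.ask("What is my dog's name?")

print(same_session)
print(new_session)
```

The second session cannot produce the name because nothing put it in the prompt. Nothing about `fake_model` changed between the two sessions; only its input did.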
How Retrieval Actually Works
Patrick Lewis and colleagues at Facebook AI Research and University College London introduced retrieval-augmented generation in 2020 in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." The recipe has two parts. A retrieval component scans a corpus of documents and finds a handful that look relevant to your question. A generation component, the language model, takes your question and those passages as input and writes an answer.
The retriever runs first. It pulls. The model reads what was pulled along with the question. It generates.
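The two-part recipe can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical, and the retriever ranks by simple word overlap as a stand-in for the learned retriever in the paper. The generation step is left out; the point is what the model would be handed.

```python
import re

# Hypothetical three-document corpus.
CORPUS = {
    "doc-1": "The cafeteria serves a vegetarian menu on Mondays.",
    "doc-2": "Expense reports are due on the last Friday of each month.",
    "doc-3": "The VPN requires a hardware token issued by IT.",
}


def tokens(text: str) -> set[str]:
    """Lowercase word set, used only for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list[str]:
    """Step 1: rank documents by word overlap with the question, keep the top k."""
    q = tokens(question)
    ranked = sorted(CORPUS, key=lambda doc_id: len(q & tokens(CORPUS[doc_id])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, doc_ids: list[str]) -> str:
    """Step 2 input: the retrieved passages and the question, as plain text."""
    passages = "\n".join(f"[{doc_id}] {CORPUS[doc_id]}" for doc_id in doc_ids)
    return f"Use these passages to answer.\n{passages}\n\nQuestion: {question}"


question = "When are expense reports due?"
doc_ids = retrieve(question, k=1)
prompt = build_prompt(question, doc_ids)
print(prompt)  # this string is what a language model would receive
```

Everything the model would "know" about expense reports arrives as text in `prompt`. Remove the `retrieve` call and that knowledge is gone.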
That retrieval step usually depends on embeddings. Vladimir Karpukhin, Sewon Min, Patrick Lewis, and others published Dense Passage Retrieval in 2020. They trained two encoders, one for questions and one for passages, that map text into a shared vector space. Similarity in that space (DPR scores with the dot product; many libraries use cosine similarity) approximates "are these related." The dual-encoder beat the older BM25 lexical baseline by 9 to 19 percentage points on top-20 retrieval accuracy across open-domain question-answering datasets. Nils Reimers and Iryna Gurevych at TU Darmstadt had published a closely related architectural idea a year earlier in Sentence-BERT. That paper became the basis of the widely used sentence-transformers library. Modern commercial embeddings, such as OpenAI's text-embedding-3-small and text-embedding-3-large, expose the same interface. Text in, fixed-length vector out, similarity for ranking.
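The "similarity for ranking" step can be sketched with hand-written vectors. The three-dimensional vectors below are made up for illustration and do not come from any real embedding model, which would return hundreds or thousands of dimensions. Cosine similarity is used here; it is the dot product of the two vectors after each is scaled to unit length.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Pretend passage embeddings (hypothetical values).
passages = {
    "dogs need daily walks": [0.9, 0.1, 0.0],
    "quarterly tax filing deadlines": [0.0, 0.2, 0.9],
    "puppy training basics": [0.8, 0.3, 0.1],
}

# Pretend embedding of the question "how do I care for my dog" (hypothetical values).
query_vec = [1.0, 0.2, 0.0]

ranked = sorted(passages, key=lambda p: cosine(query_vec, passages[p]), reverse=True)
for text in ranked:
    print(f"{cosine(query_vec, passages[text]):.3f}  {text}")
```

The two dog-related passages rank above the tax passage because their vectors point in nearly the same direction as the query vector. A real system would get the vectors from an encoder and typically use an approximate nearest-neighbor index instead of sorting every passage.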
None of that changes the model.
The retrieved passages get pasted into the prompt at inference time. The model reads them along with your question. It has not learned anything new in the gradient-descent sense. Its weights are identical before and after. Pull the retrieval system out and the model behaves as if it never saw the documents.
Memory Features Are Retrieval
OpenAI shipped ChatGPT memory in 2024. Anthropic describes memory features for Claude in its product documentation. Different vendors, same pattern.
The product stores user-specific facts in a separate database. Next time you start a conversation, the application reads from that store and prepends the relevant entries to the prompt. The model sees them as in-context information, generates a response, and the conversation ends. The model is stateless across sessions. The application is stateful.
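A minimal sketch of that pattern, assuming an in-memory dictionary in place of a real database. `MemoryStore`, `start_session_prompt`, and the prompt wording are hypothetical names chosen for illustration and are not taken from any vendor's product. For simplicity this version prepends every stored fact, where a real product would select the relevant ones.

```python
class MemoryStore:
    """Application-side store of per-user facts. It lives outside the model."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str) -> list[str]:
        return list(self._facts.get(user_id, []))


def start_session_prompt(store: MemoryStore, user_id: str, user_text: str) -> str:
    """Prepend stored facts so the model sees them as ordinary in-context text."""
    facts = store.recall(user_id)
    header = ""
    if facts:
        bullet_list = "\n".join(f"- {fact}" for fact in facts)
        header = f"Known facts about this user:\n{bullet_list}\n\n"
    return header + f"User: {user_text}"


store = MemoryStore()
store.remember("user-1", "Is vegetarian")
store.remember("user-1", "Has a dog named Biscuit")

returning_user = start_session_prompt(store, "user-1", "Suggest a dinner recipe.")
new_user = start_session_prompt(store, "user-2", "Suggest a dinner recipe.")

print(returning_user)
print("---")
print(new_user)
```

The same model would receive both prompts. Any difference in its answers comes from the text the application chose to prepend.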
The labs are explicit about this in their own docs. When OpenAI or Anthropic describe what their memory features do, they describe retrieval and reinjection. They do not describe the model learning between sessions, because it does not.
The Engineering Tradeoff
If you want a system that "knows your company's documents," you have two basic options.
Fine-tune a base model on those documents. Expensive. Slow to update. Risks degrading other capabilities. Hard to audit, because the knowledge is now distributed across billions of weight values and you can't point to where any specific fact lives.
Build a retrieval pipeline that pulls relevant chunks at query time. Cheap to update. Change a document and the next query sees the change. Easy to audit: you can log which passages got pulled into any given response. The downside is that the retrieval system itself becomes a new failure mode. Bad embeddings, bad chunking, or bad ranking, and the model is suddenly answering confidently with the garbage it was fed.
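The update and audit properties of the retrieval option can be sketched like this. `RetrievalIndex` is a hypothetical class, the ranking is plain word overlap in place of embeddings, and the leave-policy text is invented for the example.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class RetrievalIndex:
    """Toy index: chunks can be replaced at any time, and every lookup is logged."""

    def __init__(self) -> None:
        self.chunks: dict[str, str] = {}
        self.audit_log: list[dict] = []

    def upsert(self, chunk_id: str, text: str) -> None:
        # Updating knowledge is a data write. No model training is involved.
        self.chunks[chunk_id] = text

    def query(self, question: str, k: int = 1) -> list[str]:
        q = _words(question)
        ranked = sorted(self.chunks, key=lambda cid: len(q & _words(self.chunks[cid])), reverse=True)
        hits = ranked[:k]
        # Record exactly which chunks were handed to the model for this question.
        self.audit_log.append({"question": question, "chunk_ids": hits})
        return [self.chunks[cid] for cid in hits]


index = RetrievalIndex()
index.upsert("pto-policy", "Employees receive 15 days of paid leave per year.")
index.upsert("travel", "Book travel through the internal portal.")

question = "How many days of paid leave do employees receive?"
before = index.query(question)

index.upsert("pto-policy", "Employees receive 20 days of paid leave per year.")
after = index.query(question)

print(before, after, index.audit_log, sep="\n")
```

The second query sees the revised policy immediately, and the log names the chunk behind each answer. The same sketch also shows the failure mode: if the ranking picks the wrong chunk, that wrong chunk is what the model is given.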
Most production systems combine both. Three things look the same from the outside and are completely different inside: what the model learned in pretraining, what it is doing in-context with the current prompt, and what the surrounding software is inserting into that prompt.
What's Still Open
One frontier research strand is worth flagging before closing.
Anthropic's interpretability team published "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" in May 2024. They used sparse autoencoders to decompose model activations into millions of interpretable features. The team identified features for specific concepts. The Golden Gate Bridge. Code bugs. Sycophancy. Deception. Artificially activating those features changed model behavior in predictable ways.
That is real evidence the model encodes something like concept-aligned internal structure.
The same paper is explicit that the work is partial. The identified features are a sample. The techniques have known limitations. A complete mechanistic account of what a production language model is doing remains an open scientific problem.
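The sparse-autoencoder idea can be sketched at toy scale. Everything here is an assumption for illustration: the weights are set by hand rather than trained, the sizes are tiny (2 activation dimensions, 3 features), and the clamping step only mimics the idea of artificially activating a feature. It is not Anthropic's code or method.

```python
def relu(v: list[float]) -> list[float]:
    return [max(0.0, x) for x in v]


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Hand-set toy weights (hypothetical). Real ones are learned from model activations.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 3 features x 2 activation dims
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]     # 2 activation dims x 3 features
B_DEC = [0.0, 0.0]


def encode(activation: list[float]) -> list[float]:
    """Activation -> non-negative feature strengths, most of which are zero."""
    pre = [a + b for a, b in zip(matvec(W_ENC, activation), B_ENC)]
    return relu(pre)


def decode(features: list[float]) -> list[float]:
    """Feature strengths -> reconstructed activation."""
    return [a + b for a, b in zip(matvec(W_DEC, features), B_DEC)]


activation = [2.0, 0.0]
features = encode(activation)        # only feature 0 fires for this input
reconstruction = decode(features)

steered = list(features)
steered[1] = 5.0                     # artificially activate feature 1
steered_activation = decode(steered)

print(features, reconstruction, steered_activation)
```

Clamping one feature moves the reconstructed activation along that feature's decoder direction. That is the rough shape of the intervention, with none of the scale or training that makes the real result meaningful.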
The model is not magic. It is also not fully understood by the people who built it. We know a lot about the three phases that produced it. We know a lot about the software wrapped around it. We are still learning what is happening inside it.
The model is one component in a stack of software. It is stateless. It does not learn from you. Whatever appears to be memory or personalization is the application doing work the model cannot do.
This article reflects the state of AI tooling as of May 2026. Specific framework names, APIs, and vendor positioning may have shifted.
Part of the Inside the LLM series.
Sources
- Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020)
- Karpukhin, Vladimir et al. "Dense Passage Retrieval for Open-Domain Question Answering" (EMNLP 2020)
- Reimers, Nils and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (EMNLP 2019)
- Anthropic Interpretability Team. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (Transformer Circuits Thread, May 2024)
- Brown, Tom B. et al. "Language Models are Few-Shot Learners" (NeurIPS 2020)



