Take a Look, It's in a Book

In we-keep-calling-it-memory, I called RAG “a library card catalog wearing a memory costume.” I still think that’s right. RAG isn’t memory in the cognitive-science sense. It doesn’t decay, doesn’t reorganize itself with use, doesn’t consolidate.

Calling RAG a library card catalog isn’t a put-down: a great catalog is real engineering, and the library is load-bearing on its own. The highest-scoring memory systems on standard benchmarks tend to be the ones that recognized this. At least one team has shown that verbatim text in a vector store with no LLM extraction reaches ~96% retrieval-recall on the standard long-context-memory benchmark, beating systems that spend an LLM call per memory write.

The interesting question, then, isn’t whether the library matters. It’s how to build a great one: what gets indexed, at what granularity, with what metadata, retrieved how. That’s what this post is about.

What the library has to do

Endel Tulving’s encoding specificity principle¹ says, roughly, that a memory is retrievable to the extent that the cue you have at retrieval time matches the form in which the memory was originally encoded. If you stored “Paris is the capital of France” and someone asks “what’s the capital of France,” the cue and the encoding overlap. Easy retrieval. If you stored a smell and someone asks for a fact, the cue and the encoding don’t share a representation, and even if the smell encodes the fact incidentally, the retrieval fails.

That principle is the whole job description for the library half of memory.

The translation to engineering is unflattering. The cue an agent will get at retrieval time is whatever question the user asks, in whatever words they choose, at whatever level of abstraction they happen to be at. The agent doesn’t get to negotiate. So the library has to make the underlying signal findable from whichever cue arrives, by indexing it under enough different representations that one of them lines up.

The CS frame for this is index choice. You can’t query what you didn’t index for. A database without an index on a column doesn’t fail to query that column; it just degrades to a full scan. A retrieval system without an index that matches the question’s shape doesn’t fail to retrieve; it just degrades to noise. Most agent-memory failures I’ve seen up close are index failures, not embedding-model failures. The team picked one representation (per-message, or per-summary, or per-extracted-fact) and indexed only that. When a question’s cue didn’t fit, no embedding model was going to save them.

The library tier’s job, then, is findability. And findability is a much larger surface than people think. Most of this post is about why.

What gets lost at the door

A common pattern in agent memory looks like this: an LLM reads the conversation as it happens, decides what’s worth remembering, extracts a small set of facts (“user prefers PostgreSQL,” “user lives in Berlin,” “user is allergic to peanuts”), stores those, and discards the rest. Variants get more elaborate (agentic search loops, multi-pass observation, cluster-and-summarize), but the shape is the same: an LLM stands at the door of memory and decides what comes in.

The risk is information-theoretic. Any extraction is lossy compression of a stream you can’t re-acquire. JPEG of a conversation. What’s left is the LLM’s interpretation of what mattered, judged by whatever heuristic the prompt happened to encode at write time.

When the question’s cue at retrieval time matches the LLM’s earlier interpretation, the system finds the memory. When it doesn’t, because the LLM extracted “preference for PostgreSQL” but the question is “what database was the user wrestling with last quarter,” the answer is no longer in the corpus. It can’t be retrieved at any granularity, by any embedding model, because it was thrown out at write time.

The risk shows up empirically. At least one team has shown that verbatim text in a vector store, with no LLM in the write path, reaches 96.6% R@5 on LongMemEval-S² (the standard long-context-memory benchmark), and 99.4% with an optional reranker.³ On a harder benchmark (ConvoMem), verbatim text scores 92.9% while Mem0, a widely-used LLM-fact-extraction system, scores 30-45% across categories.³ Sophisticated extraction pipelines can recover most of the loss with multi-pass or agentic strategies, but they’re paying at write time for something verbatim storage solves for free at read time.

The library half is enough for a lot of questions, if you keep the data lossless on the way in.

What a great library looks like

If we accept that the library half is real, three claims about how to build a good one follow. Each of them deserves its own post. What follows here is the gist.

The unit of retrieval has to match the question’s shape

Most retrieval systems pick one granularity (per-message, per-chunk, per-document) and tune the rest of the pipeline around it. That’s the wrong axis. Different question shapes have different right granularities, and the system that knows them all and fuses their answers wins.

The CS analogue is well-trodden. Image pyramids store the same image at multiple zoom levels because different downstream tasks need different detail; LSM trees store data at multiple compaction levels because different access patterns hit different sizes. Multi-resolution is the standard answer in storage and in vision. It belongs in retrieval too.

Concretely: when an agent is asked “did I ever talk about X,” the question is session-shaped. It’s distributed across a whole conversation, not localized to any one turn. When it’s asked “what exactly did the user say about Y,” the question is turn-shaped. It’s localized, and exact tokens matter. A library that indexes only turns leaves session-shaped questions for dead. A library that indexes only sessions leaves turn-shaped questions for dead. The right answer is to index both, and to fuse the results at retrieval time so the system can take whichever ranking the question’s shape favors.

This lesson cost me two weeks: a per-session embedding had been sitting in the database the whole time, indexed and ready, populated by a sidecar I’d built for clustering. And the retrieval path wasn’t reading it. Adding it as a fan-out tier moved R@5 by 25 points overnight. The granularity I needed already existed. I was treating retrieval like a single-resolution problem when the data shape was already multi-resolution.
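
Here is a minimal sketch of the two-tier idea, assuming a toy bag-of-words vector in place of a real embedding model. Every name in it (Turn, embed, search_turns, search_sessions, fan_out) is hypothetical and chosen for illustration; it is not ghola's code. The merge step is a plain ordered union, not a tuned fusion.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    session_id: str
    text: str


def embed(text):
    # Stand-in for a real embedding model: a bag of lowercase tokens.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_turns(query, turns, k):
    # Turn tier: score each turn on its own, then report its session.
    q = embed(query)
    ranked = sorted(turns, key=lambda t: cosine(q, embed(t.text)), reverse=True)
    return [t.session_id for t in ranked[:k]]


def search_sessions(query, turns, k):
    # Session tier: one vector per session, built from every turn in it.
    per_session = {}
    for t in turns:
        per_session.setdefault(t.session_id, Counter()).update(embed(t.text))
    q = embed(query)
    ranked = sorted(per_session, key=lambda s: cosine(q, per_session[s]), reverse=True)
    return ranked[:k]


def fan_out(query, turns, k):
    # Query both tiers and merge: first-seen order, duplicates dropped.
    merged = search_sessions(query, turns, k) + search_turns(query, turns, k)
    return list(dict.fromkeys(merged))


TURNS = [
    Turn("A", "postgres upgrade started"),
    Turn("A", "migration stalled overnight"),
    Turn("A", "rollback took hours"),
    Turn("B", "postgres migration tips"),
    Turn("B", "lunch felt great today"),
    Turn("B", "weather looks nice outside"),
    Turn("B", "train arrived late again"),
]
QUERY = "postgres migration rollback"

print(search_turns(QUERY, TURNS, 1))     # ['B']: one turn matches two query terms
print(search_sessions(QUERY, TURNS, 1))  # ['A']: the answer is spread over three turns
print(fan_out(QUERY, TURNS, 1))          # ['A', 'B']
```

In this toy data the query's terms are scattered across session A's turns, so no single turn in A scores well and the turn tier alone returns B. The session tier sees A as a whole and ranks it first. Reading both tiers keeps both candidates alive for whatever ranks them next.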

Metadata is retrieval, not decoration

The card-catalog metaphor breaks if you imagine a catalog as a flat list of titles. Real catalogs let you narrow before you search: by subject, by era, by format, by location. The narrowing is the work; the search inside the narrowed slice is the easy part.

For agent memory the relevant metadata is structural: who was talking, when in the timeline, in what workspace, about what topic. Embeddings answer “is this similar”; metadata answers “is this even the right kind of thing to consider.” The two operations belong at different stages of the pipeline.

The CS frame here is partition pruning, or predicate pushdown: the database trick where the query planner pushes a WHERE clause down to the partition layer so a petabyte query becomes a gigabyte query before any similarity computation runs. The same mechanic applies to retrieval: structural filters collapse the search space before similarity dominates the cost.

Workspace scoping alone, partitioning a memory store by which conversational context it came from, was a fifteen-point lift on LongMemEval for a system I work on. That’s not a model improvement. That’s a predicate improvement. In my experience, current embedding models have become close to interchangeable for this kind of work: swap one for another and scores move within run-to-run noise. What moves scores is the structure around the model.
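
A sketch of the filter-then-rank order, assuming an in-memory list and a plain dot product in place of a real vector store. The Memory fields and the search signature are hypothetical; in a database the first stage would be a WHERE clause that runs before any similarity computation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Memory:
    workspace: str
    speaker: str
    day: date
    vector: tuple
    text: str


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(memories, query_vec, k, *, workspace=None, speaker=None, since=None):
    # Stage 1: structural predicates decide what is even eligible.
    candidates = [
        m for m in memories
        if (workspace is None or m.workspace == workspace)
        and (speaker is None or m.speaker == speaker)
        and (since is None or m.day >= since)
    ]
    # Stage 2: similarity runs only over the survivors.
    candidates.sort(key=lambda m: dot(query_vec, m.vector), reverse=True)
    return candidates[:k]


MEMORIES = [
    Memory("work", "user", date(2026, 1, 5), (1.0, 0.0), "prod database notes"),
    Memory("home", "user", date(2026, 1, 6), (0.9, 0.1), "hobby project database notes"),
    Memory("home", "assistant", date(2026, 1, 7), (0.2, 0.8), "recipe ideas"),
]

unscoped = search(MEMORIES, (1.0, 0.0), k=1)
scoped = search(MEMORIES, (1.0, 0.0), k=1, workspace="home")
print(unscoped[0].text)  # prod database notes
print(scoped[0].text)    # hobby project database notes
```

The most similar vector overall sits in the wrong workspace. No embedding swap changes that; the predicate does.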

Lexical search is a real tier, not a fallback

Vector search is good at semantic similarity. It’s bad at exact tokens: names, identifiers, quoted phrases, technical jargon, anything where the answer hinges on the literal string rather than its conceptual neighborhood. Lexical retrieval is the inverse: good at exactly those tokens, indifferent to semantics.

They’re complementary, not redundant. Treating BM25 or full-text search as a backup for when vectors fail is leaving signal on the table.

The CS frame is hybrid search, which has been IR consensus for decades. Reciprocal Rank Fusion exists because sparse and dense retrievers disagree on different question shapes, and that disagreement is itself information. When both rankers agree on a candidate, that’s a strong signal. When they disagree, the disagreement tells you something about the question, something neither ranker alone can know.
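
For reference, Reciprocal Rank Fusion fits in a few lines. This sketch assumes each ranker returns an ordered list of document ids and uses k=60, the smoothing constant commonly quoted from the original RRF paper. The document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of doc ids, best first.
    # A doc's fused score is the sum of 1 / (k + rank) over the rankings it appears in.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["a", "b", "c"]   # vector tier
sparse = ["b", "d", "a"]  # lexical tier
print(reciprocal_rank_fusion([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Documents both rankers surface ("a" and "b") rise above documents only one ranker found, which is the agreement signal described above. RRF uses ranks only, so the two tiers' raw scores never need to be on the same scale.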

Concretely: switching one Postgres FTS function call from plainto_tsquery to websearch_to_tsquery was a 2.2-point lift on two of my benchmarks, from a single-line change. plainto_tsquery ANDs every stemmed term in a long natural-language query, which collapses to zero matches the moment any single token doesn’t appear in a candidate. websearch_to_tsquery is more forgiving and better-shaped for long queries. The lexical tier wasn’t broken; it was being asked the wrong question.

All three of these are findability engineering, not model engineering. The question that matters isn’t “is my embedding model the best one.” It’s “what shape do I have to give my data so that the question’s cue lands on the right thing.”

What the numbers say

The system I work on is called ghola. Here are its scores on the benchmarks I run. All were recorded on 2026-05-06; the date matters because systems improve over time.

Benchmark	Top-k	R@1	R@5	R@10	N
LongMemEval-S²	10	94.0%	99.4%	99.6%	500
LoCoMo⁴	10	76.4%	93.0%	96.5%	1,986
MemBench⁵ (subset)	5	69.1%	93.8%	—	1,100

Notes per benchmark:

LongMemEval-S: full 500 questions, MRR 0.962, mean retrieval latency 1.17s/q.
LoCoMo: all 10 conversations, 1,986 QA pairs across five question categories, session-granularity retrieval. Per-category R@10: single-hop 97.5%, multi-hop 97.2%, temporal 95.3%, open-domain 84.8%, adversarial 97.5%. Mean retrieval latency 0.88s/q.
MemBench: 100 records sampled per category × 11 categories with shuffle_seed=42 (upstream has 26,637 items total). Per-category R@5 ranges from 83.0% (post_processing) to 100.0% (lowlevel_rec, RecMultiSession). Per-category R@1 ranges from 26.0% (highlevel, the inference-heavy category) to 97.0% (RecMultiSession). Robustness against distractors (the noisy category): R@5 87.0%, R@1 74.0%.

Two notes on the LoCoMo number above. First, top-k is the integrity decision. LoCoMo conversations have between 19 and 32 sessions, so top-k=50 trivializes retrieval: k exceeds the session count, so the answer is always in the candidate pool and the score becomes a measure of the reranker’s reading comprehension rather than the library’s retrieval. The honest number is top-10, which is what’s reported. Second, the score sits above the published hybrid-only LoCoMo top-10 baseline I’m aware of: 88.9% R@10 (snapshot March 2026), against ghola’s 96.5% R@10 on the same configuration. The multi-tier pipeline (vector + lexical + per-session embedding + cross-encoder rerank) plausibly explains the lift, but it’s a single observation.
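
The top-k point is easy to check with a few lines. This sketch assumes a hit-style R@k (a question counts as recalled if any gold session is in the top k); the function names and session ids are hypothetical, and the ranker is deliberately the worst possible one.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    # Hit-style R@k for one question: 1.0 if any gold item is in the top k.
    return float(any(doc_id in gold_ids for doc_id in ranked_ids[:k]))


def mean_recall_at_k(runs, k):
    # runs: list of (ranked_ids, gold_ids) pairs, one per question.
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# A conversation with 19 sessions, and a ranker that always puts the gold one last.
sessions = [f"s{i}" for i in range(19)]
worst_case = [(sessions, {"s18"})]

print(mean_recall_at_k(worst_case, 10))  # 0.0
print(mean_recall_at_k(worst_case, 50))  # 1.0
```

At k=50 the worst possible ranker scores a perfect recall, because the cutoff is larger than the pool. At k=10 the same ranker scores zero, which is the behavior a retrieval metric should have.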

For comparison, here are public numbers I’ve recorded from other teams’ systems. Each is a snapshot. These systems improve, and the snapshots may already be stale by the time you read this.

MemPal’s verbatim-text baseline reaches 96.6% R@5 on LongMemEval-S, and their hybrid-plus-reranker variant hits 99.4% (snapshot March 2026).³ Evidence that the library half alone hits the LongMemEval ceiling. Mastra reports 94.87% on the same benchmark, but that figure is end-to-end QA accuracy with GPT-5-mini doing the answering, not retrieval recall (snapshot March 2026).⁶ Different metric, methodology mismatch. Mem0 doesn’t publish a LongMemEval number; on ConvoMem it scores 30-45% across categories, well below MemPal’s 92.9% on the same benchmark (per MemPal’s citation, snapshot March 2026).³ Evidence that LLM-fact-extraction loses information that verbatim storage keeps.

Two further notes on what these numbers do and don’t say. First, retrieval recall (R@k) is not the same as end-to-end QA accuracy. A system can land the right session at rank 1 and still produce a wrong answer when an LLM reads it, and vice versa. Several published memory-system numbers are QA accuracy, not retrieval; comparing them to R@k is a methodological mismatch that I’ve tried to flag inline above. Second, ghola’s R@5 on LongMemEval was not produced by tuning a ranker on specific failure questions, but it also wasn’t produced on a held-out split. The changes were motivated by category-level failure patterns, which is honest but not bulletproof. A held-out split is on the to-do list.

Where the library ends

For all the work that goes into making a library findable, there’s a class of question the library cannot answer no matter how good it gets. Two failure modes show up consistently in my data.

The recall miss

I keep an internal benchmark on the vercel/next.js issue corpus: thirty-two hand-picked structural-bridge cases, each an issue paired with the pull request that resolves it. The task is to retrieve the resolving PR given the issue. Five of thirty-two cases land the right PR in top-5. The other twenty-seven don’t, regardless of which embedding model, FTS query shape, or cross-encoder reranker is in the pipeline.

The CS frame for this is the recall miss in the information-retrieval sense: the candidate set returned by retrieval doesn’t contain the right document at all. Re-ranking is downstream of recall; it cannot recover what was never retrieved. This is the architectural ceiling of “more library.”
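
A toy illustration of why reranking can't fix a recall miss, assuming a hypothetical perfect scorer that already knows the answer. The document ids are made up.

```python
def rerank(candidates, scorer):
    # A reranker only reorders what retrieval handed it; it never adds documents.
    return sorted(candidates, key=scorer, reverse=True)


def oracle(doc_id):
    # Hypothetical perfect scorer: it knows the right answer is "gold-pr".
    return 1.0 if doc_id == "gold-pr" else 0.0


retrieved = ["pr-a", "pr-b", "pr-c"]  # the gold PR never made the candidate set

print("gold-pr" in rerank(retrieved, oracle))      # False
print(rerank(retrieved + ["gold-pr"], oracle)[0])  # gold-pr
```

Even an oracle reranker returns a permutation of its input. The only way the gold document reaches rank 1 is for something upstream to put it in the candidate set first.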

The shape of the failure is straightforward. The issue body and the resolving PR’s diff text are far apart in cosine. They don’t share enough vocabulary, so the embedding model can’t bridge them. The lexical tier doesn’t save it either; the technical jargon overlaps too weakly. No amount of better library makes the right document appear in the candidate set.

What does work, at least in principle, is association: knowing that the issue and the PR were activated together in some prior session, even when they don’t share surface features. Hebbian co-activation, spreading activation, retrieval expansion via association graphs. That’s not findability. That’s a different operation entirely, and it doesn’t live in the library tier.
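
To make the idea concrete, here is a rough sketch of one-hop retrieval expansion over a co-activation graph. It assumes link weights are simple co-occurrence counts per session. The class, its methods, and the ids are all hypothetical; this illustrates the operation, not a description of any existing system.

```python
from collections import Counter
from itertools import combinations


class CoActivationGraph:
    def __init__(self):
        self.weights = Counter()  # (id_a, id_b) -> times seen together

    def observe(self, activated_ids):
        # Hebbian-style update: items active in the same session strengthen their link.
        for a, b in combinations(sorted(set(activated_ids)), 2):
            self.weights[(a, b)] += 1

    def neighbors(self, item_id, min_weight=1):
        linked = []
        for (a, b), weight in self.weights.items():
            if weight >= min_weight and item_id in (a, b):
                linked.append((b if a == item_id else a, weight))
        # Strongest links first; ties broken by id for determinism.
        return [n for n, _ in sorted(linked, key=lambda p: (-p[1], p[0]))]


def expand(candidates, graph, min_weight=1):
    # One hop of expansion: append associated items the retriever did not return.
    expanded = list(candidates)
    for candidate in candidates:
        for neighbor in graph.neighbors(candidate, min_weight):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded


graph = CoActivationGraph()
graph.observe(["issue-1", "pr-9"])
graph.observe(["issue-1", "pr-9", "doc-3"])

print(expand(["issue-1"], graph))                # ['issue-1', 'pr-9', 'doc-3']
print(expand(["issue-1"], graph, min_weight=2))  # ['issue-1', 'pr-9']
```

Nothing here looks at text. The PR enters the candidate set because it was active alongside the issue before, not because it resembles it.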

The confidence gap

The other failure mode is the inverse. The library does find the right candidate, but the reranker has to choose between it and a topically-similar wrong candidate, and sometimes chooses wrong. On LongMemEval’s single-session-preference category (questions like “what does the user prefer for X”), ghola’s R@5 is 96.7% (the right session is almost always in the top-5) but R@1 is 73.3%. We pick wrong about 27% of the time when forced to commit to one.

The shape of this failure: the ground-truth session contains a specific named entity (a brand, a person’s name, a quoted phrase) that uniquely identifies the preference. The wrong-top-1 has the same conversational shape (preference language, recommendation framing, advice-giving register) but lacks the entity. Surface preference-cues dominate the reranker’s score; the entity-as-discriminator is weighted too lightly.

The CS frame is the reranking ceiling: both candidates are in the set, and the discriminator’s features don’t separate them. You can push the ceiling up some (entity-uniqueness as a multiplicative score factor, specificity scoring, better candidate-text shape), and I probably will. But the deeper unlock is dynamics. A memory that has seen the user state this preference before, across multiple sessions, would resolve the tie by accumulated confidence, not by reading harder. The reranker keeps reading the same text; the system that’s seen the text three times before doesn’t have to.

What the second half does

The cases that survive a great library are exactly the cases where memory’s other half has to do the work.

Recall miss is solved by association: co-activated memories retrieving together, the way a human reminds themselves of a thing by remembering an adjacent thing they hadn’t been asked about. Confidence gap is solved by reinforcement: memories that have been reaffirmed across sessions accumulate confidence, break ties, dominate surface noise. Crowding is solved by decay: old memories that no one returns to should weaken, so they don’t drown out the freshly relevant. Generalization is solved by consolidation: repeated experience distilled to pattern, the way semantic memory in humans is the slow residue of episodic memory.

These aren’t optional polish. In biological memory, they’re the operations that make the rest of the system work. In agent memory, they’re the operations that turn a great library into something that deserves the word memory.

Closing

The answer is in the book. That’s the part the library does. The work of the library is making the right book reachable from whatever question someone happens to ask, and that work is harder than it looks. Multiple granularities, structural metadata, lexical and dense retrieval running in parallel, query shapes matched to data shapes, no LLM standing at the door dropping signal on the way in.

The work of memory is something more. Once you have the right book on the right shelf, retrievable from the right cues, then the question of what reorganizes itself with use, what fades, what gets stronger when it’s revisited becomes tractable.

The catalog isn’t memory. But it’s half. Take a look. You’ll know which book to open.

¹ Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5), 352–373.
² LongMemEval dataset and benchmark: https://github.com/xiaowu0162/LongMemEval
³ MemPal benchmark results (snapshot March 2026): https://github.com/MemPalace/mempalace/blob/develop/benchmarks/BENCHMARKS.md
⁴ LoCoMo dataset: https://github.com/snap-research/locomo
⁵ MemBench (ACL 2025 Findings): https://aclanthology.org/2025.findings-acl.989/
⁶ Mastra observational memory results (snapshot March 2026): https://mastra.ai/research/observational-memory