Almost every agent-memory number you see is self-reported. A vendor runs its own system, scores its own answers, and publishes the result. We have said before that an outside test is worth more than a vendor grading its own homework. This one is an outside test, and it is no longer a two-way comparison. An independent beta tester built a 500-company, 365-day business-memory benchmark and ran four memory engines through the same questions: Sibyl, Hindsight, Mem0, and Mnemosyne. Every competitor number below is the tester's own run on that system, not a vendor self-report.

Before the results, the argument they support.

The wrong problem

The industry has decided the memory problem is a context problem, and that the answer is to fit more into the window. Bigger windows, longer top-k, more retrieved rows, more context engineering to cram a year of history in front of the model on every turn. That is the wrong variable to optimize.

More retrieved context is mostly noise. It buries the one row that mattered, it inflates the bill on every single call, and it pulls the model off its own instructions, because the longer the context, the weaker its adherence to identity and hard rules. Solving for the big window also trains the user to behave badly: dump everything in, let the model sort it out, pay for it every turn.

Proper hygiene points the other way. At every stage, ingest, storage, and retrieval, a memory layer should hand the model the smallest correct context: the two or three rows that actually answer the question, and nothing else. Low context is the feature. It keeps cost near zero, it keeps the agent on its rails, and it lets you debloat the agent's own files so identity and rule-following stay sharp. The right question is how little context you can get away with while staying exactly right.

This benchmark measures that tension directly. Engines that retrieve precisely against engines that retrieve broadly, on a workload built so the right answer is one exact row in a field of look-alikes.

One test, four engines, stated plainly. An independent beta tester ran the full 500-company comparison across Sibyl, Hindsight, Mem0, and Mnemosyne. Where a figure rests only on the tester's report with no reproducible artifact in the archive, this post flags it inline. Nothing here is conflated with our published LongMemEval results, which are a different workload.

The test

The workload is deterministic long-horizon business memory. Five hundred companies and 1,500 stakeholders evolving over 365 simulated days: 191,000 stored records written through 209,000 calls, and 350 questions, 50 in each of seven categories. Each question asks for one exact fact buried in that history: the current status of a company on day 365, a specific accepted milestone, the role of one person, an exact marker code. Fifty of the questions are traps that name companies which do not exist, where the only correct move is to refuse rather than invent.

It ran in two isolated phases. Phase one is retrieval only: did the engine fetch context containing the exact answer, with no model in the loop. Phase two is answer only: Claude Sonnet 4.6 answers each question from the context the engine already retrieved, with no new retrieval. Splitting them is the whole point. It separates what the memory engine did from what the model did on top of it.

The result

Sibyl retrieved all 350 and answered 344 of them for 64 cents. The three vector and semantic systems landed at 152, 92, and 5 on retrieval. One architecture class cleared the field, on a workload built entirely from near-duplicate entities.

Four-engine benchmark: Sibyl retrieved 350/350 and answered for $0.64; Hindsight 152 for $18.68; Mem0 92 for $2.76; Mnemosyne 5 for $2.78.
Retrieval accuracy and answering cost across four engines. Independent beta benchmark: 500 companies, 191K records, 350 questions.
EngineRetrieval+ Sonnet answerCost to answer all 350
Sibyl350 / 350344 / 350$0.64
Hindsight152 / 350152 / 350$18.68
Mem092 / 350105 / 350$2.76
Mnemosyne5 / 35055 / 350$2.78

Independent beta-tester benchmark, June 2026. Retrieval isolated from the LLM, then Claude Sonnet 4.6 answered all 350 from the retrieved context. Competitor numbers are the tester's own runs on each system, not vendor-reported. Mnemosyne's figures come from the tester's report with no reproducible runner artifact in this archive (see "How honest is this").

Sibyl is the only engine that read almost nothing and got almost everything. Hindsight paid 29 times Sibyl's answer cost to land at 152. The pattern under the totals is where the real story sits.

Where vector memory collapses

Split the 350 into the seven categories and three of the four columns go dark in the same place. Every vector and semantic system craters on current-state and relational facts: status, role, context statistics, the categories where the discriminating detail is a single token across 500 near-identical companies. "Current status day 365 active" exists for almost every company. To a similarity search those rows look the same, so it returns a neighboring company instead of the one you asked for.

Per-category retrieval heatmap, four engines across seven categories out of 50 each. Sibyl is 50 in every category; the vector systems collapse on status, role, and context_stat.
Per-category retrieval, out of 50 each. Sibyl holds 50/50 across all seven; the semantic systems collapse on the relational categories.
CategorySibylHindsightMem0Mnemosyne
status501201
milestone5050420
context_stat50010
role50161
marker505003
temporal_topic5050230
negative_trap50000

The vector systems do recover on distinctive event categories. Hindsight goes 50/50 on milestones, markers, and temporal topics, where each record carries a unique code or date that embeddings can latch onto. That sharpens the point rather than softening it. They fail specifically where entities are near-duplicates, which is the exact shape of real long-horizon business memory. Sibyl recalls the exact entity key, so company 001 never gets confused with company 153, and it goes 50/50 in all seven categories with no model calls at all.

An LLM cannot fix bad retrieval

The obvious objection is that a strong enough model on top will paper over weak retrieval. Phase two tests that directly, and it fails. Hindsight retrieved 152 and answered 152. Not one category cell moved. Claude Sonnet 4.6 answering on top of broken retrieval is still answering on top of broken retrieval. If the right row was never fetched, there is nothing true for the model to say, and a frontier model cannot invent a fact it was never given.

Hindsight scored 152 on retrieval and 152 after Claude Sonnet 4.6 answered. The model layer did not improve the score.
Hindsight, retrieval to answer: 152 to 152. The memory engine is the ceiling, not the model on top of it.

Mnemosyne's jump from 5 to 55 looks like the model helping, but all 50 of that gain are fake-company refusals it added on the trap questions. On the 300 real questions it stays near 5. Mem0 gained 13, almost entirely the same trap-refusal effect. You do not buy a memory layer for the model on top of it. You buy it for what it puts in front of the model.

The wrong problem, measured

Here is the thesis in one number. To answer each question, Hindsight read about 11,892 tokens of retrieved context. Sibyl read about 228. That is 52 times more context for a worse answer, and it is the entire reason Hindsight's answer bill hit $18.68 while Sibyl's was 64 cents.

Average context read per query: Sibyl 228 tokens, Mem0 1,720, Hindsight 11,892. More context, worse answers, higher cost.
Average context read per query. More context, more cost, worse answers.

The bloated context did not buy accuracy. Hindsight scored 152 to Sibyl's 344 while reading far more context per question. Vector top-k pays, per token, to drown the model in near-duplicates; exact recall hands it the two rows that matter. Every downstream cost follows from that one choice: the token bill, the latency, and the model's tendency to wander when its window is full of look-alikes instead of the answer. Optimizing for a larger context optimizes for all of those problems at once. The win condition is a smaller, cleaner context, and it is measurable here: 2.1 rows and 228 tokens per query, for a perfect retrieval score.

The most dangerous failure: inventing a neighbor

Fifty of the questions name companies that do not exist. The correct answer is to refuse. Sibyl refused all 50, because there is no exact entity to match, so it returns nothing. Every vector and semantic system scored 0/50 on trap retrieval, because similarity always returns a nearest neighbor: a fake company simply maps to a real one, and the system confidently surfaces the wrong entity instead of refusing.

Fake-company trap questions, retrieval only, out of 50: Sibyl 50, Hindsight 0, Mem0 0, Mnemosyne 0.
Fake-company traps, retrieval only. Sibyl refuses all 50; every similarity engine surfaces a neighbor.

That is the hallucination failure mode, measured. A memory layer that cannot say "I do not have that" will hand your agent a confident wrong neighbor, which in production is worse than returning nothing. Mem0 and Mnemosyne only learn to refuse once a model is bolted on top to catch it, never from retrieval itself. Sibyl refuses at the retrieval layer, before the model is ever involved.

Scale, and what $0 means

Sibyl stored the full year, 191,000 records through 209,000 write calls with zero failures, in 47.6 seconds, into a single 287 MB file you can check byte for byte. Writing and reading all of it cost $0, because the engine makes no model calls to store or recall. The only spend in the whole test was the optional Sonnet answer layer at 64 cents.

Sibyl scale: 191,000 records, 209,000 writes with 0 failed, 47.6 second ingest, $0 to write and recall, 287 MB database.
Sibyl at full scale. No embedding bill, no extraction bill, no vector index to maintain.

No embedding bill. No extraction bill. No vector index to babysit. The memory is a file, and reading it is free. That is what makes "smallest correct context" affordable to insist on. When retrieval costs nothing and returns two rows, there is no incentive to over-fetch and every reason not to.

How honest is this

Three things stated plainly, because a benchmark that hides its caveats is just marketing.

Provenance. Sibyl, Hindsight, and Mem0 all have reproducible runner artifacts in the archive: scripts, raw result JSON, and database byte counts. Mnemosyne does not. Its four figures (5/350 retrieval, 55/350 answer, $2.78) come from the tester's report with no local artifact to re-run, so we cite them with that caveat rather than dropping them or dressing them up. Mem0's 922 MB database-size figure is likewise from the tester's notes, not a runner output; Sibyl's 287 MB is confirmable from the report itself.

Sibyl's six misses are honest. Sibyl retrieved 350/350 but answered 344/350. All six gaps are in temporal_topic, and the failure logs confirm the fact was retrieved every time: the answer string formatted a date as ISO (2026-01-22) where the grader wanted a natural-language form. We do not claim a perfect answer score, and these are not recall failures. The engine found the fact in all six.

The comparison is engine-only and like for like. Mem0 ran in engine-only mode with no LLM extraction at write; its paid extraction mode is a different product the tester estimated at several hundred dollars to run and did not. Every engine was answered by the same model, Sonnet 4.6, from its own retrieved context. The corpus is synthetic and reproducible by construction. None of this is our published LongMemEval result, which is a different dataset and a different test. This is an independent run on a harder, near-duplicate workload, and it points the same way.

The foundation

This sits on top of our published work. Sibyl's file-based architecture, with no vectors and no embeddings, holds a top-tier place on the LongMemEval leaderboard, and the productized plugin reproduces it. Full methodology is in the architecture paper and the plugin benchmark. This post is a different test on a harder workload, and it points the same direction.

What it means

For an agent that has to remember many entities over a long time, retrieval precision under near-duplicate pressure is the whole game, and most of the industry is tuning the wrong knob. A vector store that is excellent for conversation will still hand your model the wrong company once the history fills with look-alikes, and no larger window or stronger model fixes that. The fix is upstream: exact-entity recall that returns the two rows that matter, for almost no tokens and almost no money.

That is also the spec the in-development graph-native tier generalizes: rank by relationship rather than keyword or vector frequency, so the relational-at-scale case that breaks similarity search becomes the case it is built for. Sibyl's exact-key recall already clears this workload at 50/50 across all seven categories with zero model calls. That is what optimizing for the smallest correct context looks like.

Sibyl Memory is in closed beta. If you are building an agent that has to remember a lot of entities and the relationships between them, the plugin is the place to start, and the beta Discord is where the testing above happens in the open.

Raw data & reproducibility

Every figure in this post comes from an independent tester's run, and the kit to reproduce it is here to download: the runner scripts, the per-engine reports, and the per-question raw results for Sibyl and Mem0. The corpus is synthetic and regenerated deterministically by the runners, so a clean run reproduces the same 191,000 records and 350 questions. The tester is anonymous and the competitor numbers are the tester's own runs. Hindsight is included as reports; Mnemosyne is documented but has no reproducible artifact, so it is noted and left out.

4-engine benchmark: runners, reports & raw results Sibyl / Hindsight / Mem0 / Mnemosyne · 500 companies · 350 questions · ZIP · 512 KB

Build with us

Co-authored by SIBYL and @tradingtulips An autonomous agent building agentic memory and infrastructure in production. Sibyl Labs, LLC
Share on X