Most agent-memory benchmarks are self-reported. A vendor runs its own system, scores its own answers, and publishes the number. In a field where teams openly accuse each other of judge-shopping and tuned configurations, that number is worth exactly what you paid for it.
So we did the opposite. We opened Sibyl Memory's closed beta and let independent testers build their own benchmarks against it. They wrote their own corpora, their own question suites, and their own runners, and published whatever they found. One tester built a 42,000-record dataset and ran Sibyl head-to-head against Honcho on it. Others pushed it to tens of thousands of records and tried to break it. We did not design these tests or supply the data. Every number below comes from a tester's machine, and every dataset and raw result is in hand and reproducible.
Why this is stronger than a self-reported benchmark: the comparison was run by an outside tester on the same corpus, the same questions, and the same answering model for both systems, each through its own official SDK. The tester shared the corpus, the 250-question suite, the runner, and the full per-question raw results for both systems, so anyone can re-run it.
Sibyl vs Honcho: same corpus, same questions, same model
A tester built a 42,000-record corpus from a simulated business: 200 companies and 600 stakeholders evolving over 180 days. From it, the tester derived a fixed 250-question suite. Each system ingested the full corpus through its own official SDK, retrieved the top 8 rows per question, and handed that context to Claude Sonnet 4.6 to answer. Same data in, same questions, same model, same scoring function applied to both.
| Metric | Sibyl | Honcho |
|---|---|---|
| Retrieval contained the answer | 243 / 250 (97.2%) | 219 / 250 (87.6%) |
| Answer correct (Sonnet 4.6) | 243 / 250 (97.2%) | 214 / 250 (85.6%) |
| Avg context retrieved per query | 291 tokens | 1,313 tokens |
| Estimated answering cost | $0.53 | $1.83 |
Sibyl answered 11.6 percentage points more questions correctly while feeding the model roughly a fifth of the context (291 tokens per query against 1,313). The context figure matters more than the accuracy gap. In this run, less context was both cheaper and more accurate: the model read a smaller, denser slice of memory on every call instead of scanning a larger pile of loosely related rows.
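For readers who want to check the headline figures, the short sketch below recomputes them from the table. The variable names are ours and purely illustrative; the numbers are copied from the table and nothing else is assumed.

```python
# Figures copied from the comparison table above.
TOTAL_QUESTIONS = 250
correct = {"Sibyl": 243, "Honcho": 214}
avg_context_tokens = {"Sibyl": 291, "Honcho": 1313}
estimated_cost_usd = {"Sibyl": 0.53, "Honcho": 1.83}


def accuracy_pct(system: str) -> float:
    """Share of the 250 questions answered correctly, in percent."""
    return 100 * correct[system] / TOTAL_QUESTIONS


# Gap in percentage points, rounded to one decimal like the table.
gap_pp = round(accuracy_pct("Sibyl") - accuracy_pct("Honcho"), 1)
context_ratio = avg_context_tokens["Sibyl"] / avg_context_tokens["Honcho"]
cost_ratio = estimated_cost_usd["Sibyl"] / estimated_cost_usd["Honcho"]

print(f"accuracy gap: {gap_pp} percentage points")  # 11.6
print(f"context ratio: {context_ratio:.2f}")        # 0.22
print(f"cost ratio: {cost_ratio:.2f}")              # 0.29
```

The cost ratio is higher than the context ratio, which is what you would expect if the estimate also includes tokens that do not shrink with retrieval, such as the question and the generated answer. That reading is our assumption; the table does not break the cost down.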
Per category
The category breakdown shows where the gap opens. Sibyl is perfect on status, milestone, marker, role, and the negative-control traps; its only misses are in one category, segment. Honcho drops 8 of 48 on status and falls furthest on segment, and that category is the subject of the relational-frontier section later in this post.
| Category | Sibyl | Honcho retrieval | Honcho + Sonnet |
|---|---|---|---|
| status | 48 / 48 | 40 / 48 | 40 / 48 |
| milestone | 48 / 48 | 48 / 48 | 48 / 48 |
| marker | 48 / 48 | 48 / 48 | 48 / 48 |
| role | 48 / 48 | 48 / 48 | 48 / 48 |
| negative controls | 10 / 10 | 10 / 10 | 10 / 10 |
| segment | 41 / 48 | 25 / 48 | 20 / 48 |
Methodology: official Honcho SDK, full corpus ingested, Honcho's own session.search retrieval, matched top-8 depth, identical question set, identical scoring. No tuning advantage to either side.
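The shape of a matched comparison like this can be sketched in a few lines. This is not the tester's runner: `MemoryBackend`, `Question`, `evaluate`, and the substring scoring rule are hypothetical names and assumptions of ours, and the toy backend and answerer stand in for the real SDK adapters and the answering model. The point it illustrates is that both systems go through the same loop, the same retrieval depth, and the same scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

TOP_K = 8  # matched retrieval depth, as in the methodology above


class MemoryBackend(Protocol):
    """Hypothetical adapter interface; each real SDK would be wrapped to fit it."""

    def search(self, query: str, k: int) -> list[str]: ...


@dataclass
class Question:
    text: str
    expected: str  # gold answer string


@dataclass
class Score:
    retrieval_hits: int = 0  # retrieved context contained the answer
    answer_hits: int = 0     # the answering model's reply contained the answer
    total: int = 0


def contains_answer(text: str, expected: str) -> bool:
    """Assumed scoring rule: case-insensitive substring match."""
    return expected.lower() in text.lower()


def evaluate(
    backend: MemoryBackend,
    questions: list[Question],
    answer_fn: Callable[[str, str], str],
) -> Score:
    """Run one backend through the shared loop and score it."""
    score = Score()
    for q in questions:
        context = "\n".join(backend.search(q.text, TOP_K))
        score.total += 1
        score.retrieval_hits += contains_answer(context, q.expected)
        score.answer_hits += contains_answer(answer_fn(q.text, context), q.expected)
    return score


class ListBackend:
    """Toy in-memory backend: returns rows sharing any word with the query."""

    def __init__(self, rows: list[str]) -> None:
        self.rows = rows

    def search(self, query: str, k: int) -> list[str]:
        words = set(query.lower().split())
        return [r for r in self.rows if words & set(r.lower().split())][:k]


def echo_answer(question: str, context: str) -> str:
    """Stand-in for the answering model: returns the context unchanged."""
    return context
```

Keeping retrieval hits and answer hits as separate counters is what lets a table report "retrieval contained the answer" and "answer correct" side by side for each system.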
Recall holds at 10x scale
A comparison on one corpus is a snapshot. The more important question for anyone running an agent in production is whether retrieval survives growth. Testers scaled the corpora 10x and re-ran the retention suites. Every write landed, every checkpoint held, and the database stayed small.
| Suite | Companies | Stakeholders | Writes | Retained | Checks | DB size |
|---|---|---|---|---|---|---|
| scale-10 small | 50 | 100 | 5,150 | 100% | 9 / 9 | 5.9 MB |
| scale-10 chronology | 80 | 240 | 11,520 | 100% | 9 / 9 | 14.6 MB |
| dense-10 small | 5 | 10 | 4,565 | 100% | 7 / 7 | 5.9 MB |
| dense-10 chronology | 8 | 24 | 9,792 | 100% | 7 / 7 | 15.3 MB |
And retrieval accuracy, Sibyl's actual job, stayed near-perfect as the corpus grew by an order of magnitude:
| Test | Corpus | Retrieval contained the answer |
|---|---|---|
| chronology answering | 8 companies · 1,152 writes | 12 / 12 |
| strict-facts answering | 8 companies · 1,152 writes | 12 / 12 |
| 10x scale answering | 80 companies · 11,520 writes | 37 / 38 |
| Honcho-comparison corpus | 42,000 records · 250 Q | 243 / 250 |
One honest caveat on these runs. Several were tagged FAIL by the tester, not because retrieval failed but because the end-to-end answering used a small, JSON-constrained model that struggled to write the answer even when the right facts were in front of it. Retrieval (12/12, 12/12, 37/38) is the part Sibyl is responsible for, and it held. When the same pipeline runs on a capable model in prose mode, as in the Honcho comparison above on Sonnet 4.6, end-to-end answering lands at 97.2%.
Hardened in the open
Accuracy is one axis. Containment is another. Testers pointed an autonomous QA agent and a set of adversarial suites at Sibyl, inside a locked-down Docker sandbox (no network, dropped capabilities, read-only mounts), specifically to break isolation, leak data, or crash it.
| Adversarial probe | Result |
|---|---|
| Search injection (FTS / column-filter breakout) | blocked |
| Path traversal via category / name | blocked |
| Prompt-injection content stays inert | blocked |
| Archive leakage | none |
| Tenant / category isolation | enforced |
| Cross-HOME isolation + migration suite | 22 / 24 |
| Runtime hardening (non-root, no-net, read-only, restart, concurrent) | 8 / 8 |
| Migration security suite | 10 / 10 |
| Autonomous QA agent (full sweep) | 88 checks · 78 passed |
The QA agent surfaced 10 issues; the isolation suite surfaced 2 (a symlink edge case); a validation-hygiene repro pack surfaced a handful more. Every one has been fixed and shipped: tighter input validation, a symlink storage guard, and error responses that no longer echo submitted values, across the published sibyl-memory-client and sibyl-memory-mcp releases. The findings, the repro packs, and the patches were all produced in the open. That is the point of an external beta: surface it, fix it, ship it.
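To make the class of fix concrete, here is a generic sketch of a storage-path guard of the kind described. It is not the code shipped in sibyl-memory-client or sibyl-memory-mcp; `safe_join` and `UnsafePathError` are hypothetical names, and the approach (resolve the path, then check it still sits under the storage root) is a common pattern we are assuming for illustration. It needs Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when a requested path would escape the storage root."""


def safe_join(root: Path, *parts: str) -> Path:
    """Join caller-supplied parts under root, rejecting traversal and symlink escapes."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks that exist on disk,
    # so the check below sees where the path really points.
    candidate = root.joinpath(*parts).resolve()
    if not candidate.is_relative_to(root):
        # The message deliberately does not echo the submitted value.
        raise UnsafePathError("path escapes storage root")
    return candidate
```

A category or name such as `../outside` resolves to a location outside the root and is rejected, and so does a path that passes through a symlink pointing elsewhere.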
The relational frontier: the tier we are building for it
One category accounts for most of the gap in the comparison (21 of the 29-question difference in correct answers), and it is the same category that produced Sibyl's only misses at scale: segment. These are questions about an entity's current relational state, asked against a long, noisy history: which segment a company sits in, who replaced whom, what supersedes what.
One detail matters most: the answer was in memory the entire time. When testers pulled the exact entity record, the current segment was right there. The miss came from ranking, not storage or recall. In keyword retrieval, a company with 180 days of chronology has hundreds of journal rows sharing the query's keywords, and they crowd the one row that holds the current relationship out of the top 8.
| Evidence path | Segment facts surfaced |
|---|---|
| Top-8 keyword search | 233 / 240 |
| Full company context | 240 / 240 |
| Exact entity recall | contains the fact |
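The crowding effect is easy to reproduce on toy data. The sketch below uses a deliberately naive term-count score, which is our assumption and not Sibyl's actual ranking function, and the rows are invented for illustration. It shows how a fact can be stored and still miss the top 8.

```python
def keyword_score(query: str, row: str) -> int:
    """Count how many times the query's terms occur in the row."""
    terms = query.lower().split()
    words = row.lower().split()
    return sum(words.count(t) for t in terms)


def top_k(query: str, rows: list[str], k: int = 8) -> list[str]:
    """Return the k highest-scoring rows; ties keep their original order."""
    return sorted(rows, key=lambda r: keyword_score(query, r), reverse=True)[:k]


# 180 daily journal rows that each repeat the query keywords twice...
journal = [
    f"day {d} acme segment review: acme segment notes discussed"
    for d in range(1, 181)
]
# ...and one row holding the current state, which mentions each keyword once.
current = "acme current segment is enterprise"
rows = journal + [current]

hits = top_k("acme segment", rows)
print(current in rows)  # True: the fact is stored
print(current in hits)  # False: it is ranked out of the top 8
```

Every journal row scores 4 and the current-state row scores 2, so the one row that answers the question never reaches the model.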
You cannot fix a ranking problem by storing more or recalling harder. You fix it by giving the system a model of how entities relate, so the row that holds a company's current owner or segment is surfaced because of its place in the relationship graph, not because it happened to repeat a keyword often enough.
A graph-native relational memory layer
We are building a new, premium tier of Sibyl Memory designed for exactly this problem. It models entities and their relationships directly as a graph and applies a graph neural network over that structure, so retrieval ranks by relevance and relationship rather than keyword frequency. The facts that live one hop from your question surface first, even buried under months of history: the current owner, the active segment, the replacement chain.
It is built for teams managing large, interconnected entity sets: portfolios, CRMs, multi-company operations, and agent fleets tracking thousands of accounts and the relationships between them. The 7 segment misses above are the spec for what comes next.
The foundation: published benchmarks
The independent results above sit on top of formal, published benchmarks. Sibyl's file-based architecture, with no vectors, no embeddings, and no external retrieval model, is #2 on the LongMemEval leaderboard and the only file-based system in the top tier. The productized plugin comes within 0.5 percentage points of that score on a cheaper model.
| Benchmark | Score | Model | Note |
|---|---|---|---|
| LongMemEval Oracle · architecture | 95.6% (#2) | Opus 4.6 | only file-based system in the top tier |
| LongMemEval Oracle · plugin | 95.1% | Sonnet 4.5 | 447/470, within 0.5pp of the Opus ceiling |
| BEAM (1M tokens) | 65.0% | n/a | conversation 1 |
| BEAM (100K tokens) | 70.0% | n/a | conversation 1 |
Full methodology for the published numbers is in the architecture paper and the plugin benchmark. The independent testing in this post does not replace those. It confirms, on someone else's machine and against a live competitor, what the leaderboard already says.
What this means
In these tests, file-based, tiered memory beat a vector-retrieval competitor on accuracy, ran on a fraction of the context and cost, held its accuracy as the corpus grew 10x, and survived an adversarial pass. All of it was verified by people who do not work for us, on tests we did not design and data anyone can re-run. The one place it has room to grow is relational ranking, and that is precisely what the next tier is being built for.
Sibyl Memory is in closed beta. If you are building an agent that has to remember a lot of entities and the relationships between them, the plugin is the place to start, and the beta Discord is where the testing above happens in the open.
Raw data & reproducibility
Every figure in this post comes from an independent tester's run, and the data is here to download. The comparison package carries both systems' per-question raw results and the runner used for the Honcho baseline. The suites package carries the 10x-scale retention and answering runs plus the segment evidence. Corpora are synthetic and tester handles are redacted. The adversarial security repro packs are held back by design and shared on request.
Build with us