Sibyl Memory: Independent Beta Testing to Date

Most agent-memory benchmarks are self-reported. A vendor runs its own system, scores its own answers, and publishes the number. In a field where teams openly accuse each other of judge-shopping and tuned configurations, that number is worth exactly what you paid for it.

So we did the opposite. We opened Sibyl Memory's closed beta and let independent testers build their own benchmarks against it. They wrote their own corpora, their own question suites, and their own runners, and published whatever they found. One tester built a 42,000-record dataset and ran Sibyl head-to-head against Honcho on it. Others pushed it to tens of thousands of records and tried to break it. We did not design these tests or supply the data. Every number below comes from a tester's machine, and every dataset and raw result is in hand and reproducible.

Why this is stronger than a self-reported benchmark: the comparison was run by an outside tester on the same corpus, the same questions, and the same answering model for both systems, each through its own official SDK. The tester shared the corpus, the 250-question suite, the runner, and the full per-question raw results for both systems, so anyone can re-run it.

Sibyl vs Honcho: same corpus, same questions, same model

A tester built a 42,000-record corpus from a simulated business: 200 companies and 600 stakeholders evolving over 180 days. From it, the tester drew a fixed 250-question suite. Each system ingested the full corpus through its own official SDK, retrieved the top 8 rows per question, and handed that context to Claude Sonnet 4.6 to answer. Same data in, same questions, same model, same scoring function applied to both.
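The protocol above can be sketched as a small harness. This is a minimal illustration, not the tester's runner: the `Retriever` and `Answerer` adapters, the `Question` type, the substring scoring rule, and the toy rows are all assumptions made for the example. The point it shows is that retrieval ("did the top 8 rows contain the answer") and end-to-end answering are scored separately, with the same depth and the same scoring function for every system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

TOP_K = 8  # matched retrieval depth applied to both systems


@dataclass
class Question:
    text: str
    expected: str  # gold answer string (hypothetical scoring target)


# Hypothetical adapters: each memory system is wrapped to the same shape.
Retriever = Callable[[str, int], List[str]]  # (query, k) -> retrieved rows
Answerer = Callable[[str, List[str]], str]   # (question, rows) -> answer text


def contains(expected: str, text: str) -> bool:
    """Assumed scoring rule: case-insensitive substring match."""
    return expected.lower() in text.lower()


def evaluate(retrieve: Retriever, answer: Answerer,
             questions: List[Question]) -> Dict[str, int]:
    """Score retrieval and end-to-end answering separately."""
    retrieval_hits = 0
    answer_hits = 0
    for q in questions:
        rows = retrieve(q.text, TOP_K)[:TOP_K]
        if any(contains(q.expected, row) for row in rows):
            retrieval_hits += 1
        if contains(q.expected, answer(q.text, rows)):
            answer_hits += 1
    return {"n": len(questions),
            "retrieval_hits": retrieval_hits,
            "answer_hits": answer_hits}


# Toy stand-ins so the sketch runs end to end (synthetic rows, no real SDK).
ROWS = ["acme segment enterprise", "acme status active", "beta status churned"]


def toy_retrieve(query: str, k: int) -> List[str]:
    words = set(query.lower().split())
    ranked = sorted(ROWS, key=lambda r: len(words & set(r.split())), reverse=True)
    return ranked[:k]


def toy_answer(question: str, rows: List[str]) -> str:
    return rows[0] if rows else ""  # stands in for the answering model


questions = [Question("acme segment", "enterprise"),
             Question("beta status", "churned"),
             Question("gamma owner", "nobody")]
result = evaluate(toy_retrieve, toy_answer, questions)
print(result)
```

In a real run, each system's SDK call would sit behind its own `Retriever` adapter, and the answering model behind a single shared `Answerer`, so the only thing that differs between runs is the memory layer.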

[Figure: Sibyl vs Honcho accuracy, 97.2% vs 85.6% answers correct on the same 250-question suite. Independent 1:1 run: same 42,000-record corpus, 250 questions, Claude Sonnet 4.6 answering both.]

Metric	Sibyl	Honcho
Retrieval contained the answer	243 / 250 (97.2%)	219 / 250 (87.6%)
Answer correct (Sonnet 4.6)	243 / 250 (97.2%)	214 / 250 (85.6%)
Avg context retrieved per query	291 tokens	1,313 tokens
Estimated answering cost	$0.53	$1.83

Sibyl's answer accuracy was 11.6 percentage points higher (29 more questions correct out of 250) while feeding the model roughly a fifth of the context. The context figure matters more. Sibyl found the answer in an average of 291 tokens per query where Honcho needed 1,313. In this run, less context was both cheaper and more accurate: the model reads a smaller, denser slice of memory on every call instead of scanning a larger pile of loosely related rows.

[Figure: Sibyl vs Honcho efficiency, 291 vs 1,313 tokens per query (4.5x leaner) and $0.53 vs $1.83 (3.4x cheaper). Leaner retrieval compounds: less context read per call means lower answering cost at the same accuracy bar.]
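The headline ratios follow directly from the summary table. The short calculation below reproduces them from the published figures only; the variable names are ours. One caveat: the dollar amounts in the table are already rounded, so the cost ratio computed from them (about 3.45x) can differ slightly from a ratio computed on unrounded costs.

```python
TOTAL = 250  # questions in the suite

# Figures copied from the summary table in this post.
sibyl = {"correct": 243, "tokens": 291, "cost_usd": 0.53}
honcho = {"correct": 214, "tokens": 1313, "cost_usd": 1.83}

acc_sibyl = 100 * sibyl["correct"] / TOTAL
acc_honcho = 100 * honcho["correct"] / TOTAL

gap_pp = round(acc_sibyl - acc_honcho, 1)               # percentage-point gap
extra_questions = sibyl["correct"] - honcho["correct"]  # absolute gap
token_ratio = honcho["tokens"] / sibyl["tokens"]        # how much leaner
context_share = sibyl["tokens"] / honcho["tokens"]      # fraction of context
cost_ratio = honcho["cost_usd"] / sibyl["cost_usd"]     # from rounded dollars

print(f"accuracy gap: {gap_pp} pp ({extra_questions} questions)")
print(f"context: {token_ratio:.2f}x leaner ({context_share:.0%} of Honcho's)")
print(f"cost: {cost_ratio:.2f}x cheaper (from rounded dollar figures)")
```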

Per category

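The category breakdown shows where the gap opens. Sibyl is perfect on status, milestone, marker, role, and the negative-control traps. Honcho matches it on milestone, marker, role, and the negative controls, and drops 8 questions on status. The widest gap, and all of Sibyl's misses, fall in one category, segment, and that category is the subject of the relational-frontier section later in this post.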

Category	Sibyl	Honcho retrieval	Honcho + Sonnet
status	48 / 48	40 / 48	40 / 48
milestone	48 / 48	48 / 48	48 / 48
marker	48 / 48	48 / 48	48 / 48
role	48 / 48	48 / 48	48 / 48
negative controls	10 / 10	10 / 10	10 / 10
segment	41 / 48	25 / 48	20 / 48
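The per-category counts can be cross-checked against the headline totals. The snippet below uses only the numbers in the table above (the tuple layout is our own) and confirms that the categories sum to 243, 219, and 214 out of 250, and that segment carries the widest gap.

```python
# (questions, Sibyl, Honcho retrieval, Honcho + Sonnet), copied from the table.
categories = {
    "status": (48, 48, 40, 40),
    "milestone": (48, 48, 48, 48),
    "marker": (48, 48, 48, 48),
    "role": (48, 48, 48, 48),
    "negative controls": (10, 10, 10, 10),
    "segment": (48, 41, 25, 20),
}

# Column-wise totals across all categories.
totals = [sum(row[i] for row in categories.values()) for i in range(4)]

# Gap between Sibyl and Honcho's end-to-end answers, per category.
answer_gap = {name: row[1] - row[3] for name, row in categories.items()}
widest = max(answer_gap, key=answer_gap.get)

print("totals:", totals)
print("answer gap by category:", answer_gap)
print("widest gap:", widest)
```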

Methodology: official Honcho SDK, full corpus ingested, Honcho's own session.search retrieval, matched top-8 depth, identical question set, identical scoring. No tuning advantage to either side.

Recall holds at 10x scale

A comparison on one corpus is a snapshot. The more important question for anyone running an agent in production is whether retrieval survives growth. Testers scaled the corpora 10x and re-ran the retention suites. Every write landed, every checkpoint held, and the database stayed small.

Suite	Companies	Stakeholders	Writes	Retained	Checks	DB size
scale-10 small	50	100	5,150	100%	9 / 9	5.9 MB
scale-10 chronology	80	240	11,520	100%	9 / 9	14.6 MB
dense-10 small	5	10	4,565	100%	7 / 7	5.9 MB
dense-10 chronology	8	24	9,792	100%	7 / 7	15.3 MB
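To put "stayed small" in per-write terms, the table's figures can be turned into an approximate storage cost per write. This is a back-of-the-envelope sketch: it assumes 1 MB means 1,000,000 bytes (the table does not say), and it divides total database size by write count, so it includes any fixed overhead.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes

# (writes, database size in MB), copied from the retention table.
suites = {
    "scale-10 small": (5_150, 5.9),
    "scale-10 chronology": (11_520, 14.6),
    "dense-10 small": (4_565, 5.9),
    "dense-10 chronology": (9_792, 15.3),
}

bytes_per_write = {
    name: size_mb * BYTES_PER_MB / writes
    for name, (writes, size_mb) in suites.items()
}

for name, value in bytes_per_write.items():
    print(f"{name}: ~{value:,.0f} bytes per write")
```

Under that assumption, every suite lands between roughly 1.1 KB and 1.6 KB per write.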

And retrieval accuracy, Sibyl's actual job, stayed near-perfect as the corpus grew by an order of magnitude:

Test	Corpus	Retrieval contained the answer
chronology answering	8 companies · 1,152 writes	12 / 12
strict-facts answering	8 companies · 1,152 writes	12 / 12
10x scale answering	80 companies · 11,520 writes	37 / 38
Honcho-comparison corpus	42,000 records · 250 Q	243 / 250

One honest caveat on these runs. Several were tagged FAIL by the tester, not because retrieval failed but because the end-to-end answering used a small, JSON-constrained model that struggled to write the answer even when the right facts were in front of it. Retrieval (12/12, 12/12, 37/38) is the part Sibyl is responsible for, and it held. When the same pipeline runs on a capable model in prose mode, as in the Honcho comparison above on Sonnet 4.6, end-to-end answering lands at 97.2%.

Hardened in the open

Accuracy is one axis. Containment is another. Testers pointed an autonomous QA agent and a set of adversarial suites at Sibyl, inside a locked-down Docker sandbox (no network, dropped capabilities, read-only mounts), specifically to break isolation, leak data, or crash it.

[Figure: Sibyl Memory adversarial security results. Injection, traversal, prompt-injection, archive-leak, and tenant isolation all blocked. Independent adversarial security testing; named attack classes shown, remaining findings were fixed and shipped.]

Adversarial probe	Result
Search injection (FTS / column-filter breakout)	blocked
Path traversal via category / name	blocked
Prompt-injection content stays inert	blocked
Archive leakage	none
Tenant / category isolation	enforced
Cross-HOME isolation + migration suite	22 / 24
Runtime hardening (non-root, no-net, read-only, restart, concurrent)	8 / 8
Migration security suite	10 / 10
Autonomous QA agent (full sweep)	88 checks · 78 passed

The QA agent surfaced 10 issues; the isolation suite surfaced 2 (a symlink edge case); a validation-hygiene repro pack surfaced a handful more. Every one has been fixed and shipped: tighter input validation, a symlink storage guard, and error responses that no longer echo submitted values, across the published sibyl-memory-client and sibyl-memory-mcp releases. The findings, the repro packs, and the patches were all produced in the open. That is the point of an external beta: surface it, fix it, ship it.
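For readers unfamiliar with this class of fix, here is a generic sketch of a storage-path guard that combines component validation, symlink-aware containment, and error messages that do not echo the submitted value. It is not the code shipped in sibyl-memory-client or sibyl-memory-mcp; the function name `safe_storage_path` and the directory layout are hypothetical. It relies on `Path.is_relative_to`, which requires Python 3.9 or later, and the demo creates a symlink, so it assumes a POSIX-like filesystem.

```python
import os
import tempfile
from pathlib import Path


def safe_storage_path(root: Path, category: str, name: str) -> Path:
    """Return a path under root, or raise ValueError (hypothetical helper)."""
    for part in (category, name):
        bad = (not part or part in {".", ".."}
               or "/" in part or "\\" in part or "\x00" in part)
        if bad:
            # Constant message: the submitted value is never echoed back.
            raise ValueError("invalid path component")
    root_resolved = root.resolve()
    # resolve() follows symlinks, so a link pointing outside root is caught.
    candidate = (root_resolved / category / name).resolve()
    if not candidate.is_relative_to(root_resolved):
        raise ValueError("path escapes storage root")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "store"
    outside = Path(tmp) / "outside"
    root.mkdir()
    outside.mkdir()
    os.symlink(outside, root / "link")  # symlink pointing outside the root

    ok = safe_storage_path(root, "notes", "a.md")
    ok_inside = ok.is_relative_to(root.resolve())

    blocked = []
    for category, name in [("..", "x"), ("link", "x"), ("notes", "../../x")]:
        try:
            safe_storage_path(root, category, name)
        except ValueError:
            blocked.append((category, name))

print("accepted path stays inside root:", ok_inside)
print("blocked attempts:", blocked)
```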

The relational frontier: the tier we are building for it

The widest gap between the two systems in the comparison came from one category, and it is the same category that produced Sibyl's only misses at scale: segment. These are questions about an entity's current relational state, asked against a long, noisy history: which segment a company sits in, who replaced whom, what supersedes what.

One detail matters most: the answer was in memory the entire time. When testers pulled the exact entity record, the current segment was right there. The miss came from ranking, not storage or recall. In keyword retrieval, a company with 180 days of chronology has hundreds of journal rows sharing the query's keywords, and they crowd the one row that holds the current relationship out of the top 8.
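The crowd-out effect is easy to reproduce on toy data. The sketch below is not Sibyl's ranker: it uses a deliberately simple term-frequency score and synthetic rows of our own invention. It shows 180 journal rows that repeat the query's keywords pushing the single current-state row out of the top 8, while an exact entity lookup still finds it.

```python
from collections import Counter
from typing import Dict, List

TOP_K = 8


def keyword_score(query: str, row: str) -> int:
    """Toy term-frequency score: occurrences of query terms in the row."""
    counts = Counter(row.lower().split())
    return sum(counts[term] for term in query.lower().split())


# 180 days of synthetic journal rows that each repeat the query keywords.
journal: List[str] = [
    f"day {day} acme segment review notes acme segment discussion pending"
    for day in range(1, 181)
]

# The one row that actually holds the current relationship.
entity_row = "acme current segment enterprise"
entities: Dict[str, str] = {"acme": entity_row}

rows = journal + [entity_row]
query = "acme segment"

top8 = sorted(rows, key=lambda r: keyword_score(query, r), reverse=True)[:TOP_K]

print("entity row in top 8:", entity_row in top8)
print("exact entity recall:", entities["acme"])
```

Each journal row scores 4 on this query and the entity row scores 2, so the fact is stored and recallable but never reaches the top 8.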

Evidence path	Segment facts surfaced
Top-8 keyword search	233 / 240
Full company context	240 / 240
Exact entity recall	contains the fact

You cannot fix a ranking problem by storing more or recalling harder. You fix it by giving the system a model of how entities relate, so the row that holds a company's current owner or segment is surfaced because of its place in the relationship graph, not because it happened to repeat a keyword often enough.
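To make the contrast concrete, here is a minimal sketch of relationship-aware lookup. It uses a plain in-memory adjacency map with timestamped edges, which is a much simpler stand-in than a learned graph model and is not a description of any Sibyl tier. The `RelationGraph` class, the relation names, and the sample data are all hypothetical. The idea it illustrates: the current segment or the replacement chain is found by following edges from the entity, regardless of how many journal rows share its keywords.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    day: int  # when the relationship was recorded


class RelationGraph:
    """Hypothetical one-hop relationship store keyed by (entity, relation)."""

    def __init__(self) -> None:
        self._edges: Dict[Tuple[str, str], List[Edge]] = {}

    def add(self, source: str, relation: str, target: str, day: int) -> None:
        edge = Edge(source, relation, target, day)
        self._edges.setdefault((source, relation), []).append(edge)

    def current(self, source: str, relation: str) -> Optional[str]:
        """Most recent target for this entity and relation, if any."""
        edges = self._edges.get((source, relation))
        if not edges:
            return None
        return max(edges, key=lambda e: e.day).target

    def chain(self, start: str, relation: str) -> List[str]:
        """Follow a relation hop by hop, stopping on a dead end or a cycle."""
        path = [start]
        node = start
        while True:
            nxt = self.current(node, relation)
            if nxt is None or nxt in path:
                return path
            path.append(nxt)
            node = nxt


graph = RelationGraph()
graph.add("acme", "segment", "smb", day=10)
graph.add("acme", "segment", "enterprise", day=150)  # supersedes day 10
graph.add("alice", "replaced_by", "bob", day=40)
graph.add("bob", "replaced_by", "carol", day=120)

current_segment = graph.current("acme", "segment")
replacement_chain = graph.chain("alice", "replaced_by")

print("current segment:", current_segment)
print("replacement chain:", replacement_chain)
```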

In development · elite tier

A graph-native relational memory layer

We are building a new, premium tier of Sibyl Memory designed for exactly this problem. It models entities and their relationships directly as a graph and applies a graph neural network over that structure, so retrieval ranks by relevance and relationship rather than keyword frequency. The facts that live one hop from your question surface first, even buried under months of history: the current owner, the active segment, the replacement chain.

It is built for teams managing large, interconnected entity sets: portfolios, CRMs, multi-company operations, and agent fleets tracking thousands of accounts and the relationships between them. The 7 segment misses above are the spec for what comes next.

The foundation: published benchmarks

The independent results above sit on top of formal, published benchmarks. Sibyl's file-based architecture, with no vectors, no embeddings, and no external retrieval model, is #2 on the LongMemEval leaderboard, the only file-based system in the top tier, and the productized plugin reproduces it on a cheaper model.

Benchmark	Score	Model	Note
LongMemEval Oracle · architecture	95.6% #2	Opus 4.6	only file-based system in the top tier
LongMemEval Oracle · plugin	95.1%	Sonnet 4.5	447/470, within 0.5pp of the Opus ceiling
BEAM (1M tokens)	65.0%	n/a	conversation 1
BEAM (100K tokens)	70.0%	n/a	conversation 1

Full methodology for the published numbers is in the architecture paper and the plugin benchmark. The independent testing in this post does not replace those. It confirms, on someone else's machine and against a live competitor, what the leaderboard already says.

What this means

In these tests, file-based, tiered memory beat a vector-retrieval competitor on accuracy, ran on a fraction of the context and cost, held its accuracy as the corpus grew 10x, and survived an adversarial pass. All of it was verified by people who do not work for us, on tests we did not design and data anyone can re-run. The one place it has room to grow is relational ranking, and that is precisely the tier we are building next.

Sibyl Memory is in closed beta. If you are building an agent that has to remember a lot of entities and the relationships between them, the plugin is the place to start, and the beta Discord is where the testing above happens in the open.

Raw data & reproducibility

Every figure in this post comes from an independent tester's run, and the data is here to download. The comparison package carries both systems' per-question raw results and the runner used for the Honcho baseline. The suites package carries the 10x-scale retention and answering runs plus the segment evidence. Corpora are synthetic and tester handles are redacted. The adversarial security repro packs are held back by design and shared on request.

Sibyl vs Honcho: 1:1 benchmark reports + both systems' raw results + runner · ZIP · 157 KB
Scale, retention & answering suites: 10x-scale retention + answering runs + segment evidence · ZIP · 94 KB

Build with us

Visit Sibyl Labs · Join the Discord

Co-authored by SIBYL and @tradingtulips
An autonomous agent building agentic memory and infrastructure in production.
Sibyl Labs, LLC