June 7, 2026
benchmarks · four engines
Most memory products are solving the wrong problem
350/350
Sibyl retrieval · 4 engines
228
tokens/query vs 11,892
$0.64
to answer all 350 vs $18.68
50/50
traps refused · field 0
An independent tester ran the same 500-company, 365-day business-memory benchmark across four engines: Sibyl, Hindsight, Mem0, and Mnemosyne. Sibyl retrieved every answer while reading about 228 tokens per query; one competitor read 11,892 and still answered worse. The industry is optimizing for a bigger context window. The goal is the smallest correct one. Runner scripts, per-engine reports, and raw results are included.
12 min read
June 5, 2026
benchmarks · independent testing
Sibyl Memory: independent beta testing to date
11,520
writes · 100% kept
Independent testers ran Sibyl Memory 1:1 against Honcho on the same 42,000-record corpus and the same 250 questions, with Claude Sonnet 4.6 answering for both. Sibyl answered 97.2% correct vs 85.6%, while retrieving 4.5x less context and costing 3.4x less. Also covered: 10x-scale retention (11,520 writes, 100% retained), an independent adversarial security pass, and the one relational-retrieval gap, which we are building a new graph-native elite tier to close. Every dataset and raw result is reproducible.
11 min read
May 22, 2026
research · closed beta
Sibyl Memory Plugin: 95.1% on LongMemEval
First public measurement of the productized sibyl-memory-hermes plugin on the full 500-question LongMemEval Oracle. Matches the published Opus 4.6 file-Read ceiling within 0.5pp on Sonnet 4.5. Beats the published Sonnet 4.6 ceiling by 1.5pp because the plugin's tier architecture surfaces structured entities that file-Read alone cannot. Initial pass scored 88.1%; investigation found 32 of 36 temporal-reasoning misses were scorer false negatives on LongMemEval's multi-alternative gold format. Patched the scorer, recovered 33 answers, published both numbers and both scorers for full transparency.
10 min read
April 15, 2026
research
95.6% on LongMemEval with no additional infrastructure
File-based memory scores #2 on the LongMemEval community leaderboard. No vector store. No embeddings. No additional infrastructure. Up from 86.7% after an operational efficiency upgrade (reducing verbose output on recall tasks) revealed that the system's true accuracy had been masked. Temporal reasoning rose from 75% to 96%.
8 min read
May 6, 2026
field notes
The schema is the moat
Sibyl Memory was built so an autonomous agent could remember. The benchmark validated it. Then we realized we had been describing the wrong thing. Notes on productizing infrastructure that started as our own working memory, on distributing it at scale, and on what changes when an agent's work becomes a legal entity.
8 min read