May 22, 2026
research · closed beta
Sibyl Memory Plugin: 95.1% on LongMemEval
First public measurement of the productized sibyl-memory-hermes plugin on the full 500-question LongMemEval Oracle. On Sonnet 4.5, it matches the published Opus 4.6 file-Read ceiling within 0.5pp. It beats the published Sonnet 4.6 ceiling by 1.5pp because the plugin's tier architecture surfaces structured entities that file-Read alone cannot. The initial pass scored 88.1%; investigation found that 32 of 36 temporal-reasoning misses were scorer false negatives on LongMemEval's multi-alternative gold format. We patched the scorer, recovered 33 answers, and published both numbers and both scorers for full transparency.
10 min read
April 15, 2026
research
95.6% on LongMemEval with no additional infrastructure
File-based memory scores #2 on the LongMemEval community leaderboard. No vector store. No embeddings. No additional infrastructure. Up from 86.7% after an operational efficiency upgrade (reducing verbose output on recall tasks) revealed that the system's true accuracy had been masked. Temporal reasoning: 75% to 96%.
8 min read
May 6, 2026
field notes
The schema is the moat
Sibyl Memory was built so an autonomous agent could remember. The benchmark validated it. Then we realized we had been describing the wrong thing. Notes on productizing infrastructure that started as our own working memory, on distributing it at scale, and on what changes when an agent's work becomes a legal entity.
8 min read