SIBYL Blog — Agent Memory Research, Benchmarks, Field Notes

July 6, 2026 release · hardening

Aegis: Agent Memory, Hardened for Production

5

packages, one release

35

findings closed

1,005

tests · 0 failures

0.1.0

LangGraph adapter

A coordinated hardening release across every Sibyl Memory package: client, CLI, Hermes, MCP, plus the first public LangGraph adapter. It closes all 35 findings from a hardening pass: a 10-lens adversarial audit spanning reliability, multi-tenant correctness, privacy, and security. Most of it is defense-in-depth: no API changes, no config changes, every fix free on upgrade. SibylStore brings the same local SQLite and FTS5 engine to LangGraph natively.

6 min read

June 7, 2026 benchmarks · four engines

Most memory products are solving the wrong problem

350/350

Sibyl retrieval · 4 engines

228

tokens/query vs 11,892

$0.64

to answer all 350 vs $18.68

50/50

traps refused · field 0

An independent tester ran the same 500-company, 365-day business-memory benchmark across four engines: Sibyl, Hindsight, Mem0, and Mnemosyne. Sibyl retrieved every answer reading about 228 tokens per query; one competitor read 11,892 and still answered worse. The industry is optimizing for a bigger context window. The goal is the smallest correct one. Runner scripts, per-engine reports, and raw results included.
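The gap between the quoted figures is easier to judge as ratios. The sketch below only redoes the arithmetic on the numbers quoted above; the variable names are illustrative, and it assumes nothing about which engine the comparison figures belong to, since the summary does not say whether the 11,892-token and $18.68 figures come from the same competitor.

```python
# Figures quoted in the summary and stat tiles above.
SIBYL_TOKENS_PER_QUERY = 228
COMPARISON_TOKENS_PER_QUERY = 11_892
SIBYL_TOTAL_COST_USD = 0.64
COMPARISON_TOTAL_COST_USD = 18.68
QUESTIONS = 350  # number of questions answered in the run

# How many times more context the comparison engine read per query.
token_ratio = COMPARISON_TOKENS_PER_QUERY / SIBYL_TOKENS_PER_QUERY

# How many times more the comparison run cost in total.
cost_ratio = COMPARISON_TOTAL_COST_USD / SIBYL_TOTAL_COST_USD

# Average cost per answered question for each run.
sibyl_cost_per_query = SIBYL_TOTAL_COST_USD / QUESTIONS
comparison_cost_per_query = COMPARISON_TOTAL_COST_USD / QUESTIONS

print(f"context ratio: {token_ratio:.1f}x")
print(f"cost ratio:    {cost_ratio:.1f}x")
print(f"cost per query: ${sibyl_cost_per_query:.4f} vs ${comparison_cost_per_query:.4f}")
```

On these figures the comparison engine read roughly 52 times more context per query, and the full run cost roughly 29 times more.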

12 min read

June 5, 2026 benchmarks · independent testing

Sibyl Memory: independent beta testing to date

97.2%

vs Honcho 85.6%

4.5×

less context

3.4×

cheaper

11,520

writes · 100% kept

Independent testers ran Sibyl Memory head-to-head against Honcho on the same 42,000-record corpus and the same 250 questions, with Claude Sonnet 4.6 answering for both. Sibyl answered 97.2% correctly vs 85.6%, while retrieving 4.5× less context and costing 3.4× less. The post also covers 10×-scale retention (11,520 writes, 100% retained), an independent adversarial security pass, and the one relational-retrieval gap we are building a new graph-native elite tier to close. Every dataset and raw result is reproducible.

11 min read

May 22, 2026 research · closed beta

Sibyl Memory Plugin: 95.1% on LongMemEval

95.1%

Sonnet 4.5

−0.5pp

vs Opus ceiling

500

Q · 0 errors

$43.78

total cost

First public measurement of the productized sibyl-memory-hermes plugin on the full 500-question LongMemEval Oracle. Matches the published Opus 4.6 file-Read ceiling within 0.5pp on Sonnet 4.5. Beats the published Sonnet 4.6 ceiling by 1.5pp because the plugin's tier architecture surfaces structured entities that file-Read alone cannot. Initial pass scored 88.1%; investigation found 32 of 36 temporal-reasoning misses were scorer false negatives on LongMemEval's multi-alternative gold format. Patched the scorer, recovered 33 answers, published both numbers and both scorers for full transparency.
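The false-negative pattern described above, where a gold answer lists several acceptable alternatives and the scorer checks only one, can be illustrated with a minimal sketch. Everything here is hypothetical: the function names, the representation of a gold answer as a list of strings, and the normalisation rule are assumptions for illustration, not the benchmark's format or the scorer used in the post.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def contains_phrase(prediction: str, phrase: str) -> bool:
    """Whole-word phrase match, so '3 weeks' does not match '13 weeks'."""
    return f" {normalize(phrase)} " in f" {normalize(prediction)} "

def strict_match(prediction: str, gold_alternatives: list[str]) -> bool:
    """Hypothetical naive scorer: checks only the first gold alternative."""
    return contains_phrase(prediction, gold_alternatives[0])

def any_alternative_match(prediction: str, gold_alternatives: list[str]) -> bool:
    """Hypothetical patched scorer: accepts any listed gold alternative."""
    return any(contains_phrase(prediction, alt) for alt in gold_alternatives)

# Illustrative data only, not taken from the benchmark.
gold = ["21 days", "3 weeks"]
prediction = "About 3 weeks."

print(strict_match(prediction, gold))           # naive scorer marks it wrong
print(any_alternative_match(prediction, gold))  # patched scorer accepts it
```

Under these assumptions the naive scorer rejects a correct answer phrased as the second alternative, which is the kind of miss a scorer patch would recover.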

10 min read

Benchmarks, architecture, and operational findings
