The result
LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems, published at ICLR 2025 by researchers at UCLA. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.
SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same file-based memory architecture with zero external infrastructure. No vector store. No embeddings. No retrieval pipeline. The model reads structured files directly.
This places SIBYL at #2 on the community leaderboard, tied with Chronos by PwC (95.6%) and behind only agentmemory V4 (96.2%). Every other system in the top 10 uses vector stores, embeddings, or hybrid retrieval pipelines.
Community leaderboard
Self-reported results. No official leaderboard exists. Judges and generator models vary across entries.
| # | System | Score | Architecture |
|---|---|---|---|
| 1 | agentmemory V4 | 96.2% | BM25 + vector hybrid |
| 2 | SIBYL (Opus) | 95.6% | Hierarchical file memory |
| 2 | Chronos (PwC) | 95.6% | unknown |
| 4 | Mastra Observational Memory | 94.9% | Vector + LLM extraction |
| 5 | SIBYL (Sonnet) | 93.6% | Hierarchical file memory |
| 6 | Backboard | 93.4% | unknown |
| 7 | OMEGA | 93.2% | bge-small ONNX vectors |
| 8 | Hindsight (Vectorize) | 91.4% | Semantic + BM25 hybrid |
| 9 | HydraDB | 90.8% | closed |
| 10 | Appleseed Memory | 90.2% | open |
| 11 | Neutrally | 89.4% | unknown |
| 12 | sociomemory | 86.6% | 10-step RAG |
| 13 | Emergence AI | 86.0% | RAG |
| 14 | Supermemory | 85.9% | Cloud embeddings |
| — | Full-context GPT-4o (baseline) | 60.2% | Entire history in context |
Per-category breakdown
LongMemEval tests six question categories. The table below shows the progression from the v1 baseline (Sonnet) to the v2 Sonnet and v2 Opus runs.
| Category | v1 Sonnet | v2 Sonnet | v2 Opus | n |
|---|---|---|---|---|
| single-session-user | 95.7% | 100% | 100% | 70 |
| single-session-assistant | 92.9% | 100% | 100% | 56 |
| temporal-reasoning | 75.2% | 94.7% | 96.2% | 133 |
| knowledge-update | 94.9% | 96.2% | 92.3% | 78 |
| multi-session | 90.1% | 88.0% | 93.2% | 133 |
| single-session-preference | 70.0% | 80.0% | 93.3% | 30 |
| Overall | 86.7% | 93.6% | 95.6% | 500 |
What changed between v1 and v2
v1 scored 86.7%. The weakest category was temporal reasoning at 75.2%. During routine operational review, we noticed the agent was producing unnecessarily verbose output for simple recall tasks. A question like "what is your cat's name?" would return three sentences of context before the answer. Those are wasted tokens in a production system where every call has a cost.
We constrained the agent to produce concise, direct answers without reasoning traces, preambles, or hedging. The goal was efficiency: reduce token usage on retrieval tasks where the answer is a single fact. The same memory. The same architecture. Just tighter output discipline.
When we re-ran the benchmark after this change, the score jumped. It turned out the verbose output had been causing false negatives in the scorer. A question asks "how many days between event A and event B?" The gold answer is "30 days." The model responds with a full chain of date arithmetic and concludes "approximately 30 days, though 31 would also be acceptable if counting inclusively." That answer is correct. The scorer flagged it as wrong because the reasoning noise obscured the match.
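The failure mode is easy to reproduce with a toy scorer. The sketch below (not the actual evaluation pipeline; the gold string and answers are illustrative) shows how a strict comparison penalizes a verbose but correct answer that a more lenient check would accept:

```python
def exact_match(answer: str, gold: str) -> bool:
    """Strict scoring: the answer must equal the gold string."""
    return answer.strip().lower() == gold.strip().lower()

def lenient_match(answer: str, gold: str) -> bool:
    """Lenient scoring: the gold string need only appear in the answer."""
    return gold.strip().lower() in answer.strip().lower()

gold = "30 days"
verbose = ("Counting from the first event to the second gives a full month, "
           "so approximately 30 days, though 31 would also be acceptable "
           "if counting inclusively.")
concise = "30 days"

print(exact_match(verbose, gold))    # False: reasoning noise breaks strict scoring
print(lenient_match(verbose, gold))  # True: the correct fact is present
print(exact_match(concise, gold))    # True: concise output matches directly
```

The model's knowledge is identical in both answers; only the wrapping differs, and the wrapping is what the strict scorer sees.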
The operational lesson: we did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect. When your system is already producing correct answers but wrapping them in unnecessary reasoning, you are paying for tokens that add no value and actively interfere with downstream evaluation.
Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The memory architecture was already performing at this level. The evaluation just could not see it through the noise.
Architecture
SIBYL's memory is a file-based system built from operational necessity over 50+ days of continuous autonomous operation. It was not designed in a lab for a benchmark. It manages real financial positions, advisory relationships, and operational state on Base.
The system uses no vectors, no embeddings, no retrieval model, and no paid infrastructure. The LLM reads structured files directly. Updates are instant (edit the file). The entire memory is portable with `cp -r`. No vendor lock-in to any embedding model or database.
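As a minimal sketch of the pattern (the file names, layout, and helper functions here are hypothetical, not SIBYL's actual schema), reading file-based memory is just assembling file contents into the model's context, and updating it is a plain file write:

```python
from pathlib import Path

def load_memory(memory_dir: str, topics: list[str]) -> str:
    """Assemble prompt context by reading structured memory files directly.

    No embeddings, no vector store, no retrieval model: the LLM sees
    the files as-is. Missing topics are simply skipped.
    """
    sections = []
    root = Path(memory_dir)
    for topic in topics:
        path = root / f"{topic}.md"
        if path.exists():
            sections.append(f"## {topic}\n{path.read_text()}")
    return "\n\n".join(sections)

def remember(memory_dir: str, topic: str, fact: str) -> None:
    """Updating memory is an append to a file; there is no re-indexing step."""
    path = Path(memory_dir) / f"{topic}.md"
    with path.open("a") as f:
        f.write(fact + "\n")
```

Because the state is plain files, backup, migration, and model swaps need no tooling beyond the filesystem itself.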
The full architecture is available as a product. Teams building autonomous agents can license the memory infrastructure through SIBYL's agent framework, which ships this system as a core module. One purchase, unlimited agents. The architecture is LLM-agnostic: any model that can read files can use it.
| | SIBYL | Typical vector system |
|---|---|---|
| Infrastructure | None. Files on disk. | Vector DB + embedding API |
| Additional infra cost | None | $19 to $249+/mo |
| Update latency | Instant (file edit) | Re-embed chunks |
| Portability | Any system, any LLM | Locked to embedding model |
| LongMemEval | 95.6% | 60-91% |
What this does not prove
This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology. It is worth stating clearly what it does not establish.
- No official leaderboard exists. Community results are self-reported with varying judges and generator models. Direct comparison across entries carries caveats.
- Judge model matters. SIBYL uses Claude as both the answering model and the evaluation judge for preference questions. Other entries use GPT-4o or GPT-4o-mini. Judge choice affects scores.
- Benchmark is not production. SIBYL's architecture was built for production use and happens to benchmark well. The reverse (benchmark-first design) is a different optimization target.
- Output format affects scoring. As documented above, our v1-to-v2 jump was largely a scoring artifact. Verbose answers that were substantively correct were being penalized. Other systems on this leaderboard may have similar hidden accuracy that their evaluation pipeline is not capturing.
We publish the raw data. Anyone can re-score.
Test conditions
| | |
|---|---|
| Dataset | LongMemEval Oracle (ICLR 2025, UCLA) |
| Questions | 500 total, 6 categories |
| Models | Claude Opus 4.6 (v2), Claude Sonnet (v2) |
| Hardware | 4 vCPU / 16GB RAM (AWS) |
| Architecture | SIBYL file-based memory (proprietary) |
| v1 baseline | Claude Sonnet, verbose output format |
| v2 upgrade | Both models, concise output format (reduced false negatives) |
| Scoring | Programmatic v3 matcher (substring, number, off-by-one tolerance, abstention detection, phrase overlap) with manual review of every flagged incorrect answer. Preference questions judged using official LongMemEval rubric criteria. |
| Raw data | All hypotheses and per-question judgments published below. |
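A simplified sketch of a multi-strategy matcher of this kind (this is an illustration of the listed strategies, not the proprietary v3 matcher; phrase overlap is omitted and the tolerance values are assumptions):

```python
import re

# Markers suggesting the model declined to answer; an abstention never matches.
ABSTENTION_MARKERS = ("i don't know", "no information", "not mentioned")

def _numbers(text: str) -> list[float]:
    """Extract all numeric values from a string."""
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]

def score(answer: str, gold: str) -> bool:
    """Match an answer against a gold string using layered strategies:
    abstention detection, substring match, then numeric off-by-one tolerance.
    """
    a, g = answer.lower().strip(), gold.lower().strip()
    # 1. Abstention detection: a refusal is always incorrect.
    if any(m in a for m in ABSTENTION_MARKERS):
        return False
    # 2. Substring: the gold answer appears verbatim in the response.
    if g in a:
        return True
    # 3. Numeric with off-by-one tolerance (inclusive vs. exclusive counting).
    gold_nums = _numbers(g)
    if gold_nums:
        return any(abs(x - y) <= 1 for x in _numbers(a) for y in gold_nums)
    return False
```

Answers the matcher still flags as incorrect are then the candidates for manual review, which keeps the human workload limited to genuine disagreements.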
Sources
- LongMemEval (ICLR 2025)
- Observational Memory (Mastra)
- Emergence AI
- Hindsight Benchmark
- Mem0 State of Agent Memory 2026
Raw data
Every model answer and every scoring judgment is published here. Re-score us yourself.