The result

LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems, published at ICLR 2025 by researchers at UCLA and Tencent AI Lab. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.

SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same file-based memory architecture with zero external infrastructure. No vector store. No embeddings. No retrieval pipeline. The model reads structured files directly.

This places SIBYL at #2 on the community leaderboard, behind only agentmemory V4 (96.2%) and tied with Chronos by PwC (95.6%). Every other disclosed architecture in the top 10 uses vector stores, embeddings, or hybrid retrieval pipelines.

Community leaderboard

Self-reported results. No official leaderboard exists. Judges and generator models vary across entries.

| # | System | Score | Architecture |
|---|--------|-------|--------------|
| 1 | agentmemory V4 | 96.2% | BM25 + vector hybrid |
| 2 | SIBYL (Opus) | 95.6% | Hierarchical file memory |
| 2 | Chronos (PwC) | 95.6% | unknown |
| 4 | Mastra Observational Memory | 94.9% | Vector + LLM extraction |
| 5 | SIBYL (Sonnet) | 93.6% | Hierarchical file memory |
| 6 | Backboard | 93.4% | unknown |
| 7 | OMEGA | 93.2% | bge-small ONNX |
| 8 | Hindsight (Vectorize) | 91.4% | Semantic + BM25 hybrid |
| 9 | HydraDB | 90.8% | closed |
| 10 | Appleseed Memory | 90.2% | open |
| 11 | Neutrally | 89.4% | unknown |
| 12 | sociomemory | 86.6% | 10-step RAG |
| 13 | Emergence AI | 86.0% | RAG |
| 14 | Supermemory | 85.9% | Cloud embeddings |
| — | Full-context GPT-4o (baseline) | 60.2% | Entire history in context |

Per-category breakdown

LongMemEval tests six question categories. The chart shows progression: v1 baseline (gray), v2 Sonnet (blue), v2 Opus (gold).

[Chart: per-category scores for v1 Sonnet, v2 Sonnet, and v2 Opus; the same data appears in the table below.]
| Category | v1 Sonnet | v2 Sonnet | v2 Opus | n |
|----------|-----------|-----------|---------|---|
| single-session-user | 95.7% | 100% | 100% | 70 |
| single-session-assistant | 92.9% | 100% | 100% | 56 |
| temporal-reasoning | 75.2% | 94.7% | 96.2% | 133 |
| knowledge-update | 94.9% | 96.2% | 92.3% | 78 |
| multi-session | 90.1% | 88.0% | 93.2% | 133 |
| single-session-preference | 70.0% | 80.0% | 93.3% | 30 |
| Overall | 86.7% | 93.6% | 95.6% | 500 |
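As a sanity check, the overall score is just the question-weighted average of the per-category scores. For the v2 Opus column:

```python
# Question-weighted average of the v2 Opus per-category scores (from the table above).
categories = {
    "single-session-user": (100.0, 70),
    "single-session-assistant": (100.0, 56),
    "temporal-reasoning": (96.2, 133),
    "knowledge-update": (92.3, 78),
    "multi-session": (93.2, 133),
    "single-session-preference": (93.3, 30),
}

total_n = sum(n for _, n in categories.values())
overall = sum(score * n for score, n in categories.values()) / total_n
print(f"{overall:.1f}%")  # -> 95.6%, matching the reported overall
```

The two largest categories (temporal-reasoning and multi-session, 133 questions each) dominate the weighted average, which is why the temporal-reasoning recovery between v1 and v2 moved the overall score so much.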

What changed between v1 and v2

v1 scored 86.7%. The weakest category was temporal reasoning at 75.2%. During routine operational review, we noticed the agent was producing unnecessarily verbose output for simple recall tasks. A question like "what is your cat's name?" would return three sentences of context before the answer. Those are wasted tokens on a production system where every call has a cost.

We constrained the agent to produce concise, direct answers without reasoning traces, preambles, or hedging. The goal was efficiency: reduce token usage on retrieval tasks where the answer is a single fact. The same memory. The same architecture. Just tighter output discipline.

When we re-ran the benchmark after this change, the score jumped. It turned out the verbose output had been causing false negatives in the scorer. A question asks "how many days between event A and event B?" The gold answer is "30 days." The model responds with a full chain of date arithmetic and concludes "approximately 30 days, though 31 would also be acceptable if counting inclusively." That answer is correct. The scorer flagged it as wrong because the reasoning noise obscured the match.
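The failure mode can be illustrated with a deliberately naive scorer (this is a sketch, not the actual v3 matcher): if a scorer compares the first number it finds in the answer against the gold answer, the concise response passes while the verbose one fails, because the date arithmetic introduces earlier numbers.

```python
import re

def first_number_match(answer: str, gold: str) -> bool:
    """Naive scorer sketch: compare the first number found in each string.
    This is NOT the v3 matcher; it only illustrates how reasoning noise
    ahead of the final answer can break a simple numeric comparison."""
    a = re.search(r"-?\d+", answer)
    g = re.search(r"-?\d+", gold)
    return bool(a and g and a.group() == g.group())

gold = "30 days"
concise = "30 days."
verbose = ("From June 1 to July 1: June has 30 days, so counting exclusively "
           "gives approximately 30 days, though 31 would also be acceptable "
           "if counting inclusively.")

print(first_number_match(concise, gold))  # True: the answer leads with 30
print(first_number_match(verbose, gold))  # False: the '1' in 'June 1' is found first
```

A more lenient matcher tolerates some of this, but every layer of reasoning noise widens the gap between "correct answer" and "answer the scorer can see."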

The operational lesson: we did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect. When your system is already producing correct answers but wrapping them in unnecessary reasoning, you are paying for tokens that add no value and actively interfere with downstream evaluation.

Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The memory architecture was already performing at this level. The evaluation just could not see it through the noise.

Architecture

SIBYL's memory is a file-based system built from operational necessity over 50+ days of continuous autonomous operation. It was not designed in a lab for a benchmark. It manages real financial positions, advisory relationships, and operational state on Base.

The system uses no vectors, no embeddings, no retrieval model, and no paid infrastructure. The LLM reads structured files directly. Updates are instant (edit the file). The entire memory is portable with cp -r. No vendor lock-in to any embedding model or database.
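A minimal sketch of the pattern (the file layout and function here are hypothetical; SIBYL's actual schema is proprietary): memory lives in plain files under one directory, and "retrieval" is just reading them into the prompt.

```python
from pathlib import Path

def load_memory(root: str) -> str:
    """Concatenate all markdown memory files under `root` into one context
    block for the LLM. Hypothetical layout; the real schema is proprietary."""
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*.md")):
        # Use the relative path as a section header so the model sees
        # where each fact lives in the hierarchy.
        parts.append(f"## {path.relative_to(root_path)}\n{path.read_text()}")
    return "\n\n".join(parts)

# An update is a file edit; a backup or migration is `cp -r memory/ dest/`.
```

There is no index to rebuild: editing a file changes what the model sees on its next read, which is where the "instant update" property in the comparison table below comes from.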

The full architecture is available as a product. Teams building autonomous agents can license the memory infrastructure through SIBYL's agent framework, which ships this system as a core module. One purchase, unlimited agents. The architecture is LLM-agnostic: any model that can read files can use it.

| | SIBYL | Typical vector system |
|---|---|---|
| Infrastructure | None. Files on disk. | Vector DB + embedding API |
| Additional infra cost | None | $19 to $249+/mo |
| Update latency | Instant (file edit) | Re-embed chunks |
| Portability | Any system, any LLM | Locked to embedding model |
| LongMemEval | 95.6% | 60–91% |

What this does not prove

This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology; it establishes nothing beyond that, and self-reported results should be read with the caveats noted on the leaderboard.

We publish the raw data. Anyone can re-score.

Test conditions

| Condition | Detail |
|---|---|
| Dataset | LongMemEval Oracle (ICLR 2025, UCLA / Tencent AI Lab) |
| Questions | 500 total, 6 categories |
| Models | Claude Opus 4.6 (v2), Claude Sonnet (v2) |
| Hardware | 4 vCPU / 16 GB RAM (AWS) |
| Architecture | SIBYL file-based memory (proprietary) |
| v1 baseline | Claude Sonnet, verbose output format |
| v2 upgrade | Both models, concise output format (reduced false negatives) |
| Scoring | Programmatic v3 matcher (substring, number, off-by-one tolerance, abstention detection, phrase overlap) with manual review of every flagged incorrect answer. Preference questions judged using official LongMemEval rubric criteria. |
| Raw data | All hypotheses and per-question judgments published below. |

Sources

Raw data

Every model answer and every scoring judgment is published here. Re-score us yourself.

- Opus answers: hypotheses-opus.jsonl
- Sonnet answers: hypotheses-sonnet.jsonl
- Opus judgments: scores-opus.jsonl
- Sonnet judgments: scores-sonnet.jsonl
- Opus summary: scores-opus-summary.json
- Sonnet summary: scores-sonnet-summary.json