The result
LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems, published at ICLR 2025 by researchers at UCLA. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations.
SIBYL scored 95.6% using Claude Opus 4.6 and 93.6% using Claude Sonnet. Both runs used the same file-based memory architecture with zero external infrastructure. No vector store. No embeddings. No retrieval pipeline. The model reads structured files directly.
This places SIBYL at #2 on the community leaderboard, tied with Chronos by PwC (95.6%) and behind only agentmemory V4 (96.2%). Every other system in the top 10 uses vector stores, embeddings, or hybrid retrieval pipelines.
Community leaderboard
Self-reported results. No official leaderboard exists. Judges and generator models vary across entries.
| # | System | Score | Architecture |
|---|---|---|---|
| 1 | agentmemory V4 | 96.2% | BM25 + vector hybrid |
| 2 | SIBYL (Opus) | 95.6% | Hierarchical file memory |
| 2 | Chronos (PwC) | 95.6% | unknown |
| 4 | Mastra Observational Memory | 94.9% | Vector + LLM extraction |
| 5 | SIBYL (Sonnet) | 93.6% | Hierarchical file memory |
| 6 | Backboard | 93.4% | unknown |
| 7 | OMEGA | 93.2% | bge-small ONNX vectors |
| 8 | Hindsight (Vectorize) | 91.4% | Semantic + BM25 hybrid |
| 9 | HydraDB | 90.8% | closed |
| 10 | Appleseed Memory | 90.2% | open |
| 11 | Neutrally | 89.4% | unknown |
| 12 | sociomemory | 86.6% | 10-step RAG |
| 13 | Emergence AI | 86.0% | RAG |
| 14 | Supermemory | 85.9% | Cloud embeddings |
| — | Full-context GPT-4o (baseline) | 60.2% | Entire history in context |
Per-category breakdown
LongMemEval tests six question categories. The table below shows the progression from the v1 baseline (Sonnet) to the v2 Sonnet and v2 Opus runs.
| Category | v1 Sonnet | v2 Sonnet | v2 Opus | n |
|---|---|---|---|---|
| single-session-user | 95.7% | 100% | 100% | 70 |
| single-session-assistant | 92.9% | 100% | 100% | 56 |
| temporal-reasoning | 75.2% | 94.7% | 96.2% | 133 |
| knowledge-update | 94.9% | 96.2% | 92.3% | 78 |
| multi-session | 90.1% | 88.0% | 93.2% | 133 |
| single-session-preference | 70.0% | 80.0% | 93.3% | 30 |
| Overall | 86.7% | 93.6% | 95.6% | 500 |
What changed between v1 and v2
v1 scored 86.7%. The weakest category was temporal reasoning at 75.2%. During routine operational review, we noticed the agent was producing unnecessarily verbose output for simple recall tasks. A question like "what is your cat's name?" would return three sentences of context before the answer. Those are wasted tokens in a production system where every call has a cost.
We constrained the agent to produce concise, direct answers without reasoning traces, preambles, or hedging. The goal was efficiency: reduce token usage on retrieval tasks where the answer is a single fact. The same memory. The same architecture. Just tighter output discipline.
When we re-ran the benchmark after this change, the score jumped. It turned out the verbose output had been causing false negatives in the scorer. A question asks "how many days between event A and event B?" The gold answer is "30 days." The model responds with a full chain of date arithmetic and concludes "approximately 30 days, though 31 would also be acceptable if counting inclusively." That answer is correct. The scorer flagged it as wrong because the reasoning noise obscured the match.
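The failure mode is easy to reproduce with a toy scorer. The sketch below (not the actual evaluation pipeline; the gold string and answers are illustrative) shows how a strict comparison penalizes a verbose but correct answer that a more lenient check would accept:

```python
def exact_match(answer: str, gold: str) -> bool:
    """Strict scoring: the answer must equal the gold string."""
    return answer.strip().lower() == gold.strip().lower()

def lenient_match(answer: str, gold: str) -> bool:
    """Lenient scoring: the gold string need only appear in the answer."""
    return gold.strip().lower() in answer.strip().lower()

gold = "30 days"
verbose = ("Counting from the first event to the second gives a full month, "
           "so approximately 30 days, though 31 would also be acceptable "
           "if counting inclusively.")
concise = "30 days"

print(exact_match(verbose, gold))    # False: reasoning noise breaks strict scoring
print(lenient_match(verbose, gold))  # True: the correct fact is present
print(exact_match(concise, gold))    # True: concise output matches directly
```

The model's knowledge is identical in both answers; only the wrapping differs, and the wrapping is what the strict scorer sees.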
The operational lesson: we did not optimize for the benchmark. We optimized for production efficiency. The benchmark improvement was a side effect. When your system is already producing correct answers but wrapping them in unnecessary reasoning, you are paying for tokens that add no value and actively interfere with downstream evaluation.
Temporal reasoning went from 75.2% to 96.2%. The overall score went from 86.7% to 95.6%. The memory architecture was already performing at this level. The evaluation just could not see it through the noise.
Architecture
SIBYL's memory is a file-based system built from operational necessity over 50+ days of continuous autonomous operation. It was not designed in a lab for a benchmark. It manages real financial positions, advisory relationships, and operational state on Base.
The system uses no vectors, no embeddings, no retrieval model, and no paid infrastructure. The LLM reads structured files directly. Updates are instant (edit the file). The entire memory is portable with `cp -r`. No vendor lock-in to any embedding model or database.
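As a minimal sketch of the pattern (the file names, layout, and helper functions here are hypothetical, not SIBYL's actual schema), reading file-based memory is just assembling file contents into the model's context, and updating it is a plain file write:

```python
from pathlib import Path

def load_memory(memory_dir: str, topics: list[str]) -> str:
    """Assemble prompt context by reading structured memory files directly.

    No embeddings, no vector store, no retrieval model: the LLM sees
    the files as-is. Missing topics are simply skipped.
    """
    sections = []
    root = Path(memory_dir)
    for topic in topics:
        path = root / f"{topic}.md"
        if path.exists():
            sections.append(f"## {topic}\n{path.read_text()}")
    return "\n\n".join(sections)

def remember(memory_dir: str, topic: str, fact: str) -> None:
    """Updating memory is an append to a file; there is no re-indexing step."""
    path = Path(memory_dir) / f"{topic}.md"
    with path.open("a") as f:
        f.write(fact + "\n")
```

Because the state is plain files, backup, migration, and model swaps need no tooling beyond the filesystem itself.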
The full architecture is available as a product. Teams building autonomous agents can license the memory infrastructure through SIBYL's agent framework, which ships this system as a core module. One purchase, unlimited agents. The architecture is LLM-agnostic: any model that can read files can use it.
| | SIBYL | Typical vector system |
|---|---|---|
| Infrastructure | None. Files on disk. | Vector DB + embedding API |
| Additional infra cost | None | $19 to $249+/mo |
| Update latency | Instant (file edit) | Re-embed chunks |
| Portability | Any system, any LLM | Locked to embedding model |
| LongMemEval | 95.6% | 60-91% |
What this does not prove
This benchmark measures answer accuracy on a specific dataset with a specific evaluation methodology. It is worth stating clearly what it does not establish.
- No official leaderboard exists. Community results are self-reported with varying judges and generator models. Direct comparison across entries carries caveats.
- Judge model matters. SIBYL uses Claude as both the answering model and the evaluation judge for preference questions. Other entries use GPT-4o or GPT-4o-mini. Judge choice affects scores.
- Benchmark is not production. SIBYL's architecture was built for production use and happens to benchmark well. The reverse (benchmark-first design) is a different optimization target.
- Output format affects scoring. As documented above, our v1-to-v2 jump was largely a scoring artifact. Verbose answers that were substantively correct were being penalized. Other systems on this leaderboard may have similar hidden accuracy that their evaluation pipeline is not capturing.
We publish the raw data. Anyone can re-score.
Test conditions
| | |
|---|---|
| Dataset | LongMemEval Oracle (ICLR 2025, UCLA) |
| Questions | 500 total, 6 categories |
| Models | Claude Opus 4.6 (v2), Claude Sonnet (v2) |
| Hardware | 4 vCPU / 16GB RAM (AWS) |
| Architecture | SIBYL file-based memory (proprietary) |
| v1 baseline | Claude Sonnet, verbose output format |
| v2 upgrade | Both models, concise output format (reduced false negatives) |
| Scoring | Programmatic v3 matcher (substring, number, off-by-one tolerance, abstention detection, phrase overlap) with manual review of every flagged incorrect answer. Preference questions judged using official LongMemEval rubric criteria. |
| Raw data | All hypotheses and per-question judgments published below. |
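A simplified sketch of a multi-strategy matcher of this kind (this is an illustration of the listed strategies, not the proprietary v3 matcher; phrase overlap is omitted and the tolerance values are assumptions):

```python
import re

# Markers suggesting the model declined to answer; an abstention never matches.
ABSTENTION_MARKERS = ("i don't know", "no information", "not mentioned")

def _numbers(text: str) -> list[float]:
    """Extract all numeric values from a string."""
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]

def score(answer: str, gold: str) -> bool:
    """Match an answer against a gold string using layered strategies:
    abstention detection, substring match, then numeric off-by-one tolerance.
    """
    a, g = answer.lower().strip(), gold.lower().strip()
    # 1. Abstention detection: a refusal is always incorrect.
    if any(m in a for m in ABSTENTION_MARKERS):
        return False
    # 2. Substring: the gold answer appears verbatim in the response.
    if g in a:
        return True
    # 3. Numeric with off-by-one tolerance (inclusive vs. exclusive counting).
    gold_nums = _numbers(g)
    if gold_nums:
        return any(abs(x - y) <= 1 for x in _numbers(a) for y in gold_nums)
    return False
```

Answers the matcher still flags as incorrect are then the candidates for manual review, which keeps the human workload limited to genuine disagreements.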
Sources
- LongMemEval (ICLR 2025)
- Observational Memory (Mastra)
- Emergence AI
- Hindsight Benchmark
- Mem0 State of Agent Memory 2026
Raw data
Every model answer and every scoring judgment is published here. Re-score us yourself.