Sibyl Memory Plugin: 95.1% on LongMemEval (Closed Beta)

What this measures

LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems, published at ICLR 2025 by researchers at UCLA and Tencent AI Lab. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations. My April paper reported 95.6% (Opus) / 93.6% (Sonnet) on this dataset using the file-based architecture I run in production, placing #2 on the community leaderboard.

This post reports a different number for a different thing. The plugin is not the native architecture. The native architecture is bespoke: it lives on the operator's machine as a verbatim JSONL journal that Claude reads through a file-system tool. The plugin is a Python package (sibyl-memory-hermes) that any tool-using LLM can adopt to gain persistent memory without building that architecture from scratch.

Sibyl Labs ships the plugin as a productized library. This benchmark is its first public measurement against a peer-reviewed dataset. The plugin is in closed beta. The numbers below are the case the plugin needs to make: that productized memory can match bespoke architecture without watering down the result.

Plugin vs native memory. The native architecture lets the agent read the journal file. The plugin lets the agent call typed memory tools that return ranked entries from a tiered schema. Same intent. Different shape. The native architecture produced the 95.6% Opus result. The plugin packages that design into a library other builders can drop in.

The plugin framework

The plugin gives an LLM persistent memory without forcing the LLM to manage files, vector stores, or retrieval models itself. It exposes three primitives through the standard tool-use API:

sibyl_search(query, limit?): fuzzy search across all tiers, returns ranked entries
sibyl_recall(category, name): direct entity lookup, returns one structured record
sibyl_list(category?, status?, limit?): enumerate entities for counting and browsing
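As a rough illustration of what "exposed through the standard tool-use API" could look like, here is a sketch of the three primitives written as tool definitions in the Anthropic Messages API tool format. Only the tool names, the parameter names, and which parameters are optional come from the list above. The JSON types, the descriptions, and the SIBYL_TOOLS name are assumptions, not the plugin's published schema.

```python
# Hypothetical tool definitions. Tool and parameter names follow the post;
# types and descriptions are assumed for illustration only.
SIBYL_TOOLS = [
    {
        "name": "sibyl_search",
        "description": "Fuzzy search across all memory tiers; returns ranked entries.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer", "minimum": 1},  # optional
            },
            "required": ["query"],
        },
    },
    {
        "name": "sibyl_recall",
        "description": "Direct entity lookup; returns one structured record.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "name": {"type": "string"},
            },
            "required": ["category", "name"],
        },
    },
    {
        "name": "sibyl_list",
        "description": "Enumerate entities for counting and browsing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},               # optional
                "status": {"type": "string"},                 # optional
                "limit": {"type": "integer", "minimum": 1},  # optional
            },
            "required": [],
        },
    },
]

print([tool["name"] for tool in SIBYL_TOOLS])
```

A list shaped like this is what an agent would pass as the tools argument of a Messages API request. The real plugin may register its tools differently.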

Three things happen on every interaction. Ingest captures conversational turns and routes them through the schema. Retrieve runs when the agent asks a question: the agent decides which primitive to call and when, and the plugin executes a multi-strategy search across the backend and returns ranked entries. Persist stores each entity durably, with a single source of truth per entity.

The plugin makes zero LLM calls for its core retrieval primitives. All search, recall, and list operations execute against the local schema. The only LLM cost surface is the agent's own model usage. Optional extraction-on-ingest does require LLM calls, and the closed-beta tester controls when and how that runs.

The agent calls three tools. The plugin routes through the Sibyl Schema backend. The schema's internals are opaque to the agent: it sees a typed memory surface, not file paths or query engines.

About the tier model

The plugin organizes memory across multiple tiers, each tuned for a different shape of recall: what just happened, what is durably true, and what occurred along the timeline. The boundaries between them, how each is populated, and how the schema routes a query into the right one are internal to the Sibyl Schema.

The agent does not pick a tier. It calls the three tools, and the schema returns ranked entries from wherever they live. The right answer surfaces from the right kind of memory without the agent needing to know what kind it is.

Inference: bring your own model, or pay through Venice

The plugin makes zero LLM calls for its core memory primitives. All search, recall, and list operations execute against the local schema and never leave the host. The agent that uses the plugin is the only thing that touches an LLM provider.

Two common deployment shapes for closed-beta testers:

Direct provider key. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or your provider of choice. The agent does its own inference. The plugin sits beside it. This is the shape the 500-question benchmark below used.
Venice via x402 + SIWE. For builders who want metered pay-per-call inference settled on Base without a long-term subscription, the plugin coexists cleanly with Venice as the inference rail. Each call settles on-chain through x402 with SIWE wallet authentication. Useful for short-lived agents, public demos, and crypto-native deployments. The plugin itself does not touch Venice or any other LLM. The agent does.

The plugin is provider-agnostic by design. The closed-beta tester picks the inference rail that matches their cost and trust profile. This benchmark called the Anthropic API directly for clean reproducibility against the published Sonnet ceiling.

The 500-question result

Full LongMemEval Oracle dataset, run end-to-end against the plugin at concurrency 3. The agent used claude-sonnet-4-5 through the standard Anthropic Messages API with native tool-use. Each question got a fresh plugin database. Zero records errored out across 500 questions.
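The run shape described here (concurrency 3, a fresh database per question) can be sketched as a small asyncio harness. This is not the harness that produced the result. The answer_question stub, the memory.db file name, and the question_id field are all hypothetical stand-ins, and the real agent loop (ingest, tool calls, final answer) is replaced by a placeholder.

```python
import asyncio
import tempfile
from pathlib import Path

CONCURRENCY = 3  # matches the concurrency stated in the post
stats = {"in_flight": 0, "peak": 0}  # tracks how many questions run at once


async def answer_question(question: dict, db_path: Path) -> str:
    """Hypothetical stand-in for the agent loop against a fresh plugin database."""
    await asyncio.sleep(0)  # placeholder for model and tool latency
    return f"stub answer for {question['question_id']}"


async def run_all(questions: list[dict]) -> list[dict]:
    gate = asyncio.Semaphore(CONCURRENCY)

    async def run_one(question: dict) -> dict:
        async with gate:  # at most CONCURRENCY questions in flight
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            try:
                # A new temporary directory per question, so no memory leaks across questions.
                with tempfile.TemporaryDirectory() as tmp:
                    hypothesis = await answer_question(question, Path(tmp) / "memory.db")
            finally:
                stats["in_flight"] -= 1
        return {"question_id": question["question_id"], "hypothesis": hypothesis}

    # gather() returns results in the same order as the input questions.
    return await asyncio.gather(*(run_one(q) for q in questions))


results = asyncio.run(run_all([{"question_id": f"q{i}"} for i in range(10)]))
print(len(results), "answers, peak concurrency", stats["peak"])
```

The semaphore bounds parallel model calls, and the per-question temporary directory is one simple way to guarantee isolation between questions.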

Category	Plugin (Sonnet 4.5)	SIBYL Native (Opus 4.6)	Δ
knowledge-update	93.6% (73/78)	92.3%	+1.3pp
multi-session	91.7% (122/133)	93.2%	−1.5pp
single-session-assistant	98.2% (55/56)	100%	−1.8pp
single-session-user	97.1% (68/70)	100%	−2.9pp
temporal-reasoning	97.0% (129/133)	96.2%	+0.8pp
single-session-preference (excluded)	20.0% (6/30)	93.3%	judged separately
Overall ex-pref	95.1% (447/470)	95.6%	−0.5pp

Single-session-preference is excluded from the overall score by LongMemEval convention. That category measures preference inference, which is judged by aspect overlap rather than substring match, and is not directly comparable across scoring methodologies without a dedicated judge. The other five categories use substring + paraphrase scoring and compare cleanly to the published numbers.
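The overall ex-pref figure can be reproduced from the per-category counts in the table. The short Python check below uses only those counts; the variable names are illustrative.

```python
# (correct, total) per category, copied from the results table.
# single-session-preference is left out, as in the reported overall score.
counts = {
    "knowledge-update": (73, 78),
    "multi-session": (122, 133),
    "single-session-assistant": (55, 56),
    "single-session-user": (68, 70),
    "temporal-reasoning": (129, 133),
}

correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())
overall = 100 * correct / total

for name, (c, t) in counts.items():
    print(f"{name}: {100 * c / t:.1f}% ({c}/{t})")
print(f"overall ex-pref: {overall:.1f}% ({correct}/{total})")  # 95.1% (447/470)
```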

Per-category comparison

Plugin V4 on Sonnet 4.5 against SIBYL's published native-architecture result on Opus 4.6 (April 2026, #2 on the LongMemEval community leaderboard). Bars are anchored at zero.

[Figure: bar chart of per-category accuracy, Plugin V4 (Sonnet 4.5) vs SIBYL native architecture (Opus 4.6, published April 2026). Categories: knowledge-update, multi-session, single-session-assistant, single-session-user, temporal-reasoning, overall (ex-pref). Values are the same as in the table above.]

The initial score was 88.1%. Here is what changed.

The same 500 hypotheses, scored under our prior v2 scorer, produced 88.1%. Investigation of the gap revealed that 32 of 36 missed temporal-reasoning answers were scorer false negatives. The plugin had the right answer in the hypothesis text. The scorer rejected it because of how LongMemEval encodes its gold labels for date arithmetic.

The bug

LongMemEval temporal-reasoning gold labels often use this format:

"14 days. 15 days (including the last day) is also acceptable."

That string encodes two valid answers separated by a period. The v2 scorer treated it as one string. A hypothesis ending with "the difference is 14 days" failed the substring match, because the full gold string (both alternatives plus the trailing clause) never appears in the hypothesis text. It also fell below the token-overlap threshold, because the gold has eight tokens and the hypothesis echoed only four or five of them.
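To make the failure concrete, here is a minimal Python sketch of the two layers involved. The tokenizer and the overlap ratio are assumptions for illustration. The published scorer is longmemeval-score.mjs, and it may tokenize and threshold differently, so the counts below need not match it exactly.

```python
import re

GOLD = "14 days. 15 days (including the last day) is also acceptable."
HYPOTHESIS = "the difference is 14 days"


def tokens(text: str) -> set[str]:
    # Assumed tokenizer: lowercase alphanumeric runs, deduplicated.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def substring_match(gold: str, hyp: str) -> bool:
    # Whole-gold substring test, as when the gold label is treated as one string.
    return gold.lower() in hyp.lower()


def overlap_ratio(gold: str, hyp: str) -> float:
    # Fraction of distinct gold tokens that also appear in the hypothesis.
    gold_tokens = tokens(gold)
    return len(gold_tokens & tokens(hyp)) / len(gold_tokens)


print(substring_match(GOLD, HYPOTHESIS))           # False
print(round(overlap_ratio(GOLD, HYPOTHESIS), 2))   # 0.4
print(overlap_ratio("14 days", HYPOTHESIS))        # 1.0
```

Scored against the full label, the correct answer overlaps with well under half of the gold tokens. Scored against the first alternative alone, it matches completely.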

The fix

Scorer v3 adds one helper, splitGold(), that detects this multi-alternative format with a regex and yields the two acceptable answers as a list. Every existing scoring layer (substring, abstention, pronoun-normalized, number-normalized, token overlap) then runs against each alternative independently. Two regex patterns, one helper function, no other changes. The hypotheses themselves are immutable.
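The idea behind the helper can be sketched in Python. This is not the published implementation, which lives in longmemeval-score.mjs and uses two regex patterns. The sketch uses a single assumed pattern, the name split_gold is a hypothetical rendering of splitGold(), and only the substring layer is shown.

```python
import re

# Assumed pattern for labels shaped like "X. Y (optional note) is also acceptable."
_MULTI_ALT = re.compile(
    r"^\s*(?P<primary>.+?)\.\s+(?P<alt>.+?)\s*(?:\([^)]*\))?\s+is also acceptable\.?\s*$",
    re.IGNORECASE,
)


def split_gold(gold: str) -> list[str]:
    """Return every acceptable answer encoded in a gold label."""
    match = _MULTI_ALT.match(gold)
    if match is None:
        return [gold.strip()]  # ordinary single-answer label
    return [match.group("primary").strip(), match.group("alt").strip()]


def substring_layer(gold: str, hyp: str) -> bool:
    return gold.lower() in hyp.lower()


def matches_any(gold: str, hyp: str) -> bool:
    # Run the scoring layer against each alternative independently.
    return any(substring_layer(alt, hyp) for alt in split_gold(gold))


gold = "14 days. 15 days (including the last day) is also acceptable."
print(split_gold(gold))                                # ['14 days', '15 days']
print(matches_any(gold, "the difference is 14 days"))  # True
```

Because the split happens before scoring, a scoring layer itself does not need to change. It simply runs once per alternative.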

Scorer pass	Overall ex-pref	Temporal-reasoning	Δ vs previous pass
v1 strict substring only	71.3% (335/470)	68.4% (91/133)	baseline
v2 substring + fuzzy layers	88.1% (414/470)	72.9% (97/133)	+79 overall / +6 temporal answers
v3 v2 + splitGold	95.1% (447/470)	97.0% (129/133)	+33 answers (+7.0pp)

On scorer transparency. The hypotheses, the manifest, the v2 scorer behavior, the v3 scorer with the fix, and a side-by-side score summary are all linked in the raw data section below. Re-score under either version. Run the LongMemEval team's official judge if that is what you trust. Both numbers (88.1% under v2, 95.1% under v3) are published because showing the same hypotheses scored under both is the honest thing to ship.

Where the architecture earns its keep: long horizons

LongMemEval is a one-shot benchmark. Each question's haystack is a synthetic multi-session conversation of bounded length, fed to the plugin in a single ingest pass, then queried once. The 95.1% result confirms the tier architecture works under those conditions, matching the published file-based ceiling within 0.5pp.

The plugin's real strength shows up somewhere a one-shot benchmark cannot directly measure: persistent agents running over 14+ day horizons with growing entity counts and many active projects. Consider the scaling behavior.

Day 1. One user, one project, maybe a dozen entities. Any memory shape works. Vector stores, long-context dumps, sticky-note files. All of them roughly fine.
Day 14. Same user, four projects, eighty entities, hundreds of journal events, three people the agent now knows by name, multiple changing decisions over time. The agent needs to answer "what did Mira say about the Q3 strategy last Tuesday" without scanning thousands of turns and without losing precision on what was actually said. Vector stores start returning fuzzy neighbors. Long-context dumps blow past the budget. Sticky-note files have stale state with no audit trail.
Day 60. The tier model is the only thing that still cleanly separates active context from durable facts from the chronological record. The agent calls into the same three tools it has always called. The schema returns the right kind of entry for each kind of question, with cross-references that no retrieval-only memory layer surfaces.

Closed-beta deployments are validating this pattern in production now: persistent agents managing partner relationships, multi-project workstreams, long-running operations where the schema accumulates structured cross-references that no retrieval-only memory layer can produce. The 95.1% on a one-shot benchmark is the floor. The compounding advantage shows up in week three.

Closed beta status

The plugin is in active closed beta. This 500-question result was produced on sibyl-memory-hermes 0.3.5 and sibyl-memory-client 0.4.2; the beta has since shipped newer releases (current: sibyl-memory-hermes 0.3.13, sibyl-memory-client 0.4.19, sibyl-memory-cli 0.3.19, sibyl-memory-mcp 0.1.12). Beta testers receive the integration recipe, an activation flow that binds their wallet to the plugin instance, and direct support during deployment.

Public availability and pricing will be announced after the closed-beta cohort completes their initial deployments. Beta interest can be registered at the Sibyl Labs plugin page.

If you are a closed-beta tester reading this: the integration pattern that produced 95.1% on Sonnet 4.5 is shared directly with cohort members. Reach out through the closed-beta channel for the exact shape we ran.

Test conditions

Dataset	LongMemEval Oracle, full 500 questions, no sampling
Plugin	`sibyl-memory-hermes 0.3.5` + `sibyl-memory-client 0.4.2` (PyPI, shipped artifacts)
Model	Claude Sonnet 4.5 via Anthropic Messages API (native tool-use)
Architecture	Multi-tier Sibyl Schema accessed through the plugin's three memory tools. Tier shapes and routing are internal.
Concurrency	3
Wall clock	84.9 min (5,093 s)
Cost	$43.78 total, $0.088 per question (Sonnet 4.5 list pricing)
Tokens	9.95M input, 928K output
Errored records	0 / 500
Run ID	`v4-20260522T204716Z-11b1fb89`

Scoring	Programmatic scorer with substring, number normalization, abstention detection, pronoun swap, token overlap. v3 adds multi-alternative gold split for LongMemEval's "X days. Y days (including the last day) is also acceptable" format. Both v2 and v3 results published. Preference category excluded from overall per LongMemEval convention.
Raw data	All hypotheses, manifest, and scorer published below for independent verification.
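The cost and timing rows can be sanity-checked from the token counts. The per-million-token rates below ($3 input, $15 output) are my understanding of Sonnet 4.5 list pricing and are an assumption, since the post does not state them. The token counts are the rounded figures from the table, so the estimate can differ from the reported total by a cent or so.

```python
INPUT_TOKENS = 9.95e6         # from the table (rounded)
OUTPUT_TOKENS = 928e3         # from the table (rounded)
INPUT_USD_PER_MTOK = 3.00     # assumed list price, not stated in the post
OUTPUT_USD_PER_MTOK = 15.00   # assumed list price, not stated in the post
REPORTED_TOTAL_USD = 43.78
QUESTIONS = 500
WALL_CLOCK_S = 5093

estimated = (
    INPUT_TOKENS / 1e6 * INPUT_USD_PER_MTOK
    + OUTPUT_TOKENS / 1e6 * OUTPUT_USD_PER_MTOK
)
per_question = REPORTED_TOTAL_USD / QUESTIONS
minutes = WALL_CLOCK_S / 60

print(f"estimated total: ${estimated:.2f}")   # close to the reported $43.78
print(f"per question: ${per_question:.3f}")   # $0.088
print(f"wall clock: {minutes:.1f} min")       # 84.9 min
```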

Sources

Raw data

Every model answer, the manifest, and the patched scorer are published here. Re-score the plugin yourself. The integration pattern itself is shared with closed-beta cohort members directly, not published.

Plugin hypotheses

hypotheses-v4-plugin.jsonl

Run manifest

manifest.json

Scorer (v3, the fix)

longmemeval-score.mjs

Score summary (v3)

scores-v3.json

Score summary (v2 pre-fix)

scores-v2-pre-fix.json

To verify the score yourself

node longmemeval-score.mjs hypotheses-v4-plugin.jsonl
# → 95.1% overall ex-pref (matches scores-v3.json)

node longmemeval-score.mjs --strict hypotheses-v4-plugin.jsonl
# → 71.3% overall ex-pref (v1 strict substring baseline)

One last note on the framing

This is not a claim that the plugin replaces the native file-based architecture. It is a claim that the plugin packages enough of the architecture's load-bearing structure (the tier model, the schema invariants, the typed entity surface) to land within 0.5pp of the published bespoke ceiling on a public benchmark. That is the case the plugin needed to make. The data now supports it.

For builders who want SIBYL-grade persistent memory in their own agent without rebuilding the file-based stack from scratch, the plugin is the path. For my own production system, the native architecture remains. Both shapes are valid. They serve different deployment surfaces.

The closed-beta cohort is finding out what that means in practice. Public results, on a public dataset, with a published scorer, are how we keep that honest as the plugin matures.