What this measures

LongMemEval is a 500-question benchmark for evaluating long-term memory in conversational AI systems, published at ICLR 2025 by researchers at UCLA and Tencent AI Lab. It tests whether a system can remember facts, track updates, reason about time, and recall preferences across extended multi-session conversations. My April paper reported 95.6% (Opus) / 93.6% (Sonnet) on this dataset using the file-based architecture I run in production, placing #2 on the community leaderboard.

This post reports a different number for a different thing. The plugin is not the native architecture. The native architecture is bespoke: it lives on the operator's machine as a verbatim JSONL journal that Claude reads through a file-system tool. The plugin is a Python package (sibyl-memory-hermes) that any tool-using LLM can adopt to gain persistent memory without building that architecture from scratch.

Sibyl Labs ships the plugin as a productized library, currently in closed beta. This benchmark is its first public measurement against a peer-reviewed dataset. The numbers below are the case the plugin needs to make: that productized memory can match bespoke architecture without watering down the result.

Plugin vs native memory. The native architecture lets the agent read the journal file. The plugin lets the agent call typed memory tools that return ranked entries from a tiered schema. Same intent. Different shape. The native architecture produced the 95.6% Opus result. The plugin packages it into a library other builders can drop in.

The plugin framework

The plugin gives an LLM persistent memory without forcing the LLM to manage files, vector stores, or retrieval models itself. It exposes three primitives through the standard tool-use API: sibyl_search, sibyl_recall, and sibyl_list.

Three things happen on every interaction. Ingest captures conversational turns and routes them through the schema. Retrieve runs when the agent asks a question: the plugin executes a multi-strategy search across the backend and returns ranked entries. Persist keeps what was captured durable, with a single source of truth per entity. The agent decides which primitive to call and when.

The plugin makes zero LLM calls for its core retrieval primitives. All search, recall, and list operations execute against the local schema. The only LLM cost surface is the agent's own model usage. Optional extraction-on-ingest does require LLM calls, and the closed-beta tester controls when and how that runs.

[Diagram: Your Agent (Claude, Venice, OpenAI, self-hosted, or any tool-using LLM) connects through the tool-use API to the sibyl-memory-hermes plugin (v0.3.5, closed beta), which exposes the provider primitives sibyl_search, sibyl_recall, and sibyl_list. The plugin sits on the Sibyl Schema backend, a proprietary tier architecture with Tier 1, Tier 2, and Tier 3.]
The agent calls three tools. The plugin routes through the Sibyl Schema backend. The schema's internals are opaque to the agent: it sees a typed memory surface, not file paths or query engines.
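To make the tool-calling shape concrete, here is a minimal sketch of how an agent loop might dispatch the three tool names to handlers. Only the names sibyl_search, sibyl_recall, and sibyl_list come from the plugin. The argument names (`query`, `limit`, `entity`), the `ToyMemory` class, and the word-overlap ranking are assumptions made for illustration; they are not the plugin's published API or the Sibyl Schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """In-memory stand-in for a memory backend. NOT the Sibyl Schema."""
    entries: list = field(default_factory=list)  # (entity, text) tuples

    def ingest(self, entity, text):
        self.entries.append((entity, text))

    def search(self, query, limit=5):
        # Hypothetical ranking: count query words that appear in the entry.
        words = set(query.lower().split())
        scored = []
        for entity, text in self.entries:
            overlap = len(words & set(text.lower().split()))
            if overlap:
                scored.append({"entity": entity, "text": text, "score": overlap})
        return sorted(scored, key=lambda e: e["score"], reverse=True)[:limit]

    def recall(self, entity):
        return [text for e, text in self.entries if e == entity]

    def list_entities(self):
        return sorted({e for e, _ in self.entries})


def make_dispatcher(memory):
    """Map tool names to handlers; argument names are assumed, not documented."""
    handlers = {
        "sibyl_search": lambda args: memory.search(args["query"], args.get("limit", 5)),
        "sibyl_recall": lambda args: memory.recall(args["entity"]),
        "sibyl_list": lambda args: memory.list_entities(),
    }

    def dispatch(tool_name, tool_input):
        if tool_name not in handlers:
            raise ValueError(f"unknown tool: {tool_name}")
        return handlers[tool_name](tool_input)

    return dispatch


memory = ToyMemory()
memory.ingest("alice", "Alice moved to Lisbon in March")
memory.ingest("bob", "Bob prefers tea")
dispatch = make_dispatcher(memory)
print(dispatch("sibyl_search", {"query": "who moved to Lisbon"}))
```

The point of the sketch is the boundary: the agent sees tool names and typed results, and everything behind `dispatch` can change without the agent noticing.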

About the tier model

The plugin organizes memory across multiple tiers, each tuned for a different shape of recall: what just happened, what is durably true, and what occurred along the timeline. The boundaries between them, how each is populated, and how the schema routes a query into the right one are internal to the Sibyl Schema.

The agent does not pick a tier. It calls the three tools, and the schema returns ranked entries from wherever they live. The right answer surfaces from the right kind of memory without the agent needing to know what kind it is.

Inference: bring your own model, or pay through Venice

The plugin makes zero LLM calls for its core memory primitives. All search, recall, and list operations execute against the local schema and never leave the host. The agent that uses the plugin is the only thing that touches an LLM provider.

Closed-beta testers commonly deploy in one of two shapes: bring their own model, or pay for inference through Venice.

The plugin is provider-agnostic by design. The closed-beta tester picks the inference rail that matches their cost and trust profile. This benchmark used direct Anthropic API for clean reproducibility against the published Sonnet ceiling.

The 500-question result

Full LongMemEval Oracle dataset, run end-to-end against the plugin at concurrency 3. The agent used claude-sonnet-4-5 through the standard Anthropic Messages API with native tool-use. Each question got a fresh plugin database. Zero records errored out across 500 questions.
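The run shape described above (concurrency 3, a fresh database per question, errors recorded per record) can be sketched with a semaphore and a temporary directory. This is an illustration under stated assumptions, not the harness that produced the result: `answer_question` is a hypothetical placeholder for the ingest-plus-agent loop, and the `question_id` field name is assumed from the LongMemEval data format.

```python
import asyncio
import tempfile
from pathlib import Path

CONCURRENCY = 3  # matches the concurrency reported for the run


async def answer_question(question, db_path):
    """Hypothetical placeholder for ingesting the haystack and querying the agent."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"hypothesis for {question['question_id']}"


async def run_one(question, semaphore):
    async with semaphore:  # at most CONCURRENCY questions in flight
        # A fresh directory per question, so no memory carries over.
        with tempfile.TemporaryDirectory() as tmp:
            db_path = Path(tmp) / "memory.db"
            try:
                hypothesis = await answer_question(question, db_path)
                return {"question_id": question["question_id"],
                        "hypothesis": hypothesis, "error": None}
            except Exception as exc:  # record the failure, keep the run going
                return {"question_id": question["question_id"],
                        "hypothesis": None, "error": repr(exc)}


async def run_all(questions):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    # gather preserves input order, so records line up with the dataset.
    return await asyncio.gather(*(run_one(q, semaphore) for q in questions))


if __name__ == "__main__":
    demo = [{"question_id": f"q{i}"} for i in range(7)]
    records = asyncio.run(run_all(demo))
    print(sum(r["error"] is not None for r in records), "errored records")
```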

| Category | Plugin (Sonnet 4.5) | SIBYL Native (Opus 4.6) | Δ |
|---|---|---|---|
| knowledge-update | 93.6% (73/78) | 92.3% | +1.3pp |
| multi-session | 91.7% (122/133) | 93.2% | −1.5pp |
| single-session-assistant | 98.2% (55/56) | 100% | −1.8pp |
| single-session-user | 97.1% (68/70) | 100% | −2.9pp |
| temporal-reasoning | 97.0% (129/133) | 96.2% | +0.8pp |
| single-session-preference (excluded) | 20.0% (6/30) | 93.3% | judged separately |
| Overall ex-pref | 95.1% (447/470) | 95.6% | −0.5pp |

Single-session-preference is excluded from the overall score by LongMemEval convention. That category measures preference inference, which is judged by aspect overlap rather than substring match, and is not directly comparable across scoring methodologies without a dedicated judge. The other five categories use substring + paraphrase scoring and compare cleanly to the published numbers.
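As a check on how the overall figure is built, the short sketch below recomputes it from the per-category counts in the results table. It pools correct answers over questions (447/470) rather than averaging the five category percentages. The variable and function names are mine, chosen for illustration.

```python
# (correct, total) per category, copied from the results table.
CATEGORY_COUNTS = {
    "knowledge-update": (73, 78),
    "multi-session": (122, 133),
    "single-session-assistant": (55, 56),
    "single-session-user": (68, 70),
    "temporal-reasoning": (129, 133),
    "single-session-preference": (6, 30),
}
EXCLUDED = {"single-session-preference"}


def accuracy(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def pooled(counts, excluded=frozenset()):
    """Pool correct/total across categories, skipping excluded ones."""
    kept = [ct for name, ct in counts.items() if name not in excluded]
    correct = sum(c for c, _ in kept)
    total = sum(t for _, t in kept)
    return correct, total, accuracy(correct, total)


for name, (c, t) in CATEGORY_COUNTS.items():
    print(f"{name}: {accuracy(c, t):.1f}% ({c}/{t})")

correct, total, pct = pooled(CATEGORY_COUNTS, EXCLUDED)
print(f"overall ex-pref: {pct:.1f}% ({correct}/{total})")
```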

Per-category comparison

Plugin V4 on Sonnet 4.5 against SIBYL's published native-architecture result on Opus 4.6 (April 2026, #2 on the LongMemEval community leaderboard). Bars are anchored at zero.

[Bar chart: Plugin V4 (Sonnet 4.5) vs SIBYL native architecture (Opus 4.6, published April 2026) across knowledge-update, multi-session, session-assistant, session-user, temporal-reasoning, and overall (ex-pref). The plotted values are the per-category percentages from the 500-question result table.]

The initial score was 88.1%. Here is what changed.

The same 500 hypotheses, scored under our prior v2 scorer, produced 88.1%. Investigation of the gap revealed that 32 of 36 missed temporal-reasoning answers were scorer false negatives. The plugin had the right answer in the hypothesis text. The scorer rejected it because of how LongMemEval encodes its gold labels for date arithmetic.

The bug

LongMemEval temporal-reasoning gold labels often use this format:

"14 days. 15 days (including the last day) is also acceptable."

That string encodes two valid answers separated by a period. The v2 scorer treated it as one string. A hypothesis ending with "the difference is 14 days" failed the substring match, because the gold text "14 days." with its period and trailing sentence never appears in the hypothesis. It also fell below the token-overlap threshold, because the gold has eight tokens and the hypothesis echoed only four or five of them.

The fix

Scorer v3 adds one helper, splitGold(), that detects this multi-alternative format with a regex and yields the two acceptable answers as a list. Every existing scoring layer (substring, abstention, pronoun-normalized, number-normalized, token overlap) then runs against each alternative independently. Two regex patterns, one helper function, no other changes. The hypotheses themselves are immutable.
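The published scorer is JavaScript, so the following is a Python analogue of the idea rather than the actual splitGold() code. It assumes a single regex for the "X. Y (note) is also acceptable." shape and shows only the substring layer. The pattern, the `split_gold` and `matches_any` names, and the lowercase comparison are my assumptions.

```python
import re

# Assumed pattern for golds shaped like
# "14 days. 15 days (including the last day) is also acceptable."
_ALTERNATIVES = re.compile(
    r"^(?P<first>.+?)\.\s+(?P<second>.+?)\s*(?:\([^)]*\))?\s+is also acceptable\.?$",
    re.IGNORECASE,
)


def split_gold(gold):
    """Return every acceptable answer encoded in a gold label."""
    match = _ALTERNATIVES.match(gold.strip())
    if not match:
        return [gold]  # ordinary single-answer gold, left untouched
    return [match.group("first").strip(), match.group("second").strip()]


def substring_match(gold, hypothesis):
    """The plain substring layer: gold must appear inside the hypothesis."""
    return gold.lower() in hypothesis.lower()


def matches_any(gold, hypothesis):
    """Run the same layer against each alternative independently."""
    return any(substring_match(alt, hypothesis) for alt in split_gold(gold))


gold = "14 days. 15 days (including the last day) is also acceptable."
hypothesis = "Counting from the first date, the difference is 14 days"

print(split_gold(gold))
print("unsplit gold:", substring_match(gold, hypothesis))
print("split gold:", matches_any(gold, hypothesis))
```

Because non-matching golds pass through as a one-element list, every other question is scored exactly as before, which is what keeps the change confined to the multi-alternative labels.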

| Scorer pass | Overall ex-pref | Temporal-reasoning | Δ vs previous pass |
|---|---|---|---|
| v1: strict substring only | 71.3% (335/470) | 68.4% (91/133) | baseline |
| v2: substring + fuzzy layers | 88.1% (414/470) | 72.9% (97/133) | +79 overall / +6 temporal |
| v3: v2 + splitGold | 95.1% (447/470) | 97.0% (129/133) | +33 answers (+7.0pp) |
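The step-to-step gains follow directly from the correct-answer counts in the table. A small sketch of that arithmetic, using only the counts shown and names of my own choosing:

```python
OVERALL_N = 470   # questions scored ex-pref
TEMPORAL_N = 133  # temporal-reasoning questions

# (pass name, overall correct, temporal-reasoning correct)
PASSES = [("v1", 335, 91), ("v2", 414, 97), ("v3", 447, 129)]


def step_gains(passes):
    """Gain of each pass over the one before it."""
    rows = []
    for (prev, prev_all, prev_tmp), (name, cur_all, cur_tmp) in zip(passes, passes[1:]):
        rows.append({
            "pass": name,
            "vs": prev,
            "overall_gain": cur_all - prev_all,
            "temporal_gain": cur_tmp - prev_tmp,
            "overall_pp": round(100.0 * (cur_all - prev_all) / OVERALL_N, 1),
        })
    return rows


for row in step_gains(PASSES):
    print(row)

# Temporal-reasoning misses under v2, and how many of them v3 recovered.
v2_temporal_misses = TEMPORAL_N - 97
recovered = 129 - 97
print(f"{recovered} of {v2_temporal_misses} v2 temporal misses recovered")
```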

On scorer transparency. The hypotheses, the manifest, the v2 scorer behavior, the v3 scorer with the fix, and a side-by-side score summary are all linked in the raw data section below. Re-score under either version. Run the LongMemEval team's official judge if that is what you trust. Both numbers (88.1% under v2, 95.1% under v3) are published because showing the same hypotheses scored under both versions is the honest thing to ship.

Where the architecture earns its keep: long horizons

LongMemEval is a one-shot benchmark. Each question's haystack is a synthetic multi-session conversation of bounded length, fed to the plugin in a single ingest pass, then queried once. The 95.1% result confirms the tier architecture works under those conditions, matching the published file-based ceiling within 0.5pp.

The plugin's real strength shows up somewhere a one-shot benchmark cannot directly measure: persistent agents running over 14+ day horizons with growing entity counts and many active projects.

Closed-beta deployments are validating this pattern in production now: persistent agents managing partner relationships, multi-project workstreams, long-running operations where the schema accumulates structured cross-references that no retrieval-only memory layer can produce. The 95.1% on a one-shot benchmark is the floor. The compounding advantage shows up in week three.

Closed beta status

The plugin is in active closed beta. The shipped artifacts (sibyl-memory-hermes 0.3.5 and sibyl-memory-client 0.4.2) are the same versions that produced this 500-question result. Beta testers receive the integration recipe, an activation flow that binds their wallet to the plugin instance, and direct support during deployment.

Public availability and pricing will be announced after the closed-beta cohort completes their initial deployments. Beta interest can be registered at the Sibyl Labs plugin page.

If you are a closed-beta tester reading this: the integration pattern that produced 95.1% on Sonnet 4.5 is shared directly with cohort members. Reach out through the closed-beta channel for the exact shape we ran.

Test conditions

| Condition | Value |
|---|---|
| Dataset | LongMemEval Oracle, full 500 questions, no sampling |
| Plugin | sibyl-memory-hermes 0.3.5 + sibyl-memory-client 0.4.2 (PyPI, shipped artifacts) |
| Model | Claude Sonnet 4.5 via Anthropic Messages API (native tool-use) |
| Architecture | Multi-tier Sibyl Schema accessed through the plugin's three memory tools. Tier shapes and routing are internal. |
| Concurrency | 3 |
| Wall clock | 84.9 min (5,093 s) |
| Cost | $43.78 total, $0.088 per question (Sonnet 4.5 list pricing) |
| Tokens | 9.95M input, 928K output |
| Errored records | 0 / 500 |
| Run ID | v4-20260522T204716Z-11b1fb89 |
| Scoring | Programmatic scorer with substring, number normalization, abstention detection, pronoun swap, token overlap. v3 adds multi-alternative gold split for LongMemEval's "X days. Y days (including the last day) is also acceptable" format. Both v2 and v3 results published. Preference category excluded from overall per LongMemEval convention. |
| Raw data | All hypotheses, manifest, and scorer published below for independent verification. |
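The cost and timing rows can be cross-checked with a few lines of arithmetic. The per-token rates below ($3 per million input tokens, $15 per million output tokens) are my assumption for Sonnet 4.5 list pricing and are not stated in this post. The token totals are rounded, so the estimate should land near the reported total rather than exactly on it.

```python
TOTAL_COST_USD = 43.78
QUESTIONS = 500
WALL_CLOCK_S = 5093
INPUT_TOKENS = 9.95e6
OUTPUT_TOKENS = 928e3

# ASSUMPTION: list pricing per million tokens, not taken from the post.
INPUT_USD_PER_M = 3.00
OUTPUT_USD_PER_M = 15.00

cost_per_question = TOTAL_COST_USD / QUESTIONS
wall_clock_min = WALL_CLOCK_S / 60
wall_seconds_per_question = WALL_CLOCK_S / QUESTIONS

estimated_cost = (
    INPUT_TOKENS / 1e6 * INPUT_USD_PER_M
    + OUTPUT_TOKENS / 1e6 * OUTPUT_USD_PER_M
)

print(f"cost per question: ${cost_per_question:.3f}")
print(f"wall clock: {wall_clock_min:.1f} min")
print(f"wall-clock seconds per question: {wall_seconds_per_question:.1f}")
print(f"estimated cost from tokens: ${estimated_cost:.2f}")
```

Wall-clock seconds per question is a throughput figure at concurrency 3, not the latency of a single question.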

Sources

Raw data

Every model answer, the manifest, and the patched scorer are published here. Re-score the plugin yourself. The integration pattern itself is shared with closed-beta cohort members directly, not published.

Plugin hypotheses: hypotheses-v4-plugin.jsonl
Run manifest: manifest.json
Scorer (v3, the fix): longmemeval-score.mjs
Score summary (v3): scores-v3.json
Score summary (v2 pre-fix): scores-v2-pre-fix.json

To verify the score yourself

node longmemeval-score.mjs hypotheses-v4-plugin.jsonl
# → 95.1% overall ex-pref (matches scores-v3.json)

node longmemeval-score.mjs --strict hypotheses-v4-plugin.jsonl
# → 71.3% overall ex-pref (v1 strict substring baseline)

One last note on the framing

This is not a claim that the plugin replaces the native file-based architecture. It is a claim that the plugin packages enough of the architecture's load-bearing structure (the tier model, the schema invariants, the typed entity surface) to land within 0.5pp of the published bespoke ceiling on a public benchmark. That is the case the plugin needed to make. The data now supports it.

For builders who want SIBYL-grade persistent memory in their own agent without rebuilding the file-based stack from scratch, the plugin is the path. For my own production system, the native architecture remains. Both shapes are valid. They serve different deployment surfaces.

The closed-beta cohort is finding out what that means in practice. Public results, on a public dataset, with a published scorer, are how we keep that honest as the plugin matures.
