The question that matters: does answering from Coalent's cached understanding hold answer quality against the retriever it sits on top of — especially the numbers — while spending far fewer context tokens, attributing every figure to the right source, and answering questions a single retrieval pass can't?

So the benchmark refuses every shortcut that would flatter us: no LLM-judge (which self-prefers), a retriever shared by both arms so we can't win on retrieval, and grading with escalation off so we measure the cache alone. Two runs live on this page: our own structured-regime harness (synthetic templates), and a real-world run on MultiHopRAG with third-party gold questions.

Setup (structured regime)

Structured/reuse workload — 64 sources × 3 seeds = 192 reads per condition. This is the corpus Coalent is built for: facts that get re-asked.
Real OpenAI embeddings — no offline hashing stand-in that could inflate routing.
A deterministic accuracy check — a literal number + attribute match against gold. No gpt-4o judge grading its own family's answers.
A real dense top-5 retriever, shared by both arms. Naive RAG is that retriever, re-read every query. Coalent sits above the same retriever. Whatever Coalent gains is not a retrieval advantage.
Graded escalation-off. Coalent's RAG floor (append raw retrieval when coverage is low) is disabled during grading, so the number reflects the cache's own understanding — not a fall-back to full context.
Twin-free corpus with asserted invariants — every identity resolves to one value (see the transparency note below for why this matters).

Structured-regime benchmark (synthetic templates)

On the synthetic template corpus, answer accuracy is at parity across four answer models — the 95% confidence intervals overlap in every row — while Coalent sends a fraction of the context.

Answer model	Naive dense RAG	Coalent
gpt-4o-mini	0.81	0.81
gpt-4.1-mini	0.90	0.85
gpt-4o	0.90	0.87
gpt-4.1	0.99	0.97
Context tokens / read	126	47

Same retriever, same gold, same deterministic check — Coalent lands on top of naive RAG's accuracy while carrying roughly a third of the context per read. The extractive understanding (a query-independent list of source-grounded claims) keeps the figures that a prose summary would drop.

Trustworthy, not just cheap

Cheap is easy if you're willing to answer from the wrong source. Coalent isn't.

route@1 ≈ 1.00 — the cache picks the correct unit for the query essentially every time.
Misattribution ≈ 0–2% — serving a number from the wrong source is down at the level of the naive answerer's own noise. When Coalent gives you a figure, it came from the source it says it did.
Multi-hop: naive 0% → Coalent 100% (a template-fixture result — synthetic bridge questions; see the real-world run below). Questions whose answer requires bridging two facts from two different sources are structurally impossible for a single top-K retrieval pass. Coalent's cross-unit claim recall (on by default) pools per-claim memory across all fresh units with MaxSim and surfaces the bridge fact — at zero extra LLM calls.

Real-world benchmark (MultiHopRAG)

New in v0.5 Everything above is our own harness. This one isn't: MultiHopRAG — 609 real news articles with third-party gold questions. Answerer gpt-4.1-mini, exact-match grading, the identical embedder in both arms, and every Coalent number pre-registered on held-out questions.

Headline: Coalent matches chunk-RAG's best measured accuracy at ~0.8x the tokens (pre-registered, held-out n=605).

The fairness control we run that most benchmarks don't: naive's own token-scaling curve, on the same stream — so Coalent is compared against every operating point of the baseline, not one convenient k:

Arm	Accuracy	Context tokens / read
naive k=4	0.582	590
naive k=6	0.638	882
naive k=9 (best arm)	0.711	1,311
Coalent pool (`serve="pool"`, warmed cache)	0.699	~1,036

Pool serving (the v0.5 serve="pool" read path, warmed cache): 0.699 @ ~1,036 tokens — beats naive k6 (paired McNemar +81/−44, z = 3.22) and statistically ties naive's best arm k9 at 0.79x its tokens (z = 0.60).
The read path is the win: the old unit-anchored serving scored 0.579 @ 706 tokens on the same cache — pool serving is +0.12 accuracy (z = 6.66, n = 605).
Null honesty (n=100 unanswerable questions): pool refuses 95% vs naive k6/k9's 88%/85% — fewer confident answers to questions that have no answer.
The build layer (cold-start pilot, units built on the fly): widened units read a median of 23 chunks of their source vs 2 for keyhole builds; rebuild churn fell 460 → 31; admission reuse ×4; warm-pass accuracy flipped from decaying (−0.03) to compounding (+0.04).
Misattribution 2–6%; cross-unit recall fired on ~90% of reads — the first fully instrumented run.
Economics: extractor build spend ~$0.14–0.20 per ~600-read stream (gpt-4o-mini); break-even ≈ 4–5 reads per source.

The honest framing. This corpus is maximally chunk-friendly — its questions were generated from article sentences, so serving verbatim chunks is close to a ceiling strategy. We do not claim to beat naive's curve here. The claim is match at fewer tokens — the best chunk-RAG arm's accuracy at 0.79x its tokens — plus what naive cannot do at any k: freshness, provenance, and compounding reuse.

Economics (structured regime)

Building a unit costs about 430 tokens and ~4s per source, once. After that every read is a cheap embedding lookup instead of a full-context re-read.

Break-even at ~4–5 reads per source. On a reuse corpus (192 reads over 64 sources) you clear that on almost every source.
Cheaper on every read forever after — 47 vs 126 context tokens, no re-synthesis, freshness handled by dirtying the single affected unit instead of rebuilding a graph.

This is the opposite of GraphRAG's build-the-whole-graph-upfront tax: units are built lazily, only when a query needs one.

✦

The honest read. On the structured-regime (synthetic template) benchmark, answer quality is on par with naive dense RAG — the CIs overlap on all four models — at roughly a third of the context tokens per read, with the right-source guarantee (route@1 ≈ 1.00) that raw retrieval doesn't give you. The multi-hop capability gap — naive 0%, Coalent 100%, at zero extra LLM calls — is a template-fixture result; the real-world MultiHopRAG run above is the uncontrived version of the same story. Parity accuracy is the floor here, not the headline.

Transparency — the 27% that was a benchmark bug. An earlier internal run reported ~27% misattribution, and we chased it as a routing failure. It wasn't. The corpus contained contradictory duplicate sources — two "facts" asserting different values for the same identity. No router can resolve that; it's a coin flip, and a zero-residual coin-flip ceiling fit confirmed the error rate was exactly the pigeonhole artifact, not a model weakness. We fixed the corpus (identity-unique, with asserted invariants), not the metric — and route@1 went to 1.00 with misattribution to ~0–2%. We're telling you this because a benchmark you can't audit isn't worth quoting. Our earlier token-only, independent-judge benchmark is superseded by the numbers above and no longer represents Coalent.

Also shipped: a deterministic mechanism check

For a no-API-key, deterministic check of the freshness/cost mechanism (no LLM), the package ships run_benchmark:

from coalent.evaluation import run_benchmark

for name, report in run_benchmark().items():
    print(name, report.accuracy, report.stale_rate, report.cost_tokens)

How it works — the mechanism behind the numbers.
Provenance & freshness — why the stale-rate is zero.

Benchmark