Coalent ships its own benchmark, built to be honest: an independent correctness oracle (the harness's own ground truth, not the cache's hashes), real token cost, a multi-document corpus under churn, and non-strawman baselines.
```python
from coalent.evaluation import run_benchmark

# run_benchmark() returns a mapping of system name to its report.
for name, report in run_benchmark().items():
    print(name, report.accuracy, report.stale_rate, report.cost_tokens)
```
What it compares
- NaiveRAG — re-retrieves on every read. Always fresh, full cost.
- StaleCache — a semantic cache without provenance invalidation (the moat removed). Cheap, but goes stale.
- Coalent — the semantic cache with surgical invalidation.
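To make the contrast concrete, here is a minimal toy sketch of the three strategies replaying the same stream of writes and reads. It is not Coalent's implementation or API: `simulate`, `Report`, `EVENTS`, the strategy names, and the token costs are all hypothetical and chosen only for illustration. The live corpus stands in for the independent oracle, so staleness is judged against ground truth and never against the cache's own state.

```python
from dataclasses import dataclass

RETRIEVE_COST = 7  # hypothetical token cost of one full retrieval
HIT_COST = 0       # hypothetical token cost of serving a cached answer


@dataclass
class Report:
    accuracy: float
    stale_rate: float
    cost_tokens: int


def simulate(strategy: str, events: list[tuple]) -> Report:
    """Replay ('write', doc, text) and ('read', doc) events under one strategy."""
    corpus: dict[str, str] = {}  # ground truth; also serves as the oracle
    cache: dict[str, str] = {}   # doc id -> answer derived from that doc
    reads = stale = cost = 0
    for event in events:
        if event[0] == "write":
            _, doc, text = event
            corpus[doc] = text
            if strategy == "provenance":
                # Drop only the entry derived from the changed document.
                cache.pop(doc, None)
            continue
        _, doc = event
        reads += 1
        if strategy != "naive" and doc in cache:
            answer = cache[doc]
            cost += HIT_COST
        else:
            answer = corpus[doc]
            cost += RETRIEVE_COST
            if strategy != "naive":
                cache[doc] = answer
        if answer != corpus[doc]:  # oracle check against the live corpus
            stale += 1
    stale_rate = stale / reads
    # In this toy, an answer is wrong exactly when it is stale.
    return Report(1 - stale_rate, stale_rate, cost)


EVENTS = [
    ("write", "a", "v1"), ("write", "b", "v1"),
    ("read", "a"), ("read", "b"), ("read", "a"),
    ("write", "a", "v2"),  # churn: document "a" changes
    ("read", "a"), ("read", "b"),
]

reports = {name: simulate(name, EVENTS) for name in ("naive", "stale", "provenance")}
for name, report in reports.items():
    print(name, report.accuracy, report.stale_rate, report.cost_tokens)
```

In this toy, the never-invalidating cache is cheapest but serves an outdated answer after the write to `a`, while the provenance-aware cache pays for one extra retrieval and stays fresh. The real benchmark's corpus, cost model, and invalidation logic are richer than this.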
The measured result
| System | Accuracy | Stale-rate | Cost (tokens) |
|---|---|---|---|
| NaiveRAG | 1.00 | 0.00 | 70 |
| StaleCache | 0.80 | 0.20 | 35 |
| Coalent | 1.00 | 0.00 | 49 |
Coalent matches naive RAG's perfect freshness at 30% lower cost (49 vs. 70 tokens), while the provenance-less cache is cheapest (35 tokens) but has a 20% stale-rate. That is the thesis, cheap and fresh, measured rather than asserted.
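The savings figure follows directly from the table. A short sketch of the arithmetic, with the numbers copied from the table by hand (the `results` and `savings` names are illustrative, not part of Coalent):

```python
# Figures copied from the results table above.
results = {
    "NaiveRAG": {"accuracy": 1.00, "stale_rate": 0.00, "cost_tokens": 70},
    "StaleCache": {"accuracy": 0.80, "stale_rate": 0.20, "cost_tokens": 35},
    "Coalent": {"accuracy": 1.00, "stale_rate": 0.00, "cost_tokens": 49},
}

baseline = results["NaiveRAG"]["cost_tokens"]

# Fractional cost reduction of each system relative to NaiveRAG.
savings = {
    name: (baseline - row["cost_tokens"]) / baseline
    for name, row in results.items()
}

for name, saving in savings.items():
    print(f"{name}: {saving:.0%} cheaper than NaiveRAG")
```

Coalent's 21-token saving on a 70-token baseline is the 30% quoted above; StaleCache saves 50% but gives up freshness to do so.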
Scope, honestly. This is a deterministic mechanism benchmark (no LLM): it demonstrates the cache's freshness/cost behaviour. It does not measure answer quality against GraphRAG on a real dataset; that's a separate, model-backed benchmark. We don't claim what we haven't measured.
Next
- How it works — the mechanism behind the numbers.
- Provenance & freshness — why the stale-rate is zero.