Benchmark

An honest, runnable mechanism benchmark — the cheap + fresh triangle, measured against naive RAG and a provenance-less cache.

Coalent ships its own benchmark, built to be honest: an independent correctness oracle (the harness's own ground truth, not the cache's hashes), real token cost, a multi-document corpus under churn, and non-strawman baselines.

from coalent.evaluation import run_benchmark

for name, report in run_benchmark().items():
    print(name, report.accuracy, report.stale_rate, report.cost_tokens)

What it compares

  • NaiveRAG — re-retrieves on every read. Always fresh, full cost.
  • StaleCache — a semantic cache without provenance invalidation (the moat removed). Cheap, but goes stale.
  • Coalent — the semantic cache with surgical invalidation.
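
The difference between the three systems can be sketched with a toy simulation. This is a minimal illustration under stated assumptions, not Coalent's implementation: the classes `NaiveRAG`, `StaleCache`, and `ProvenanceCache`, the `on_update` hook, the whitespace token count, and the five-read trace are all hypothetical stand-ins invented for this example. The ground truth is the corpus itself, so correctness is judged independently of any cache state.

```python
# Toy sketch only: hypothetical stand-ins, not Coalent's classes or API.

def tokens(text: str) -> int:
    """Crude token count (whitespace split), an assumption for illustration."""
    return len(text.split())


class NaiveRAG:
    """Re-retrieves from the corpus on every read: always fresh, full cost."""

    def __init__(self, corpus: dict):
        self.corpus = corpus
        self.cost_tokens = 0

    def read(self, doc_id: str) -> str:
        text = self.corpus[doc_id]
        self.cost_tokens += tokens(text)
        return text

    def on_update(self, doc_id: str) -> None:
        pass  # nothing cached, so nothing to invalidate


class StaleCache(NaiveRAG):
    """Caches the first retrieval and never invalidates it."""

    def __init__(self, corpus: dict):
        super().__init__(corpus)
        self.cache = {}

    def read(self, doc_id: str) -> str:
        if doc_id not in self.cache:
            self.cache[doc_id] = super().read(doc_id)  # pay only on a miss
        return self.cache[doc_id]


class ProvenanceCache(StaleCache):
    """Same cache, but drops exactly the entry whose source document changed."""

    def on_update(self, doc_id: str) -> None:
        self.cache.pop(doc_id, None)


# A small trace with churn: document "a" changes between reads.
EVENTS = [
    ("read", "a", None),
    ("read", "b", None),
    ("read", "a", None),
    ("update", "a", "alpha v2"),
    ("read", "a", None),
    ("read", "b", None),
]


def run(system_cls):
    """Replay EVENTS and return (accuracy, stale_rate, cost_tokens)."""
    corpus = {"a": "alpha v1", "b": "beta v1"}  # the oracle: current truth
    system = system_cls(corpus)
    reads = correct = 0
    for kind, doc_id, new_text in EVENTS:
        if kind == "update":
            corpus[doc_id] = new_text
            system.on_update(doc_id)
        else:
            reads += 1
            correct += system.read(doc_id) == corpus[doc_id]
    return correct / reads, (reads - correct) / reads, system.cost_tokens


for cls in (NaiveRAG, StaleCache, ProvenanceCache):
    print(cls.__name__, *run(cls))
```

On this toy trace the three systems spend 10, 4, and 6 tokens respectively, and only `StaleCache` returns an outdated answer (one of five reads). These numbers belong to the sketch and are not the benchmark's results.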

The measured result

System       Accuracy   Stale-rate   Cost (tokens)
NaiveRAG     1.00       0.00         70
StaleCache   0.80       0.20         35
Coalent      1.00       0.00         49

Coalent matches naive RAG's perfect freshness (accuracy 1.00, stale-rate 0.00) at 30% lower token cost (49 vs. 70), while the provenance-less cache is cheapest (35 tokens) but has a 20% stale rate. That is the thesis, cheap and fresh, measured rather than asserted.
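
The cost comparison follows directly from the token counts in the results table. The snippet below is a small worked calculation using those published figures; the `RESULTS` dictionary and the `saving_vs_naive` helper are names made up for this illustration and are not part of Coalent.

```python
# Figures copied from the results table: (accuracy, stale_rate, cost_tokens).
RESULTS = {
    "NaiveRAG": (1.00, 0.00, 70),
    "StaleCache": (0.80, 0.20, 35),
    "Coalent": (1.00, 0.00, 49),
}


def saving_vs_naive(name: str) -> float:
    """Fraction of NaiveRAG's token cost that the named system avoids."""
    baseline = RESULTS["NaiveRAG"][2]
    return (baseline - RESULTS[name][2]) / baseline


for name in RESULTS:
    print(f"{name:<10} saves {saving_vs_naive(name):.0%} vs NaiveRAG")
```

Coalent avoids 21 of NaiveRAG's 70 tokens (30%), while StaleCache avoids 35 of 70 (50%) and pays for that extra saving with its 0.20 stale-rate.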

Scope, honestly. This is a deterministic mechanism benchmark with no LLM in the loop: it demonstrates the freshness/cost behaviour. It does not measure answer quality against GraphRAG on a real dataset; that's a separate, model-backed benchmark. We don't claim what we haven't measured.
