Traditional RAG does retrieval — dump the top-k and hope the answer is in the noise. Coalent does context engineering — the minimum sufficient signal for the decision, with the raw on tap.

✦

Key idea. get() returns ctx.context — the decision-relevant slice for this query — while ctx.evidence / ctx.raw_text keep the full source reachable. Minimal payload, nothing lost.

The coverage gate (auto-escalation)

Every read scores how well the cached unit covers the query. A hit that under-covers automatically pulls fresh raw for that query — no LLM call, no manual signal:

ctx = cache.get("how many leave days for a 5-year employee?")
ctx.coverage     # 0.0–1.0: how well the unit covers the query
ctx.escalated    # True if it had to fetch fresh raw to cover the question

So a cached unit that's broadly right but missing a specific number doesn't serve a thin answer — it escalates to fetch that detail. The specifics are always there when the model needs them.

New in v0.5 Builds can also widen instead of escalating later: with widen_chunks=N, a miss-triggered build reads up to N chunks of the dominant source (via a duck-typed retriever.widen(artifact_id, limit=) or a BYO source_fetcher) rather than just the retrieval keyhole — so the unit covers later questions instead of running thin on them. It never fires at ingest; widen_on_admission extends it to thin-coverage admission rebuilds. In the cold-start pilot, widened units read a median 23 chunks of their source vs 2 for keyhole builds, and rebuild churn fell 460 → 31.

Tuning coverage — and entailment-grade precision

New in v0.3 The default coverage check is cosine over the unit's per-claim embeddings — cheap, no extra model call. It's tunable, and for hard cases it's pluggable:

SemanticCache(
    retriever, synth, embedder=OpenAIEmbedder(),
    coverage_floor=0.3,                # below this -> escalate (auto-derived per embedder by default)
    enable_coverage_escalation=True,   # the RAG-floor safety net (on by default)
    coverage_scorer=my_entailment,     # OPTIONAL: a containment check cosine can't do
    coverage_ceiling=0.65,             # two-tier: consult the scorer only on borderline queries
)

Cosine measures similarity, not containment — so a query that's only topically adjacent to a unit (e.g. "sick days" vs an annual-leave unit) can score "covered" yet lack the specific fact. For containment-grade accuracy, plug a coverage_scorer — a (query, understanding) -> float callable backed by a cross-encoder, an NLI model, or a one-token LLM yes/no. coverage_ceiling keeps it cheap by consulting the scorer only in the ambiguous band. The default stays pure cosine; the scorer is entirely opt-in.

cache.stats() reports hit_rate and escalation_rate — your live signal for whether coverage is too loose (silent gaps) or too strict (escalating to raw too often).

Minimum-context projection

ctx.context is a compact, query-shaped payload:

ctx.context["understanding"]   # summary + only the query-relevant claims/facts
ctx.context["raw"]             # raw included only when needed (see strategies)

Irrelevant claims and facts are trimmed for this query — less noise to the LLM, which is both cheaper and higher quality.

Strategies

Choose how much raw rides along (the full raw is always reachable regardless):

from coalent import ContextStrategy

SemanticCache(retriever, synth, strategy=ContextStrategy.CONTEXT_FIRST)  # default

Strategy	`ctx.context["raw"]`
`CONTEXT_FIRST` (default)	raw only when the read escalated
`CONTEXT_RAW`	raw always
`CONTEXT_ONLY`	never (understanding only)

Override per call: cache.get(query, strategy=ContextStrategy.CONTEXT_RAW).

Pool serving — a preview of the v0.6 read path

New in v0.5 serve="pool" (default "unit") is an experimental alternative to unit-anchored serving: instead of projecting one matched unit, the cache serves the token-budgeted, globally-ranked pool of fresh claims across all units.

cache = SemanticCache(
    retriever, synth, embedder=OpenAIEmbedder(),
    serve="pool",                          # experimental v0.6 read-path preview
    serve_budget=600,                      # the cost dial — pool token budget
    pool_header=lambda u: my_title(u),     # per-unit header line: "[title | source | date]"
)

ctx = cache.get("what changed in the Enterprise plan this quarter?")
ctx.context["pool"]                        # the budgeted, ranked claim pool — what renderers read

serve_budget (default 600) is the token budget — the one cost dial.
pool_header supplies the header line for each unit's group of claims.
Freshness holds at claim granularity: a stale unit's claims are masked the moment a source changes.

Measured on real news (MultiHopRAG, pre-registered, held-out n=605), pool serving statistically ties the best chunk-RAG arm's accuracy at 0.79x its tokens — see the benchmark.

get() folds in related cognition units — ones sharing an entity or a source with the match, ranked by relevance to your query:

ctx = cache.get("our leave policy", related=3)
for r in ctx.related:
    r.unit_id, r.relation, r.understanding   # "shared_entity" | "shared_source"

This is light, lazy multi-hop — enough for cross-document reuse, not a graph engine.

Agent affordances

For agent loops, escalation is also explicit:

cache.drill(ctx.unit_id)            # the full raw evidence behind a unit
cache.widen("our leave policy")     # a fresh retrieval for a query

get() & Result — every field on the result.
The Synthesizer — the structured understanding that context is projected from.

Context intelligence