coalent ● Engineering · Context infrastructure

Your AI's context is going stale — and you're paying for it twice.

RAG re-reads your sources on every call, and the moment one changes, every answer cached from it is silently wrong. Here's the trade-off nobody admits to, and how we removed it.

Nisarg Pujara
AI Architect · Creator of Coalent · VectorLink Labs

Every team building on LLMs eventually hits the same wall. Your agent is only as good as the context you feed it — and feeding it good context is shockingly expensive and shockingly fragile.

Think about what a normal RAG call actually does. A question comes in. You embed it, search your vector store, pull back a handful of chunks, and stuff that raw text into the prompt. The model reads all of it, every single time. Ask a similar question a minute later? You do the whole thing again — re-retrieving, re-reading, re-paying. And the day someone edits that source document, every answer you've cached from it is quietly, confidently wrong.

So you're forced into a bad choice: pay to rebuild context constantly, or serve stale answers. Most teams quietly do both.

The two-column problem

Put the two approaches side by side and the waste becomes obvious. Traditional RAG does full work on every query and has no idea when its context has expired. Everything in the right column is what it's missing.

[Figure: two pipelines, side by side.]
Traditional RAG (full work, every query): Query → Retrieve raw chunks → LLM reads full raw context → Answer. Re-reads and re-pays every call. Stale the moment a source changes.
Coalent (understanding, reused and kept fresh): Query → Cache hit, reuse understanding → Your LLM reads the understanding → Answer. Skips re-retrieval and re-synthesis, ~66% fewer tokens. The LLM still writes every answer, always fresh.
Coalent caches the understanding it feeds the model — never the answer. Your LLM still writes every response; Coalent just skips the re-retrieval and re-synthesis.

The shift: cache understanding, not chunks

The mistake most caches make is caching the wrong thing. A normal cache stores the raw chunks — so on a hit it replays the same big block of text into the model. No savings, and still no idea when it's stale.

Coalent caches the understanding the model already built from those sources — a compact, decision-ready briefing — keyed by what the query means. Ask again, or from a different agent, and it's a warm hit: no retrieval, no re-synthesis, a fraction of the tokens. The raw evidence is retained underneath, so you never lose the detail — it can fall back to the full source whenever an answer needs it.
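Here's a minimal sketch of that idea. It is not Coalent's actual API: `UnderstandingCache`, `meaning_key`, `Unit`, and the fake retriever and synthesizer are all hypothetical names, the sample policy text is made up, and the "meaning" key is a crude stand-in (sorted content words) for whatever semantic matching the real library does.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stop-word list; a real system would more likely compare embeddings.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "our", "for", "of"}

def meaning_key(query: str) -> str:
    """Crude stand-in for a semantic key: sorted, de-duplicated content words."""
    cleaned = "".join(c if c.isalnum() else " " for c in query.lower())
    return " ".join(sorted(set(cleaned.split()) - STOP_WORDS))

@dataclass
class Unit:
    briefing: str        # compact understanding handed to the model
    evidence: list[str]  # raw chunks retained underneath

class UnderstandingCache:
    def __init__(self, retrieve: Callable[[str], list[str]],
                 synthesize: Callable[[str, list[str]], str]) -> None:
        self._retrieve = retrieve
        self._synthesize = synthesize
        self._units: dict[str, Unit] = {}
        self.misses = 0

    def context_for(self, query: str, need_full_detail: bool = False) -> str:
        key = meaning_key(query)
        unit = self._units.get(key)
        if unit is None:  # cold read: retrieve and synthesize once
            self.misses += 1
            chunks = self._retrieve(query)
            unit = Unit(self._synthesize(query, chunks), chunks)
            self._units[key] = unit
        # Warm hit returns the briefing; the raw evidence is there on request.
        return "\n".join(unit.evidence) if need_full_detail else unit.briefing

# Made-up stand-ins for a vector store and a synthesizer model.
def fake_retrieve(query: str) -> list[str]:
    return ["Employees accrue 1.5 days of leave per month.",
            "Unused leave carries over up to 10 days."]

def fake_synthesize(query: str, chunks: list[str]) -> str:
    return "Leave: 1.5 days/month, carry-over capped at 10 days."

cache = UnderstandingCache(fake_retrieve, fake_synthesize)
first = cache.context_for("What is the leave policy?")
second = cache.context_for("the leave policy")  # same meaning key: warm hit
full = cache.context_for("leave policy", need_full_detail=True)
```

Three reads, one retrieval. The second and third calls never touch the retriever or the synthesizer, and the last one shows the fallback to raw evidence.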

One important clarification: Coalent caches the context, not the answer. Your model still composes every response — using the conversation so far, the tools it's calling, and the exact question at hand. Coalent simply hands it a fresher, tighter context and skips the retrieval and re-synthesis it would otherwise repeat. The synthesizer's job is understanding; the answer is always the LLM's.
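To make that split concrete, here's a small sketch under stated assumptions: `answer`, `get_context`, and `fake_llm` are hypothetical stand-ins, not Coalent functions, and the refund text is invented. The context lookup may be a warm hit, but the model call is never cached.

```python
from typing import Callable

def answer(question: str, history: list[str],
           get_context: Callable[[str], str],
           llm: Callable[[str], str]) -> str:
    """Reuse cached context, but always ask the model for the answer."""
    context = get_context(question)  # may be a warm hit
    prompt = "\n".join(["Context:", context,
                        "Conversation:", *history,
                        "Question: " + question])
    return llm(prompt)  # composed fresh on every call

calls = {"context_builds": 0, "llm": 0}
_cached: dict[str, str] = {}

def get_context(question: str) -> str:
    key = "refund-policy"  # pretend both questions resolve to the same meaning
    if key not in _cached:
        calls["context_builds"] += 1
        _cached[key] = "Refunds are allowed within 30 days of purchase."
    return _cached[key]

def fake_llm(prompt: str) -> str:
    calls["llm"] += 1
    return f"answer #{calls['llm']} ({len(prompt)} prompt chars)"

history: list[str] = []
a1 = answer("Can I get a refund?", history, get_context, fake_llm)
history.append("user: Can I get a refund? / assistant: " + a1)
a2 = answer("What about after 45 days?", history, get_context, fake_llm)
```

The context is built once and the model runs twice, each time with a different conversation and question in the prompt.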

The part that actually matters: freshness by provenance

Reuse is nice. Reuse that's wrong is dangerous. The reason most teams don't aggressively cache LLM context is the fear of serving yesterday's answer.

So every unit of understanding remembers exactly which sources it was built from — its provenance. When a source changes, Coalent invalidates only the units that actually used it. Surgically. Everything else stays warm.
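The bookkeeping behind that can be sketched in a few lines. This is an illustration, not Coalent's implementation: `ProvenanceIndex` and its methods are hypothetical names, and the unit ids are shorthand. The source identifiers are the ones from the diagram in this article.

```python
class ProvenanceIndex:
    """Tracks which sources each unit was built from, and which units are dirty."""

    def __init__(self) -> None:
        self._sources_of: dict[str, set[str]] = {}   # unit -> sources
        self._units_using: dict[str, set[str]] = {}  # source -> units
        self._dirty: set[str] = set()

    def record(self, unit_id: str, sources: set[str]) -> None:
        """Register a unit (or re-register it after a rebuild) and mark it fresh."""
        for old in self._sources_of.get(unit_id, set()):
            self._units_using[old].discard(unit_id)
        self._sources_of[unit_id] = set(sources)
        for src in sources:
            self._units_using.setdefault(src, set()).add(unit_id)
        self._dirty.discard(unit_id)

    def source_changed(self, source: str) -> set[str]:
        """Mark dirty only the units built from this source, and return them."""
        hit = set(self._units_using.get(source, set()))
        self._dirty |= hit
        return hit

    def is_fresh(self, unit_id: str) -> bool:
        return unit_id in self._sources_of and unit_id not in self._dirty

index = ProvenanceIndex()
index.record("leave-policy", {"confluence:hr-policy"})
index.record("payment-errors", {"github:payments.py"})
index.record("release-notes", {"deploy:release-42"})

dirtied = index.source_changed("github:payments.py")
```

After the change, only `payment-errors` is dirty. A later `record` call for that unit stands in for the rebuild on next read and marks it fresh again.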

[Figure: sources mapped to cached understanding.]
confluence:hr-policy → Leave-policy understanding: fresh ✓
github:payments.py (changed) → Payment-error understanding: dirty, rebuilds on next read
deploy:release-42 → Release-notes understanding: fresh ✓
A change touches only the unit that used it. The rest stay warm.
Zero false negatives on source-driven staleness, by construction, and zero needless rebuilds.

Does compressing to "understanding" cost you accuracy?

This is the obvious worry: if you answer from a compressed briefing instead of the full text, don't you lose the details — the specific numbers people actually need? We tested it honestly, with real LLMs and an independent grader, on number-dense documents.

Full-context RAG: 283 tok
Normal cache (raw): 283 tok, stale
Coalent: 96 tok
Context tokens sent to the model per read (lower is better).

The honest result. Coalent matched full-context RAG's answer accuracy — numbers included — while sending ~66% fewer context tokens (up to 75% on large documents) and staying 0% stale after source changes. A normal raw cache saved nothing on tokens and went stale. That's cost optimization without trading away quality.
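The ~66% figure follows directly from the two per-read token counts reported above. A quick check of the arithmetic:

```python
full_context_tokens = 283  # full-context RAG, per read (figure from this article)
coalent_tokens = 96        # Coalent briefing, per read (figure from this article)

saved = full_context_tokens - coalent_tokens
reduction = saved / full_context_tokens

print(f"{saved} tokens saved per read, {reduction:.1%} reduction")
# prints: 187 tokens saved per read, 66.1% reduction
```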

It sits on top of what you already run

Coalent isn't another retriever or a rip-and-replace. It's a thin freshness-and-reuse layer above retrieval — bring any vector DB, hybrid search, GraphRAG, tools, or APIs. It ships drop-in helpers for LangGraph and MCP, durable SQLite/Redis stores, and a redis-cli-style CLI to inspect exactly what's cached. Pure Python, Apache-2.0, zero required dependencies.
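To give a feel for what a durable store with provenance might hold, here's a rough sketch using only Python's standard-library `sqlite3`. The schema, table names, and helper functions are assumptions made up for illustration. They are not Coalent's actual storage format or CLI.

```python
import sqlite3

# Hypothetical schema: one row per unit, one row per (unit, source) link.
SCHEMA = """
CREATE TABLE IF NOT EXISTS units (
    unit_id  TEXT PRIMARY KEY,
    briefing TEXT NOT NULL,
    dirty    INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS provenance (
    unit_id TEXT NOT NULL,
    source  TEXT NOT NULL,
    PRIMARY KEY (unit_id, source)
);
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def put_unit(conn: sqlite3.Connection, unit_id: str,
             briefing: str, sources: set[str]) -> None:
    with conn:  # one transaction: the unit row plus its provenance rows
        conn.execute(
            "INSERT OR REPLACE INTO units (unit_id, briefing, dirty) "
            "VALUES (?, ?, 0)", (unit_id, briefing))
        conn.execute("DELETE FROM provenance WHERE unit_id = ?", (unit_id,))
        conn.executemany(
            "INSERT INTO provenance (unit_id, source) VALUES (?, ?)",
            [(unit_id, s) for s in sources])

def mark_source_changed(conn: sqlite3.Connection, source: str) -> int:
    """Flag only the units built from this source; return how many were flagged."""
    with conn:
        cur = conn.execute(
            "UPDATE units SET dirty = 1 WHERE unit_id IN "
            "(SELECT unit_id FROM provenance WHERE source = ?)", (source,))
    return cur.rowcount

def list_units(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Inspect what's cached: (unit_id, dirty flag), ordered by id."""
    return conn.execute(
        "SELECT unit_id, dirty FROM units ORDER BY unit_id").fetchall()

conn = open_store()
put_unit(conn, "leave-policy", "placeholder briefing", {"confluence:hr-policy"})
put_unit(conn, "payment-errors", "placeholder briefing", {"github:payments.py"})
flagged = mark_source_changed(conn, "github:payments.py")
snapshot = list_units(conn)
```

Pass a file path instead of `:memory:` and the units, their provenance, and their dirty flags survive a restart.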

Give your AI living understanding.
pip install coalent

Open source. If it's useful, a ⭐ on GitHub genuinely helps.