Example — With a vector cache

Compose a retrieval cache inside your retrieve() — serve from a vector/result cache when you have it, fall back to a live vector query otherwise.

Coalent caches understanding. Your retriever can also cache retrieval — the two layers compose cleanly. A common pattern: inside retrieve(), check a fast cache first (an embedding/result cache or a warm vector mirror), and fall back to the real vector query on a miss.

The artifact_id is derived from the source document, so it's identical whether a chunk came from the cache or a live query — invalidation stays consistent across both paths.

A caching retriever

There's caching logic wrapped around the lookup, so we implement the Retriever interface directly — it's just a class with a retrieve method, no base class needed (Retriever is a structural protocol).

import time
from coalent import Chunk

class CachingRetriever:
    """Serve from a result cache when present, else do a live vector query."""

    def __init__(self, vector_client, embed, *, ttl=60):
        self._vector = vector_client
        self._embed = embed
        self._ttl = ttl
        self._cache: dict[str, tuple[float, list[Chunk]]] = {}

    def retrieve(self, query, *, namespace=None):
        key = f"{namespace}:{query}"
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]                              # served from the vector/result cache

        chunks = self._live_query(query)              # fall back to the real query
        self._cache[key] = (time.monotonic(), chunks)
        return chunks

    def _live_query(self, query):
        response = self._vector.query_points(
            collection_name="docs", query=self._embed(query), limit=6, with_payload=True,
        )
        return [
            Chunk(
                artifact_id=hit.payload["artifact_id"],   # dynamic — from the document
                text=hit.payload["text"],
                version=str(hit.payload.get("version", "")),
            )
            for hit in response.points
        ]

Wire it

from coalent import SemanticCache, LLMSynthesizer, OpenAIProvider

cache = SemanticCache(
    CachingRetriever(qdrant, embed, ttl=120),
    LLMSynthesizer(OpenAIProvider()),
)

Two caches, one coherent story

  • Your retrieval cache (inside retrieve) saves embedding + vector-search cost on a miss path.
  • Coalent's cognition cache saves synthesis cost and keeps understanding fresh by provenance.

When a document changes, invalidate both with the same id:

def on_document_changed(doc):
    artifact_id = f"confluence:{doc.id}"
    retriever._cache.clear()                 # or evict just this doc's queries
    cache.source_changed(artifact_id, text=doc.body)
i

Keep the retrieval cache's TTL short and let Coalent own correctness via provenance — the retrieval cache is a speed optimization, the cognition cache is the freshness guarantee.

Next