Semantic cache

The semantic cache goes beyond exact matching: it uses embeddings to find prompts with the same semantic meaning, even when the wording is different. For example, "What's the weather in Paris?" and "Tell me the current weather for Paris" would match semantically.

How it works

The semantic cache uses a two-stage lookup:

1. Exact match (fast path)

First, a SHA-512 hash of the messages is checked against an in-memory cache, exactly like the simple cache. If an exact match is found, the response is returned immediately.

2. Semantic match
If no exact match is found, the cache computes an embedding of the user's messages using a local MiniLM model (AllMiniLmL6V2, running via ONNX — no external API call needed). This embedding is compared against all stored embeddings using cosine similarity.

If a stored entry has a similarity score above the configured score threshold, the cached response is returned.
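The two-stage lookup can be sketched in Python. This is an illustrative stand-in for the real implementation, not the extension's actual code: `SemanticCache` and the `embed` function are hypothetical names, the toy embedding replaces the real MiniLM model, and TTL/eviction handling is omitted.

```python
import hashlib
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, score_threshold=0.8):
        self.embed = embed        # embedding function (stand-in for MiniLM)
        self.threshold = score_threshold
        self.exact = {}           # sha512(messages) -> response
        self.vectors = []         # (embedding, response) pairs

    def _key(self, messages):
        return hashlib.sha512(messages.encode("utf-8")).hexdigest()

    def put(self, messages, response):
        self.exact[self._key(messages)] = response
        self.vectors.append((self.embed(messages), response))

    def get(self, messages):
        # Stage 1: exact hash match (fast path)
        hit = self.exact.get(self._key(messages))
        if hit is not None:
            return hit
        # Stage 2: cosine similarity against all stored embeddings
        query = self.embed(messages)
        best_score, best_resp = max(
            ((cosine(query, vec), resp) for vec, resp in self.vectors),
            default=(0.0, None),
        )
        return best_resp if best_score >= self.threshold else None
```

A rephrased prompt that embeds close to a stored one is served from the cache; an unrelated prompt falls through to the provider.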

Configuration

The cache is configured on the LLM Provider entity in the cache section:

{
  "cache": {
    "strategy": "semantic",
    "ttl": 300000,
    "score": 0.85
  }
}
Parameter      Type          Default         Description
strategy       string        "none"          Set to "semantic" to enable the semantic cache
ttl            number (ms)   86400000 (24h)  Time-to-live for cached entries in milliseconds
score          number (0-1)  0.8             Minimum cosine similarity score for a semantic match. Higher values require closer matches.
embedding_ref  string        (none)          Optional reference to an Embedding Model entity to use instead of the built-in AllMiniLmL6V2
redis_url      string        (none)          Optional Redis URL to use Redis Stack as the cache and vector search backend

Score tuning

Score    Behavior
0.95+    Very strict: only near-identical prompts match
0.85     Good default: catches paraphrases while avoiding false matches
0.70     Loose: broader matching, higher risk of incorrect cache hits
< 0.60   Too loose: likely to return irrelevant cached responses
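To see how the threshold gates a match, here is a toy example. The vectors are made up for illustration; real scores come from the embedding model:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

original   = [0.9, 0.1, 0.4]
paraphrase = [0.8, 0.2, 0.5]   # close in meaning -> high similarity
unrelated  = [0.1, 0.9, 0.0]   # different topic -> low similarity

assert cosine(original, paraphrase) > 0.85   # hit at the default threshold
assert cosine(original, unrelated) < 0.60    # miss even at a loose threshold
```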

Embedding model

By default, the semantic cache uses the AllMiniLmL6V2 sentence transformer model (384 dimensions), which runs locally via ONNX runtime. This means:

  • No external API call is needed for embedding computation
  • No additional cost per cache lookup
  • Low latency (typically < 10ms for embedding)
  • The model is included in the extension — no separate setup required

Custom embedding model

You can use any embedding model registered in the extension instead of the built-in AllMiniLmL6V2. Set embedding_ref to the id of an Embedding Model entity:

{
"cache": {
"strategy": "semantic",
"ttl": 300000,
"score": 0.85,
"embedding_ref": "embedding-model_xxxxx"
}
}

This lets you use a higher-quality model (OpenAI, Mistral, Cohere, etc.) for cache matching. If the referenced model is unavailable or returns an error, the cache falls back to the built-in AllMiniLmL6V2 automatically.

Only messages with role "user" are used for semantic matching.

Redis-backed semantic cache

By default, the semantic cache uses in-memory storage (Caffeine + LangChain4j InMemoryEmbeddingStore). To share the cache across a cluster, set redis_url to a Redis Stack URI:

{
"cache": {
"strategy": "semantic",
"ttl": 300000,
"score": 0.85,
"redis_url": "redis://localhost:6379"
}
}

When Redis is configured:

  • Embeddings are stored as binary vectors in Redis HASH keys and indexed with RediSearch for KNN similarity search
  • Cached responses are stored as JSON strings with automatic TTL expiration
  • Embedding HASH keys also expire via PEXPIRE, automatically removing them from the search index
  • The RediSearch index is created automatically on first use with the correct vector dimensions
  • Redis connections are pooled and shared across providers

Requires Redis Stack (the redis-stack or redis-stack-server Docker image) for the RediSearch vector search module.
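The Redis layout described above can be sketched with redis-cli commands. The index name, key prefixes, and field names here are illustrative, not the extension's actual naming, and the binary vector blobs are placeholders:

```
# Create a RediSearch index over HASH keys holding 384-dim FLOAT32 vectors
FT.CREATE cache-idx ON HASH PREFIX 1 "cache:vec:" SCHEMA embedding VECTOR FLAT 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE

# Store an embedding as a binary vector; PEXPIRE drops it from the index after the TTL (ms)
HSET cache:vec:abc123 embedding "<384 float32 values as a binary blob>"
PEXPIRE cache:vec:abc123 300000

# Store the cached response as a JSON string with the same TTL
SET cache:resp:abc123 "{\"response\": \"...\"}" PX 300000

# KNN query for the nearest stored embedding (vector queries require DIALECT 2)
FT.SEARCH cache-idx "*=>[KNN 1 @embedding $vec AS score]" PARAMS 2 vec "<query vector blob>" DIALECT 2 RETURN 1 score
```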

Cache behavior

  • In-memory mode: Maximum 5000 entries per provider. When a cached entry expires, its embedding is automatically cleaned up from the vector store.
  • Redis mode: No entry limit (limited by Redis memory). TTL is managed by Redis natively.
  • Cached responses return zero token usage (no cost incurred)
  • Both blocking and streaming responses are cached

Response headers and metadata

Same as the simple cache: the X-Cache-Status, X-Cache-Key, X-Cache-Ttl, and Age headers, plus a cache object in the response metadata.
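For illustration, a cache hit might look like the following. The header values shown here are hypothetical examples, not captured from a real response:

```
HTTP/1.1 200 OK
X-Cache-Status: Hit
X-Cache-Key: 9f2c...
X-Cache-Ttl: 300000
Age: 42
```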

When to use semantic cache

  • Customer support: Users ask the same questions with different phrasing
  • Search-like applications: Queries with varied wording but same intent
  • Multi-language contexts: Similar questions in slightly different formulations

For strict exact-match caching, use the simple cache instead. You can also combine both strategies by setting "strategy": "simple,semantic" — the simple cache is checked first (faster), then the semantic cache if no exact match is found.
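For example, enabling both strategies on the provider looks like this (ttl and score values reused from the earlier examples):

```
{
  "cache": {
    "strategy": "simple,semantic",
    "ttl": 300000,
    "score": 0.85
  }
}
```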