Overview
Otoroshi LLM Extension provides a comprehensive set of cost optimization features to help you monitor, control, and reduce your LLM spending.
Features
- Cost tracking: Monitor the cost of every LLM request in real-time, with per-model pricing based on the LiteLLM price dictionary. Costs can be embedded in API responses and audit events.
- Budgets: Define spending limits (in USD or tokens) per consumer, provider, model, or any other scope. When a budget is exceeded, it can either block further requests or simply emit alerts.
- Token quotas: Rate-limit LLM usage by token count per time window, grouped by any attribute (API key, user, route, custom expression).
- Simple cache: In-memory cache based on exact prompt matching. Cached responses return zero token usage, avoiding provider costs entirely.
- Semantic cache: Embedding-based cache that matches semantically similar prompts, even when the wording differs. Uses a local MiniLM model for embeddings.
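To make the cost-tracking idea concrete, here is a minimal sketch of how a per-request cost can be derived from token counts and a per-model price dictionary. The model names and prices below are illustrative placeholders, not actual LiteLLM values:

```python
# Sketch of per-request cost tracking against a price dictionary.
# Model names and prices are illustrative placeholders, not real LiteLLM data.
PRICES = {
    # model: (input cost per token, output cost per token), in USD
    "example-small-model": (0.15e-6, 0.60e-6),
    "example-large-model": (0.59e-6, 0.79e-6),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request, as a cost tracker might compute it."""
    in_price, out_price = PRICES[model]
    return prompt_tokens * in_price + completion_tokens * out_price

cost = request_cost("example-small-model", prompt_tokens=1200, completion_tokens=300)
print(f"{cost:.6f}")  # 1200 * 0.15e-6 + 300 * 0.60e-6 = 0.000360
```

In practice, the token counts come from the provider's usage report on each response, and the price dictionary keys on the real model identifier.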
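A token quota of the kind described above can be sketched as a fixed-window counter keyed by an arbitrary grouping attribute. This is a simplified model of the technique, not Otoroshi's implementation; the limit and window values are arbitrary:

```python
import time
from collections import defaultdict

# Sketch of a fixed-window token quota, grouped by an arbitrary key
# (API key, user, route, ...). Limits and window length are illustrative.
class TokenQuota:
    def __init__(self, max_tokens: int, window_seconds: int):
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.usage = defaultdict(int)  # (group, window index) -> tokens consumed

    def allow(self, group: str, tokens: int, now: float = None) -> bool:
        """Count `tokens` against the current window; False means 'reject'."""
        t = now if now is not None else time.time()
        bucket = (group, int(t // self.window))
        if self.usage[bucket] + tokens > self.max_tokens:
            return False
        self.usage[bucket] += tokens
        return True

quota = TokenQuota(max_tokens=1000, window_seconds=60)
print(quota.allow("apikey-1", 800, now=0))   # True: 800 of 1000 used
print(quota.allow("apikey-1", 300, now=10))  # False: would exceed 1000 in window
print(quota.allow("apikey-1", 300, now=61))  # True: a new window has started
```

Grouping by "custom expression" just means the `group` key is computed from the request instead of being a fixed attribute.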
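The semantic cache's matching step can be illustrated with cosine similarity over prompt embeddings. The tiny hand-made vectors below stand in for real MiniLM embeddings, and the threshold value is an arbitrary assumption:

```python
import math

# Sketch of semantic-cache lookup: a cached answer is reused when the cosine
# similarity between prompt embeddings exceeds a threshold. The toy vectors
# below stand in for real MiniLM embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # prompt -> vector (MiniLM in the real thing)
        self.threshold = threshold
        self.entries = []         # list of (vector, cached response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best is not None and cosine(v, best[0]) >= self.threshold:
            return best[1]
        return None

# Toy embedding table: two phrasings of the same question get close vectors.
VECS = {
    "what is the capital of france": [0.90, 0.10, 0.00],
    "france's capital city?":        [0.88, 0.15, 0.02],
    "how do i bake bread":           [0.00, 0.20, 0.95],
}
cache = SemanticCache(lambda p: VECS[p])
cache.put("what is the capital of france", "Paris")
print(cache.get("france's capital city?"))  # hit despite the different wording
print(cache.get("how do i bake bread"))     # None: similarity below threshold
```

An exact-match cache would miss on the rephrased prompt; the embedding comparison is what lets the semantic cache serve it from memory.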
How caching saves costs
When a cache (simple or semantic) returns a hit, the response is served directly from memory. The request never reaches the LLM provider, so:
- No tokens are consumed (usage is reported as zero)
- No cost is incurred
- Response time is near-instant
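The hit path above can be sketched as follows. `call_provider` is a stand-in for the real (billed) LLM call, not an Otoroshi API; the point is that a hit never reaches it and reports zero usage:

```python
# Sketch of an exact-match cache that short-circuits the provider call on a
# hit and reports zero token usage. `call_provider` is a placeholder for the
# real (billed) upstream call, not a real Otoroshi function.
cache = {}

def call_provider(prompt):
    # Placeholder for the actual LLM request; tokens here would be billed.
    return {"text": f"answer to: {prompt}",
            "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 42}}

def chat(prompt):
    if prompt in cache:
        hit = dict(cache[prompt])
        # Served from memory: no provider call, usage reported as zero.
        hit["usage"] = {"prompt_tokens": 0, "completion_tokens": 0}
        return hit
    resp = call_provider(prompt)
    cache[prompt] = resp
    return resp

first = chat("hello there")   # miss: goes to the provider, tokens consumed
second = chat("hello there")  # hit: zero usage, zero cost
print(second["usage"])
```

The zero-usage report matters downstream: cost tracking, budgets, and token quotas all see a free request, so cache hits never count against spending limits.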
You can activate caching on any provider via the cache configuration in the provider entity.