Overview
Otoroshi LLM Extension provides a comprehensive set of cost optimization features to help you monitor, control, and reduce your LLM spending.
Features
- Cost tracking: Monitor the cost of every LLM request in real-time, with per-model pricing based on the LiteLLM price dictionary. Costs can be embedded in API responses and audit events.
- Budgets: Define spending limits (in USD or tokens) per consumer, provider, model, or any other scope. When a budget is exceeded, it can either block further requests or simply emit alerts.
- Token quotas: Rate-limit LLM usage by token count per time window, grouped by any attribute (API key, user, route, custom expression).
- Simple cache: In-memory cache based on exact prompt matching. Cached responses return zero token usage, avoiding provider costs entirely.
- Semantic cache: Embedding-based cache that matches semantically similar prompts, even when the wording differs. Uses a local MiniLM model for embeddings.
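To make the cost-tracking idea concrete, here is a minimal sketch of how a per-request cost can be derived from token counts and a per-model price dictionary. The model names and prices below are illustrative placeholders, not actual LiteLLM values:

```python
# Sketch of per-request cost tracking against a price dictionary.
# Model names and prices are illustrative placeholders, not real LiteLLM data.
PRICES = {
    # model: (input cost per token, output cost per token), in USD
    "example-small-model": (0.15e-6, 0.60e-6),
    "example-large-model": (0.59e-6, 0.79e-6),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request, as a cost tracker might compute it."""
    in_price, out_price = PRICES[model]
    return prompt_tokens * in_price + completion_tokens * out_price

cost = request_cost("example-small-model", prompt_tokens=1200, completion_tokens=300)
print(f"{cost:.6f}")  # 1200 * 0.15e-6 + 300 * 0.60e-6 = 0.000360
```

In practice, the token counts come from the provider's usage report on each response, and the price dictionary keys on the real model identifier.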
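A token quota of the kind described above can be sketched as a fixed-window counter keyed by an arbitrary grouping attribute. This is a simplified model of the technique, not Otoroshi's implementation; the limit and window values are arbitrary:

```python
import time
from collections import defaultdict

# Sketch of a fixed-window token quota, grouped by an arbitrary key
# (API key, user, route, ...). Limits and window length are illustrative.
class TokenQuota:
    def __init__(self, max_tokens: int, window_seconds: int):
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.usage = defaultdict(int)  # (group, window index) -> tokens consumed

    def allow(self, group: str, tokens: int, now: float = None) -> bool:
        """Count `tokens` against the current window; False means 'reject'."""
        t = now if now is not None else time.time()
        bucket = (group, int(t // self.window))
        if self.usage[bucket] + tokens > self.max_tokens:
            return False
        self.usage[bucket] += tokens
        return True

quota = TokenQuota(max_tokens=1000, window_seconds=60)
print(quota.allow("apikey-1", 800, now=0))   # True: 800 of 1000 used
print(quota.allow("apikey-1", 300, now=10))  # False: would exceed 1000 in window
print(quota.allow("apikey-1", 300, now=61))  # True: a new window has started
```

Grouping by "custom expression" just means the `group` key is computed from the request instead of being a fixed attribute.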
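The semantic cache's matching step can be illustrated with cosine similarity over prompt embeddings. The tiny hand-made vectors below stand in for real MiniLM embeddings, and the threshold value is an arbitrary assumption:

```python
import math

# Sketch of semantic-cache lookup: a cached answer is reused when the cosine
# similarity between prompt embeddings exceeds a threshold. The toy vectors
# below stand in for real MiniLM embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # prompt -> vector (MiniLM in the real thing)
        self.threshold = threshold
        self.entries = []         # list of (vector, cached response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best is not None and cosine(v, best[0]) >= self.threshold:
            return best[1]
        return None

# Toy embedding table: two phrasings of the same question get close vectors.
VECS = {
    "what is the capital of france": [0.90, 0.10, 0.00],
    "france's capital city?":        [0.88, 0.15, 0.02],
    "how do i bake bread":           [0.00, 0.20, 0.95],
}
cache = SemanticCache(lambda p: VECS[p])
cache.put("what is the capital of france", "Paris")
print(cache.get("france's capital city?"))  # hit despite the different wording
print(cache.get("how do i bake bread"))     # None: similarity below threshold
```

An exact-match cache would miss on the rephrased prompt; the embedding comparison is what lets the semantic cache serve it from memory.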
How caching saves costs
When a cache (simple or semantic) returns a hit, the response is served directly from memory. The request never reaches the LLM provider, so:
- No tokens are consumed (usage is reported as zero)
- No cost is incurred
- Response time is near-instant
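The hit path above can be sketched as follows. `call_provider` is a stand-in for the real (billed) LLM call, not an Otoroshi API; the point is that a hit never reaches it and reports zero usage:

```python
# Sketch of an exact-match cache that short-circuits the provider call on a
# hit and reports zero token usage. `call_provider` is a placeholder for the
# real (billed) upstream call, not a real Otoroshi function.
cache = {}

def call_provider(prompt):
    # Placeholder for the actual LLM request; tokens here would be billed.
    return {"text": f"answer to: {prompt}",
            "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 42}}

def chat(prompt):
    if prompt in cache:
        hit = dict(cache[prompt])
        # Served from memory: no provider call, usage reported as zero.
        hit["usage"] = {"prompt_tokens": 0, "completion_tokens": 0}
        return hit
    resp = call_provider(prompt)
    cache[prompt] = resp
    return resp

first = chat("hello there")   # miss: goes to the provider, tokens consumed
second = chat("hello there")  # hit: zero usage, zero cost
print(second["usage"])
```

The zero-usage report matters downstream: cost tracking, budgets, and token quotas all see a free request, so cache hits never count against spending limits.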
You can activate caching on any provider via the cache configuration in the provider entity.