Overview

Otoroshi LLM Extension provides a comprehensive set of cost-optimization features that help you monitor, control, and reduce your LLM spending.

Features

  • Cost tracking: Monitor the cost of every LLM request in real-time, with per-model pricing based on the LiteLLM price dictionary. Costs can be embedded in API responses and audit events.
  • Budgets: Define spending limits (in USD or tokens) per consumer, provider, model, or any scope. Budgets can block requests when exceeded or emit alerts.
  • Token quotas: Rate-limit LLM usage by token count per time window, grouped by any attribute (API key, user, route, custom expression).
  • Simple cache: In-memory cache based on exact prompt matching. Cached responses return zero token usage, avoiding provider costs entirely.
  • Semantic cache: Embedding-based cache that matches semantically similar prompts, even when the wording differs. Uses a local MiniLM model for embeddings.
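To make the quota idea concrete, here is a minimal sketch of token-based rate limiting over a sliding time window, grouped by an arbitrary key (API key, user, route, etc.). This is an illustration of the concept, not Otoroshi's actual implementation; the class and parameter names are hypothetical.

```python
import time
from collections import defaultdict, deque

class TokenQuota:
    """Sliding-window token quota: rejects a request once a key has
    consumed more than `limit` tokens within the last `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # key -> deque of (timestamp, tokens)

    def allow(self, key: str, tokens: int) -> bool:
        now = time.monotonic()
        q = self.events[key]
        # Drop token usage that has fallen out of the window.
        while q and now - q[0][0] > self.window:
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.limit:
            return False  # over quota: block (or alert, depending on policy)
        q.append((now, tokens))
        return True

# 1000 tokens per 60-second window, per consumer key
quota = TokenQuota(limit=1000, window=60.0)
quota.allow("apikey-1", 800)  # accepted
quota.allow("apikey-1", 300)  # rejected: would exceed 1000 tokens in the window
```

The grouping key is whatever attribute you rate-limit by; a real gateway would evaluate it per request (e.g. from the API key or a custom expression).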

How caching saves costs

When a cache (simple or semantic) returns a hit, the response is served directly from memory. The request never reaches the LLM provider, so:

  • No tokens are consumed (usage is reported as zero)
  • No cost is incurred
  • Response time is near-instant
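The simple (exact-match) variant can be sketched as a lookup keyed by a hash of the model and the normalized prompt; on a hit, the stored response is returned with zero usage, so no cost is attributed. This is a toy illustration, not Otoroshi's internal code.

```python
import hashlib
import json

class SimpleCache:
    """Exact-match in-memory cache: identical (model, prompt) pairs
    hit the same entry, so the provider is never called again."""

    def __init__(self):
        self.store = {}

    def _key(self, model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt.strip()})
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, model: str, prompt: str):
        content = self.store.get(self._key(model, prompt))
        if content is None:
            return None
        # Cache hits report zero token usage: no provider cost was incurred.
        return {"content": content,
                "usage": {"prompt_tokens": 0, "completion_tokens": 0}}

    def put(self, model: str, prompt: str, content: str):
        self.store[self._key(model, prompt)] = content

cache = SimpleCache()
cache.put("gpt-4o", "What is Otoroshi?", "Otoroshi is a reverse proxy...")
hit = cache.get("gpt-4o", "What is Otoroshi?")  # served from memory
```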

You can activate caching on any provider via the cache configuration in the provider entity.
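The semantic variant matches on meaning rather than exact wording: prompts are embedded as vectors and a cached response is reused when cosine similarity exceeds a threshold. The sketch below uses a toy bag-of-words embedding purely as a stand-in for the MiniLM model mentioned above; the class name and threshold are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding", standing in for a real MiniLM vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Returns a cached response when a new prompt's embedding is
    close enough (>= threshold) to a previously cached prompt's."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, prompt: str):
        v = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]  # semantic hit: no provider call, no cost
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))

sc = SemanticCache(threshold=0.8)
sc.put("how do I reset my password", "Go to Settings > Security...")
sc.get("how can I reset my password")  # hit despite different wording
```

The threshold trades hit rate against accuracy: lower values save more cost but risk serving a response for a question that only looks similar.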