Speech-to-Text (STT)
Speech-to-Text transcribes audio files into text. Otoroshi exposes STT through an OpenAI-compatible API endpoint.
Supported providers and models
| Provider | Models |
|---|---|
| OpenAI | whisper-1, gpt-4o-mini-transcribe |
| Azure OpenAI | whisper-1, gpt-4o-mini-transcribe |
| Cloud Temple 🇫🇷 🇪🇺 | whisper-1, gpt-4o-mini-transcribe |
| Groq | whisper-large-v3, whisper-large-v3-turbo, distil-whisper-large-v3-en |
| ElevenLabs | scribe_v1 |
| Mistral 🇫🇷 🇪🇺 | voxtral-mini-latest, voxtral-mini-2507 |
| AlphaEdge 🇫🇷 🇪🇺 | alpha-audio-v1 |
STT configuration
OpenAI / Azure OpenAI / Cloud Temple
{
  "stt": {
    "enabled": true,
    "model": "whisper-1",
    "language": "en",
    "prompt": "Optional context for better transcription",
    "response_format": "json",
    "temperature": 0
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable or disable STT |
| model | string | — | The STT model to use |
| language | string | — | Language of the audio (ISO 639-1 code, e.g. en, fr, de) |
| prompt | string | — | Optional text to guide the model's transcription style |
| response_format | string | — | Response format: json, text, srt, verbose_json, vtt |
| temperature | number | — | Sampling temperature between 0 and 1 |
Groq
Groq accepts the same parameters as OpenAI.
{
  "stt": {
    "enabled": true,
    "model": "whisper-large-v3-turbo",
    "language": "en"
  }
}
ElevenLabs
{
  "stt": {
    "enabled": true,
    "model_id": "scribe_v1",
    "language": "en"
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable or disable STT |
| model_id | string | scribe_v1 | The ElevenLabs STT model |
| language | string | — | Language code for transcription |
Mistral (Voxtral)
{
  "stt": {
    "enabled": true,
    "model": "voxtral-mini-latest",
    "language": "fr",
    "diarize": true,
    "temperature": 0
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable or disable STT |
| model | string | — | The Voxtral model to use |
| language | string | — | Language of the audio |
| diarize | boolean | — | Enable speaker diarization (identifies different speakers in the audio). Mistral-specific feature. |
| temperature | number | — | Sampling temperature |
AlphaEdge
AlphaEdge 🇫🇷 🇪🇺 is a French/EU provider specialized in speech transcription and OCR. The audio model only supports STT (no TTS, no translation). Authentication uses the X-API-Key header (set through config.connection.token).
{
  "stt": {
    "enabled": true,
    "model": "alpha-audio-v1",
    "enable_diarization": false,
    "enable_postcorrect": false
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable or disable STT |
| model | string | alpha-audio-v1 | The AlphaEdge transcription model |
| enable_diarization | boolean | false | Enable speaker diarization (identifies different speakers in the audio) |
| enable_postcorrect | boolean | false | Apply linguistic post-correction (punctuation, capitalization, spelling, stuttering removal) |
enable_diarization and enable_postcorrect can also be passed per-request as form fields, overriding the entity configuration.
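The override rule can be pictured as a simple merge in which per-request form fields win over the entity configuration. The sketch below is only an illustration of that precedence, not the gateway's implementation: `effective_options` is a hypothetical helper, and parsing the form values as the strings "true"/"false" is an assumption.

```python
# Entity-level defaults, mirroring the AlphaEdge "stt" block above.
ENTITY_STT = {
    "model": "alpha-audio-v1",
    "enable_diarization": False,
    "enable_postcorrect": False,
}

OVERRIDABLE = ("enable_diarization", "enable_postcorrect")


def effective_options(entity_stt, form_fields):
    """Return the options in effect for one request (hypothetical helper).

    Form fields arrive as strings; treating "true" (any case) as True is
    an assumption made for this sketch.
    """
    merged = dict(entity_stt)  # copy so the entity defaults stay untouched
    for key in OVERRIDABLE:
        if key in form_fields:
            merged[key] = form_fields[key].strip().lower() == "true"
    return merged


# A request that turns diarization on for this call only.
print(effective_options(ENTITY_STT, {"enable_diarization": "true"}))
```

Fields that are absent from the request keep the value stored on the entity.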
API usage
Plugin setup
Add the Cloud APIM - Speech to text backend plugin to your route:
{
  "enabled": true,
  "plugin": "cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatSpeechToText",
  "config": {
    "refs": ["audio-gen-model_xxxxxxxxx"],
    "max_size_upload": 104857600
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| refs | array of strings | — | References to audio model entities |
| max_size_upload | number | 104857600 (100 MB) | Maximum upload file size in bytes |
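A client can check a file against max_size_upload before sending it, which avoids uploading audio the plugin would refuse. The following is a minimal sketch: `fits_upload_limit` is a hypothetical helper, and treating a file of exactly the limit as acceptable is an assumption, since the table does not say whether the bound is inclusive.

```python
import os
import tempfile

# Documented default: 100 * 1024 * 1024 bytes.
MAX_SIZE_UPLOAD = 104857600


def fits_upload_limit(path, limit=MAX_SIZE_UPLOAD):
    """Return True when the file at `path` is at most `limit` bytes.

    Assumption: a file whose size equals the limit is accepted.
    """
    return os.path.getsize(path) <= limit


# Demonstration with a small throwaway file standing in for a recording.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
    tmp.write(b"\x00" * 2048)

print(fits_upload_limit(tmp.name))
os.remove(tmp.name)
```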
Request
curl https://my-audio-endpoint.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OTOROSHI_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1" \
  -F "language=en"
The request is sent as multipart form data.
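For clients that cannot shell out to curl, the same multipart request can be assembled with the Python standard library. This is a sketch under stated assumptions: `build_multipart` and `build_request` are hypothetical helpers, the URL and API key are placeholders taken from the curl example, and the `audio/mpeg` content type is assumed for an MP3 file. The request is built but not sent.

```python
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes,
                    content_type="audio/mpeg"):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(url, api_key, audio_bytes, filename="recording.mp3"):
    """Build (but do not send) a transcription request."""
    body, ctype = build_multipart(
        {"model": "whisper-1", "language": "en"}, "file", filename, audio_bytes
    )
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": ctype},
    )


req = build_request(
    "https://my-audio-endpoint.example.com/v1/audio/transcriptions",
    "dummy-key",            # placeholder, not a real API key
    b"fake-audio-bytes",    # stands in for the contents of recording.mp3
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; in real use, read the audio with `open("recording.mp3", "rb").read()` and take the key from the environment.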
Request parameters
| Parameter | Type | Description |
|---|---|---|
| file | file | The audio file to transcribe (required) |
| model | string | Model to use. Supports provider_id###model_name routing syntax |
| language | string | Language of the audio |
| prompt | string | Optional context for better transcription |
| response_format | string | Desired response format |
| temperature | string | Sampling temperature |
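The provider_id###model_name syntax packs two values into the model field, separated by three hash characters. The sketch below shows how such a value could be split on the client or in tests. `split_model` is a hypothetical helper and "my-provider-id" is a made-up identifier; how the gateway resolves the provider part is not covered here.

```python
SEPARATOR = "###"


def split_model(value):
    """Split a model field into (provider_id, model_name).

    provider_id is None when the value contains no "###" separator,
    i.e. when only a plain model name was given.
    """
    provider_id, sep, model_name = value.partition(SEPARATOR)
    if not sep:
        return None, value
    return provider_id, model_name


print(split_model("my-provider-id###whisper-1"))  # routed form
print(split_model("whisper-1"))                   # plain model name
```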
Response
{
  "text": "Hello, how are you today?"
}
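Reading the transcription out of this response takes one JSON lookup. The sketch below uses a hypothetical helper, `extract_text`, and assumes that only the json and verbose_json formats return a JSON body with a text field, while text, srt and vtt come back as plain text. That matches the OpenAI API, but whether every provider behind the gateway behaves the same way is an assumption.

```python
import json

# Formats assumed to produce a JSON body containing a "text" field.
JSON_FORMATS = {"json", "verbose_json"}


def extract_text(raw_body, response_format="json"):
    """Return the transcription carried by a response body (a str)."""
    if response_format in JSON_FORMATS:
        return json.loads(raw_body)["text"]
    # text, srt and vtt are assumed to be plain text already.
    return raw_body


print(extract_text('{"text": "Hello, how are you today?"}'))
```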
Full entity example with Mistral (Voxtral)
{
  "id": "audio-gen-model_xxxxxxxxx",
  "name": "Mistral Voxtral STT",
  "description": "Mistral speech-to-text with Voxtral",
  "provider": "mistral",
  "config": {
    "connection": {
      "token": "${vault://local/MISTRAL_API_TOKEN}",
      "timeout": 30000
    },
    "options": {
      "stt": {
        "enabled": true,
        "model": "voxtral-mini-latest",
        "diarize": true
      }
    }
  },
  "kind": "ai-gateway.extensions.cloud-apim.com/AudioModel"
}
Full entity example with ElevenLabs
{
  "id": "audio-gen-model_xxxxxxxxx",
  "name": "ElevenLabs Audio",
  "description": "ElevenLabs TTS and STT",
  "provider": "elevenlabs",
  "config": {
    "connection": {
      "token": "${vault://local/ELEVENLABS_API_KEY}",
      "timeout": 30000
    },
    "options": {
      "tts": {
        "enabled": true,
        "model_id": "eleven_multilingual_v2",
        "voice_id": "21m00Tcm4TlvDq8ikWAM",
        "output_format": "mp3_44100_128"
      },
      "stt": {
        "enabled": true,
        "model_id": "scribe_v1"
      }
    }
  },
  "kind": "ai-gateway.extensions.cloud-apim.com/AudioModel"
}
Full entity example with AlphaEdge
{
  "id": "audio-gen-model_xxxxxxxxx",
  "name": "AlphaEdge STT",
  "description": "AlphaEdge speech-to-text",
  "provider": "alphaedge",
  "config": {
    "connection": {
      "base_url": "https://api-endpoints.alphaedge-ai.com",
      "token": "${vault://local/ALPHAEDGE_API_KEY}",
      "timeout": 180000
    },
    "options": {
      "stt": {
        "enabled": true,
        "model": "alpha-audio-v1",
        "enable_diarization": true,
        "enable_postcorrect": true
      }
    }
  },
  "kind": "ai-gateway.extensions.cloud-apim.com/AudioModel"
}