# Speech-to-Text (STT)

Speech-to-Text transcribes audio files into text. Otoroshi exposes STT through an OpenAI-compatible API endpoint.

## Supported providers and models

| Provider | Models |
| --- | --- |
| OpenAI | whisper-1, gpt-4o-mini-transcribe |
| Azure OpenAI | whisper-1, gpt-4o-mini-transcribe |
| Cloud Temple 🇫🇷 🇪🇺 | whisper-1, gpt-4o-mini-transcribe |
| Groq | whisper-large-v3, whisper-large-v3-turbo, distil-whisper-large-v3-en |
| ElevenLabs | scribe_v1 |
| Mistral | voxtral-mini-latest, voxtral-mini-2507 |

## STT configuration

### OpenAI / Azure OpenAI / Cloud Temple

```json
{
  "stt": {
    "enabled": true,
    "model": "whisper-1",
    "language": "en",
    "prompt": "Optional context for better transcription",
    "response_format": "json",
    "temperature": 0
  }
}
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable or disable STT |
| model | string | | The STT model to use |
| language | string | | Language of the audio (ISO 639-1 code, e.g. en, fr, de) |
| prompt | string | | Optional text to guide the model's transcription style |
| response_format | string | | Response format: json, text, srt, verbose_json, vtt |
| temperature | number | | Sampling temperature between 0 and 1 |

### Groq

Same parameters as OpenAI.

```json
{
  "stt": {
    "enabled": true,
    "model": "whisper-large-v3-turbo",
    "language": "en"
  }
}
```

### ElevenLabs

```json
{
  "stt": {
    "enabled": true,
    "model_id": "scribe_v1",
    "language": "en"
  }
}
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable or disable STT |
| model_id | string | scribe_v1 | The ElevenLabs STT model |
| language | string | | Language code for transcription |

### Mistral (Voxtral)

```json
{
  "stt": {
    "enabled": true,
    "model": "voxtral-mini-latest",
    "language": "fr",
    "diarize": true,
    "temperature": 0
  }
}
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable or disable STT |
| model | string | | The Voxtral model to use |
| language | string | | Language of the audio |
| diarize | boolean | | Enable speaker diarization (identifies different speakers in the audio). Mistral-specific feature. |
| temperature | number | | Sampling temperature |

## API usage

### Plugin setup

Add the Cloud APIM - Speech to text backend plugin to your route:

```json
{
  "enabled": true,
  "plugin": "cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatSpeechToText",
  "config": {
    "refs": ["audio-gen-model_xxxxxxxxx"],
    "max_size_upload": 104857600
  }
}
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| refs | array of strings | | References to audio model entities |
| max_size_upload | number | 104857600 (100 MB) | Maximum upload file size in bytes |
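A client can reject oversized files before uploading rather than waiting for the gateway to do so. A minimal sketch against the default limit above (the helper name is ours, not part of the plugin API):

```python
# Pre-check a file's size against the plugin's default max_size_upload (100 MB).
MAX_SIZE_UPLOAD = 104857600  # bytes, the plugin default shown above

def within_upload_limit(size_bytes: int, limit: int = MAX_SIZE_UPLOAD) -> bool:
    """Return True when the audio file fits under the configured upload limit."""
    return size_bytes <= limit

print(within_upload_limit(5 * 1024 * 1024))    # a 5 MB recording fits
print(within_upload_limit(200 * 1024 * 1024))  # 200 MB exceeds the default
```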

### Request

```sh
curl https://my-audio-endpoint.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OTOROSHI_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-1" \
  -F "language=en"
```

The request is sent as multipart form data.
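To make "multipart form data" concrete, here is a sketch of how such a body is assembled, using only the Python standard library. The field names match the request parameters below; the file bytes are placeholder data, not a real MP3, and in practice an HTTP client library would build this for you:

```python
import io
import uuid

def build_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Assemble a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain text fields (model, language, ...), one part each.
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The audio file part, with a filename and content type.
    buf.write(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: audio/mpeg\r\n\r\n'.encode()
    )
    buf.write(file_bytes)
    buf.write(f'\r\n--{boundary}--\r\n'.encode())
    return f'multipart/form-data; boundary={boundary}', buf.getvalue()

content_type, body = build_multipart(
    {"model": "whisper-1", "language": "en"},
    "file", "recording.mp3", b"FAKE_MP3_BYTES",
)
```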

### Request parameters

| Parameter | Type | Description |
| --- | --- | --- |
| file | file | The audio file to transcribe (required) |
| model | string | Model to use. Supports `provider_id###model_name` routing syntax |
| language | string | Language of the audio |
| prompt | string | Optional context for better transcription |
| response_format | string | Desired response format |
| temperature | number | Sampling temperature |
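The routing syntax packs a provider reference and a model name into the single model field. A quick sketch of how such a value splits on the `###` separator (the provider id is a placeholder, as in the plugin example above):

```python
# Split a "provider_id###model_name" value into its two halves.
model_param = "audio-gen-model_xxxxxxxxx###whisper-1"

provider_id, _, model_name = model_param.partition("###")
print(provider_id)  # audio-gen-model_xxxxxxxxx
print(model_name)   # whisper-1
```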

### Response

```json
{
  "text": "Hello, how are you today?"
}
```
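With the default `json` response format, extracting the transcript is a one-liner in any language; in Python, for the example body above:

```python
import json

# Parse the transcription response and pull out the text field.
raw = '{"text": "Hello, how are you today?"}'
transcript = json.loads(raw)["text"]
print(transcript)  # Hello, how are you today?
```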

## Full entity example with Mistral (Voxtral)

```json
{
  "id": "audio-gen-model_xxxxxxxxx",
  "name": "Mistral Voxtral STT",
  "description": "Mistral speech-to-text with Voxtral",
  "provider": "mistral",
  "config": {
    "connection": {
      "token": "${vault://local/MISTRAL_API_TOKEN}",
      "timeout": 30000
    },
    "options": {
      "stt": {
        "enabled": true,
        "model": "voxtral-mini-latest",
        "diarize": true
      }
    }
  },
  "kind": "ai-gateway.extensions.cloud-apim.com/AudioModel"
}
```

## Full entity example with ElevenLabs

```json
{
  "id": "audio-gen-model_xxxxxxxxx",
  "name": "ElevenLabs Audio",
  "description": "ElevenLabs TTS and STT",
  "provider": "elevenlabs",
  "config": {
    "connection": {
      "token": "${vault://local/ELEVENLABS_API_KEY}",
      "timeout": 30000
    },
    "options": {
      "tts": {
        "enabled": true,
        "model_id": "eleven_multilingual_v2",
        "voice_id": "21m00Tcm4TlvDq8ikWAM",
        "output_format": "mp3_44100_128"
      },
      "stt": {
        "enabled": true,
        "model_id": "scribe_v1"
      }
    }
  },
  "kind": "ai-gateway.extensions.cloud-apim.com/AudioModel"
}
```