# Voice Tools

MCP tools for audio processing, transcription, and speech synthesis.
## Available Tools
| Tool | Category | Description |
|---|---|---|
| `transcribe_audio` | voice | Transcribe audio to text using AI speech recognition |
| `synthesize_speech` | voice | Generate speech from text using AI TTS |
| `run_voice_pipeline` | voice | Full STT → LLM → TTS pipeline |
| `list_voices` | voice | List available TTS voices with preview metadata |
## transcribe_audio

Transcribe audio to text using AI speech recognition. Auto-selects provider (Mistral Voxtral default, OpenAI Whisper fallback).

### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `audio_base64` | string | Yes | - | Base64-encoded audio data |
| `language` | string | No | auto-detect | Language code (e.g., `en`, `es`) |
| `provider` | string | No | auto | STT provider: `mistral`, `openai`, `google` |
| `timestamps` | boolean | No | true | Include word-level timestamps |
### Supported Formats
- MP3, WAV, M4A, WebM
- Maximum file size: 25MB
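If you want to fail fast before shipping oversized payloads over the wire, the format and size constraints above can be checked client-side. A minimal sketch (the 25MB limit and extension list come from this section; `encode_audio` is our own helper, not part of the gateway):

```python
import base64
import os

MAX_BYTES = 25 * 1024 * 1024  # 25MB limit from this section
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm"}

def encode_audio(path: str) -> str:
    """Validate format and size, then return base64-encoded audio."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Audio exceeds the 25MB limit")
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()
```

Validating locally avoids a round trip that the gateway would reject anyway.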
### Example

```python
import base64

# Load audio file
with open("audio.mp3", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

result = await mcp.call_tool("transcribe_audio", {
    "audio_base64": audio_base64,
    "language": "en",
    "timestamps": True
})

print(result["text"])
# "Hello, how can I help you today?"
```

### Response

```json
{
  "text": "Hello, how can I help you today?",
  "language": "en",
  "duration_seconds": 2.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "how", "start": 0.6, "end": 0.8}
  ],
  "_gateway": {
    "provider": "mistral",
    "model": "voxtral-mini-latest",
    "latency_ms": 450,
    "cost": {"usd": 0.0012}
  }
}
```

## synthesize_speech
Generate speech from text using AI TTS. Supports OpenAI TTS and ElevenLabs.
### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `text` | string | Yes | - | Text to synthesize (max 4096 chars) |
| `voice` | string | No | alloy | Voice name (alloy, echo, fable, onyx, nova, shimmer) |
| `provider` | string | No | auto | TTS provider: openai, elevenlabs |
| `quality` | string | No | standard | Quality: standard, hd, premium |
| `output_format` | string | No | mp3 | Format: mp3, wav, flac |
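Text beyond the 4096-character limit has to be split across multiple `synthesize_speech` calls. A sketch of sentence-aware chunking (the limit comes from the table above; `chunk_text` is our own helper, not a gateway API):

```python
import re

MAX_TTS_CHARS = 4096  # per-request limit from the table above

def chunk_text(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks under the limit, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) > limit and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
        # A single sentence longer than the limit is split hard
        while len(current) > limit:
            chunks.append(current[:limit])
            current = current[limit:]
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio segments concatenated client-side.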
### Example

```python
import base64

result = await mcp.call_tool("synthesize_speech", {
    "text": "Hello! I'm here to help you today.",
    "voice": "alloy",
    "quality": "hd",
    "output_format": "mp3"
})

# result["audio_base64"] is base64-encoded MP3
audio_bytes = base64.b64decode(result["audio_base64"])
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

### Response

```json
{
  "audio_base64": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV...",
  "format": "mp3",
  "duration_seconds": 2.3,
  "_gateway": {
    "provider": "openai",
    "model": "tts-1-hd",
    "latency_ms": 890,
    "cost": {"usd": 0.0045}
  }
}
```

## run_voice_pipeline
Execute an end-to-end STT → LLM → TTS pipeline. Use pre-built templates or custom configuration.
### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `audio_input` | string | Yes | - | Base64-encoded audio to process |
| `template` | string | Yes | - | Pipeline template (see below) |
| `output_mode` | string | No | both | Output: audio, text, or both |
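When only text output is needed, setting `output_mode` to `"text"` skips the TTS leg, which should reduce latency and cost. A sketch (assumes the same async `call_tool` client used in the examples on this page; `voice_to_text_reply` is our own wrapper, and the template is one of those listed below):

```python
async def voice_to_text_reply(mcp, user_audio_base64: str) -> tuple[str, str]:
    """Run the STT -> LLM legs only; returns (transcription, reply_text)."""
    result = await mcp.call_tool("run_voice_pipeline", {
        "audio_input": user_audio_base64,
        "template": "voice-agent-fast",
        "output_mode": "text"  # no TTS: response carries no audio
    })
    return result["transcription"], result["response_text"]
```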
### Available Templates
| Template | Description |
|---|---|
| `ambient-scribe` | Medical transcription with PII handling |
| `voice-agent-fast` | Low-latency voice assistant (< 2s response) |
| `voice-agent-premium` | High-quality voice assistant |
| `legal-dictation` | Legal document transcription |
### Example

```python
result = await mcp.call_tool("run_voice_pipeline", {
    "audio_input": user_audio_base64,
    "template": "voice-agent-fast",
    "output_mode": "both"
})

print(f"User said: {result['transcription']}")
print(f"Agent replied: {result['response_text']}")
# Play result["response_audio_base64"]
```

### Response

```json
{
  "transcription": "What's the weather like today?",
  "response_text": "It's currently 72°F and sunny in your area.",
  "response_audio_base64": "//uQxAAAAAANIAAAAAExBTUU...",
  "pipeline_latency_ms": 1850,
  "_gateway": {
    "stt_provider": "mistral",
    "llm_provider": "openai",
    "tts_provider": "elevenlabs",
    "total_cost": {"usd": 0.0089}
  }
}
```

## list_voices
List available TTS voices with preview metadata.
### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `provider` | string | No | all | Filter: openai, elevenlabs, all |
### Example

```python
result = await mcp.call_tool("list_voices", {
    "provider": "all"
})

for voice in result["voices"]:
    print(f"{voice['id']}: {voice['description']} ({voice['provider']})")
```

### Response

```json
{
  "voices": [
    {
      "id": "alloy",
      "provider": "openai",
      "description": "Neutral, balanced voice",
      "gender": "neutral",
      "languages": ["en"]
    },
    {
      "id": "rachel",
      "provider": "elevenlabs",
      "description": "Young, friendly American female",
      "gender": "female",
      "languages": ["en", "es", "fr"]
    }
  ]
}
```

## Permissions
Grant voice tool access:
```yaml
permissions:
  tools:
    - voice/transcribe
    - voice/synthesize
    - voice/pipeline
    - voice/voices
```

## Rate Limits
Voice tools have special rate limits:
```yaml
permissions:
  rate_limits:
    audio_minutes_per_day: 60        # Transcription minutes
    synthesis_chars_per_day: 50000   # TTS characters
```

## Next Steps
- Document Tools - OCR and document processing
- LLM Tools - Chat and embeddings