
Voice Tools

MCP tools for audio processing, transcription, and speech synthesis.

Available Tools

| Tool | Category | Description |
|---|---|---|
| transcribe_audio | voice | Transcribe audio to text using AI speech recognition |
| synthesize_speech | voice | Generate speech from text using AI TTS |
| run_voice_pipeline | voice | Full STT → LLM → TTS pipeline |
| list_voices | voice | List available TTS voices with preview metadata |

transcribe_audio

Transcribe audio to text using AI speech recognition. The provider is selected automatically: Mistral Voxtral by default, with OpenAI Whisper as a fallback.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_base64 | string | Yes | - | Base64-encoded audio data |
| language | string | No | auto-detect | Language code (e.g., 'en', 'es') |
| provider | string | No | auto | STT provider: mistral, openai, google |
| timestamps | boolean | No | true | Include word-level timestamps |

Supported Formats

  • MP3, WAV, M4A, WebM
  • Maximum file size: 25MB
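Before encoding and sending audio, it can be worth checking these constraints client-side. The helper below is a hypothetical sketch (not part of the MCP API) that validates a file's extension and size against the documented limits:

```python
# Hypothetical pre-flight check (not part of the MCP API): validate an
# audio file against the documented format and size constraints before
# base64-encoding it for transcribe_audio.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # documented 25MB limit

def validate_audio(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the tool."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes (max {MAX_BYTES})")

validate_audio("meeting.mp3", 4_200_000)  # passes silently
```

Failing fast locally avoids uploading a 30MB file only to have the provider reject it.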

Example

python
import base64

# Load audio file
with open("audio.mp3", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

result = await mcp.call_tool("transcribe_audio", {
    "audio_base64": audio_base64,
    "language": "en",
    "timestamps": True
})

print(result["text"])
# "Hello, how can I help you today?"

Response

json
{
  "text": "Hello, how can I help you today?",
  "language": "en",
  "duration_seconds": 2.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "how", "start": 0.6, "end": 0.8}
  ],
  "_gateway": {
    "provider": "mistral",
    "model": "voxtral-mini-latest",
    "latency_ms": 450,
    "cost": {"usd": 0.0012}
  }
}
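The word-level timestamps in the response are enough to derive simple captions. A small sketch (a hypothetical helper, not part of the tool API) that groups the `words` array into timed caption lines:

```python
# Illustrative helper (not part of the tool API): group the word-level
# timestamps from a transcribe_audio response into caption lines of at
# most `max_words` words, keeping each line's start/end times.
def words_to_captions(words, max_words=8):
    captions = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        captions.append({
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "text": " ".join(w["word"] for w in chunk),
        })
    return captions

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "how", "start": 0.6, "end": 0.8},
]
print(words_to_captions(words, max_words=1))
```

Each caption dict maps directly onto subtitle formats such as SRT or WebVTT if you format the times accordingly.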

synthesize_speech

Generate speech from text using AI TTS. Supports OpenAI TTS and ElevenLabs.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | Text to synthesize (max 4096 chars) |
| voice | string | No | alloy | Voice name (alloy, echo, fable, onyx, nova, shimmer) |
| provider | string | No | auto | TTS provider: openai, elevenlabs |
| quality | string | No | standard | Quality: standard, hd, premium |
| output_format | string | No | mp3 | Format: mp3, wav, flac |

Example

python
result = await mcp.call_tool("synthesize_speech", {
    "text": "Hello! I'm here to help you today.",
    "voice": "alloy",
    "quality": "hd",
    "output_format": "mp3"
})

# result["audio_base64"] is base64-encoded MP3
import base64
audio_bytes = base64.b64decode(result["audio_base64"])
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

Response

json
{
  "audio_base64": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV...",
  "format": "mp3",
  "duration_seconds": 2.3,
  "_gateway": {
    "provider": "openai",
    "model": "tts-1-hd",
    "latency_ms": 890,
    "cost": {"usd": 0.0045}
  }
}
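Because `text` is capped at 4096 characters, longer input has to be split across multiple calls. A hypothetical client-side helper (not part of the API) that splits on sentence boundaries so each chunk fits under the limit:

```python
import re

# Hypothetical client-side helper: synthesize_speech rejects text over
# 4096 characters, so split long input on sentence boundaries into
# chunks that each fit under the limit. A single sentence longer than
# the limit is kept whole and would still need manual splitting.
def chunk_text(text: str, limit: int = 4096):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

You would then call `synthesize_speech` once per chunk and concatenate the decoded audio; naive byte concatenation usually plays back fine for MP3, but WAV and FLAC have headers, so a proper audio library is safer for those formats.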

run_voice_pipeline

Execute an end-to-end STT → LLM → TTS pipeline. Use pre-built templates or custom configuration.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_input | string | Yes | - | Base64-encoded audio to process |
| template | string | Yes | - | Pipeline template (see below) |
| output_mode | string | No | both | Output: audio, text, or both |

Available Templates

| Template | Description |
|---|---|
| ambient-scribe | Medical transcription with PII handling |
| voice-agent-fast | Low-latency voice assistant (< 2s response) |
| voice-agent-premium | High-quality voice assistant |
| legal-dictation | Legal document transcription |

Example

python
result = await mcp.call_tool("run_voice_pipeline", {
    "audio_input": user_audio_base64,
    "template": "voice-agent-fast",
    "output_mode": "both"
})

print(f"User said: {result['transcription']}")
print(f"Agent replied: {result['response_text']}")
# Play result["response_audio_base64"]

Response

json
{
  "transcription": "What's the weather like today?",
  "response_text": "It's currently 72°F and sunny in your area.",
  "response_audio_base64": "//uQxAAAAAANIAAAAAExBTUU...",
  "pipeline_latency_ms": 1850,
  "_gateway": {
    "stt_provider": "mistral",
    "llm_provider": "openai",
    "tts_provider": "elevenlabs",
    "total_cost": {"usd": 0.0089}
  }
}
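The `pipeline_latency_ms` and `_gateway` cost metadata make it straightforward to track spend and responsiveness over a multi-turn session. An illustrative helper (not part of the API) that aggregates these fields across several pipeline responses:

```python
# Illustrative helper (not part of the API): aggregate latency and cost
# from the metadata of several run_voice_pipeline responses, e.g. to
# monitor a conversation session.
def summarize_session(results):
    return {
        "turns": len(results),
        "total_latency_ms": sum(r["pipeline_latency_ms"] for r in results),
        "total_cost_usd": round(
            sum(r["_gateway"]["total_cost"]["usd"] for r in results), 4
        ),
    }
```

Feeding each turn's result dict into a list and summarizing it periodically gives a cheap way to spot latency regressions or runaway costs.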

list_voices

List available TTS voices with preview metadata.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | string | No | all | Filter: openai, elevenlabs, all |

Example

python
result = await mcp.call_tool("list_voices", {
    "provider": "all"
})

for voice in result["voices"]:
    print(f"{voice['id']}: {voice['description']} ({voice['provider']})")

Response

json
{
  "voices": [
    {
      "id": "alloy",
      "provider": "openai",
      "description": "Neutral, balanced voice",
      "gender": "neutral",
      "languages": ["en"]
    },
    {
      "id": "rachel",
      "provider": "elevenlabs",
      "description": "Young, friendly American female",
      "gender": "female",
      "languages": ["en", "es", "fr"]
    }
  ]
}
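The tool only filters by provider, so narrowing by language or other fields happens client-side. A hypothetical helper (not part of the API) that filters the returned `voices` list:

```python
# Hypothetical client-side filter (not part of the API): narrow the
# voices returned by list_voices by language and/or provider.
def find_voices(voices, language=None, provider=None):
    return [
        v for v in voices
        if (language is None or language in v["languages"])
        and (provider is None or v["provider"] == provider)
    ]
```

For example, `find_voices(result["voices"], language="es")` would return only voices that list Spanish among their supported languages.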

Permissions

Grant voice tool access:

yaml
permissions:
  tools:
    - voice/transcribe
    - voice/synthesize
    - voice/pipeline
    - voice/voices

Rate Limits

Voice tools are subject to their own rate limits:

yaml
permissions:
  rate_limits:
    audio_minutes_per_day: 60  # Transcription minutes
    synthesis_chars_per_day: 50000  # TTS characters
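To avoid hitting these limits mid-session, a client can track its own daily usage. The sketch below is a hypothetical quota tracker mirroring the configuration above; it is not part of the MCP API:

```python
# Hypothetical quota tracker mirroring the rate-limit config above:
# accumulate transcription minutes and TTS characters for the day and
# check whether a request would exceed the configured limits.
class VoiceQuota:
    def __init__(self, audio_minutes_per_day=60, synthesis_chars_per_day=50000):
        self.audio_limit = audio_minutes_per_day
        self.char_limit = synthesis_chars_per_day
        self.audio_used = 0.0
        self.chars_used = 0

    def can_transcribe(self, minutes: float) -> bool:
        return self.audio_used + minutes <= self.audio_limit

    def record_transcription(self, minutes: float) -> None:
        self.audio_used += minutes

    def can_synthesize(self, chars: int) -> bool:
        return self.chars_used + chars <= self.char_limit

    def record_synthesis(self, chars: int) -> None:
        self.chars_used += chars
```

Checking `can_transcribe()` / `can_synthesize()` before each call lets the client degrade gracefully (queue or refuse work) instead of receiving rate-limit errors from the gateway.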
