
Voice Tools

MCP tools for audio processing, transcription, and speech synthesis.

Available Tools

| Tool | Category | Description |
|---|---|---|
| transcribe_audio | voice | Transcribe audio to text using AI speech recognition |
| synthesize_speech | voice | Generate speech from text using AI TTS |
| run_voice_pipeline | voice | Full STT → LLM → TTS pipeline |
| list_voices | voice | List available TTS voices with preview metadata |

transcribe_audio

Transcribe audio to text using AI speech recognition. The provider is selected automatically: Mistral Voxtral by default, with OpenAI Whisper as a fallback.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_base64 | string | Yes | - | Base64-encoded audio data |
| language | string | No | auto-detect | Language code (e.g., 'en', 'es') |
| provider | string | No | auto | STT provider: mistral, openai, google |
| timestamps | boolean | No | true | Include word-level timestamps |

Supported Formats

  • MP3, WAV, M4A, WebM
  • Maximum file size: 25MB
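Before encoding and sending audio, it can be worth checking these constraints client-side. The helper below is a hypothetical sketch (not part of the MCP API) that validates a file's extension and size against the documented limits:

```python
# Hypothetical pre-flight check (not part of the MCP API): validate an
# audio file against the documented format and size constraints before
# base64-encoding it for transcribe_audio.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # documented 25MB limit

def validate_audio(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the tool."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes (max {MAX_BYTES})")

validate_audio("meeting.mp3", 4_200_000)  # passes silently
```

Failing fast locally avoids uploading a 30MB file only to have the provider reject it.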

Example

python
import base64

# Load audio file
with open("audio.mp3", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

result = await mcp.call_tool("transcribe_audio", {
    "audio_base64": audio_base64,
    "language": "en",
    "timestamps": True
})

print(result["text"])
# "Hello, how can I help you today?"

Response

json
{
  "text": "Hello, how can I help you today?",
  "language": "en",
  "duration_seconds": 2.5,
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "how", "start": 0.6, "end": 0.8}
  ],
  "_gateway": {
    "provider": "mistral",
    "model": "voxtral-mini-latest",
    "latency_ms": 450,
    "cost": {"usd": 0.0012}
  }
}
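The word-level timestamps in the response are enough to derive simple captions. A small sketch (a hypothetical helper, not part of the tool API) that groups the `words` array into timed caption lines:

```python
# Illustrative helper (not part of the tool API): group the word-level
# timestamps from a transcribe_audio response into caption lines of at
# most `max_words` words, keeping each line's start/end times.
def words_to_captions(words, max_words=8):
    captions = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        captions.append({
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "text": " ".join(w["word"] for w in chunk),
        })
    return captions

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "how", "start": 0.6, "end": 0.8},
]
print(words_to_captions(words, max_words=1))
```

Each caption dict maps directly onto subtitle formats such as SRT or WebVTT if you format the times accordingly.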

synthesize_speech

Generate speech from text using AI TTS. Supports OpenAI TTS and ElevenLabs.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | Text to synthesize (max 4096 chars) |
| voice | string | No | alloy | Voice name (alloy, echo, fable, onyx, nova, shimmer) |
| provider | string | No | auto | TTS provider: openai, elevenlabs |
| quality | string | No | standard | Quality: standard, hd, premium |
| output_format | string | No | mp3 | Format: mp3, wav, flac |

Example

python
result = await mcp.call_tool("synthesize_speech", {
    "text": "Hello! I'm here to help you today.",
    "voice": "alloy",
    "quality": "hd",
    "output_format": "mp3"
})

# result["audio_base64"] is base64-encoded MP3
import base64
audio_bytes = base64.b64decode(result["audio_base64"])
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

Response

json
{
  "audio_base64": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV...",
  "format": "mp3",
  "duration_seconds": 2.3,
  "_gateway": {
    "provider": "openai",
    "model": "tts-1-hd",
    "latency_ms": 890,
    "cost": {"usd": 0.0045}
  }
}
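Because `text` is capped at 4096 characters, longer input has to be split across multiple calls. A hypothetical client-side helper (not part of the API) that splits on sentence boundaries so each chunk fits under the limit:

```python
import re

# Hypothetical client-side helper: synthesize_speech rejects text over
# 4096 characters, so split long input on sentence boundaries into
# chunks that each fit under the limit. A single sentence longer than
# the limit is kept whole and would still need manual splitting.
def chunk_text(text: str, limit: int = 4096):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

You would then call `synthesize_speech` once per chunk and concatenate the decoded audio; naive byte concatenation usually plays back fine for MP3, but WAV and FLAC have headers, so a proper audio library is safer for those formats.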

run_voice_pipeline

Execute an end-to-end STT → LLM → TTS pipeline. Use pre-built templates or custom configuration.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_input | string | Yes | - | Base64-encoded audio to process |
| template | string | Yes | - | Pipeline template (see below) |
| output_mode | string | No | both | Output: audio, text, or both |

Available Templates

| Template | Description |
|---|---|
| ambient-scribe | Medical transcription with PII handling |
| voice-agent-fast | Low-latency voice assistant (< 2s response) |
| voice-agent-premium | High-quality voice assistant |
| legal-dictation | Legal document transcription |

Example

python
result = await mcp.call_tool("run_voice_pipeline", {
    "audio_input": user_audio_base64,
    "template": "voice-agent-fast",
    "output_mode": "both"
})

print(f"User said: {result['transcription']}")
print(f"Agent replied: {result['response_text']}")
# Play result["response_audio_base64"]

Response

json
{
  "transcription": "What's the weather like today?",
  "response_text": "It's currently 72°F and sunny in your area.",
  "response_audio_base64": "//uQxAAAAAANIAAAAAExBTUU...",
  "pipeline_latency_ms": 1850,
  "_gateway": {
    "stt_provider": "mistral",
    "llm_provider": "openai",
    "tts_provider": "elevenlabs",
    "total_cost": {"usd": 0.0089}
  }
}
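The `pipeline_latency_ms` and `_gateway` cost metadata make it straightforward to track spend and responsiveness over a multi-turn session. An illustrative helper (not part of the API) that aggregates these fields across several pipeline responses:

```python
# Illustrative helper (not part of the API): aggregate latency and cost
# from the metadata of several run_voice_pipeline responses, e.g. to
# monitor a conversation session.
def summarize_session(results):
    return {
        "turns": len(results),
        "total_latency_ms": sum(r["pipeline_latency_ms"] for r in results),
        "total_cost_usd": round(
            sum(r["_gateway"]["total_cost"]["usd"] for r in results), 4
        ),
    }
```

Feeding each turn's result dict into a list and summarizing it periodically gives a cheap way to spot latency regressions or runaway costs.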

list_voices

List available TTS voices with preview metadata.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | string | No | all | Filter: openai, elevenlabs, all |

Example

python
result = await mcp.call_tool("list_voices", {
    "provider": "all"
})

for voice in result["voices"]:
    print(f"{voice['id']}: {voice['description']} ({voice['provider']})")

Response

json
{
  "voices": [
    {
      "id": "alloy",
      "provider": "openai",
      "description": "Neutral, balanced voice",
      "gender": "neutral",
      "languages": ["en"]
    },
    {
      "id": "rachel",
      "provider": "elevenlabs",
      "description": "Young, friendly American female",
      "gender": "female",
      "languages": ["en", "es", "fr"]
    }
  ]
}
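The tool only filters by provider, so narrowing by language or other fields happens client-side. A hypothetical helper (not part of the API) that filters the returned `voices` list:

```python
# Hypothetical client-side filter (not part of the API): narrow the
# voices returned by list_voices by language and/or provider.
def find_voices(voices, language=None, provider=None):
    return [
        v for v in voices
        if (language is None or language in v["languages"])
        and (provider is None or v["provider"] == provider)
    ]
```

For example, `find_voices(result["voices"], language="es")` would return only voices that list Spanish among their supported languages.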

Permissions

Grant voice tool access:

yaml
permissions:
  tools:
    - voice/transcribe
    - voice/synthesize
    - voice/pipeline
    - voice/voices

Rate Limits

Voice tools are subject to their own rate limits:

yaml
permissions:
  rate_limits:
    audio_minutes_per_day: 60  # Transcription minutes
    synthesis_chars_per_day: 50000  # TTS characters
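To avoid hitting these limits mid-session, a client can track its own daily usage. The sketch below is a hypothetical quota tracker mirroring the configuration above; it is not part of the MCP API:

```python
# Hypothetical quota tracker mirroring the rate-limit config above:
# accumulate transcription minutes and TTS characters for the day and
# check whether a request would exceed the configured limits.
class VoiceQuota:
    def __init__(self, audio_minutes_per_day=60, synthesis_chars_per_day=50000):
        self.audio_limit = audio_minutes_per_day
        self.char_limit = synthesis_chars_per_day
        self.audio_used = 0.0
        self.chars_used = 0

    def can_transcribe(self, minutes: float) -> bool:
        return self.audio_used + minutes <= self.audio_limit

    def record_transcription(self, minutes: float) -> None:
        self.audio_used += minutes

    def can_synthesize(self, chars: int) -> bool:
        return self.chars_used + chars <= self.char_limit

    def record_synthesis(self, chars: int) -> None:
        self.chars_used += chars
```

Checking `can_transcribe()` / `can_synthesize()` before each call lets the client degrade gracefully (queue or refuse work) instead of receiving rate-limit errors from the gateway.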
