Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTSService provides multilingual speech synthesis with voice cloning through a locally hosted streaming server. The service supports real-time streaming and custom voices, using Coqui’s XTTS-v2 model for cross-lingual text-to-speech.

XTTS API Reference

Pipecat’s API methods for XTTS integration

Example Implementation

Complete example with voice cloning

XTTS Repository

Official XTTS streaming server repository

Voice Cloning

Learn about custom voice training

Installation

XTTS requires a running streaming server. Start the server using Docker:
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121
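
The container can take a while to start accepting connections, so it may help to wait for the port before building a pipeline. The following is a minimal sketch, not part of Pipecat: `wait_for_server` is a hypothetical helper, and the host and port assume the `-p 8000:80` mapping from the command above.

```python
import socket
import time


def wait_for_server(host: str = "localhost", port: int = 8000, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper; host and port assume the Docker port mapping shown above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves the port is open,
            # not that the XTTS model has finished loading.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not accepting connections yet; retry
    return False


if __name__ == "__main__":
    print("server reachable:", wait_for_server(timeout=5.0))
```

Because an open port does not guarantee that the model is loaded, a first synthesis request may still be slow or fail shortly after startup.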

Prerequisites

XTTS Server Setup

Before using XTTSService, you need:
  1. Docker Environment: Set up Docker with GPU support for optimal performance
  2. XTTS Server: Run the XTTS streaming server container
  3. Voice Models: Configure voice models and cloning samples as needed

Required Configuration

  • Server URL: Configure the XTTS server endpoint (for example, http://localhost:8000 when using the Docker command above)
  • Voice Selection: Set up voice models or voice cloning samples
GPU acceleration is recommended: the server performs best on a CUDA-capable GPU.

Configuration

XTTSService

  • voice_id (str, required, deprecated): ID of the studio speaker to use for synthesis. Deprecated in v0.0.105. Use settings=XTTSService.Settings(voice=...) instead.
  • base_url (str, required): Base URL of the XTTS streaming server (e.g. http://localhost:8000).
  • aiohttp_session (aiohttp.ClientSession, required): An aiohttp session for HTTP requests to the XTTS server.
  • language (Language, default: Language.EN, deprecated): Language for synthesis. Supports Czech, German, English, Spanish, French, Hindi, Hungarian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Turkish, and Chinese. Deprecated in v0.0.105. Use settings=XTTSService.Settings(language=...) instead.
  • settings (XTTSService.Settings, default: None): Runtime-configurable settings. See Settings below.
  • sample_rate (int, default: None): Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. Audio is automatically resampled from XTTS’s native 24kHz output.
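
To illustrate what resampling from the native 24kHz output involves, here is a simplified sketch using linear interpolation. This is not the resampler Pipecat uses, and `resample_linear` is a hypothetical name. A production resampler would also apply an anti-aliasing filter when downsampling.

```python
def resample_linear(samples: list[float], src_rate: int = 24000, dst_rate: int = 16000) -> list[float]:
    """Resample mono audio by linear interpolation (illustrative only, no anti-aliasing)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    # Output length scales with the ratio of the two sample rates.
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


if __name__ == "__main__":
    one_second = [0.0] * 24000
    print(len(resample_linear(one_second)))  # one second of audio at 16kHz
```

One second of 24kHz audio (24,000 samples) becomes 16,000 samples at 16kHz, so each output sample sits 1.5 input samples after the previous one.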

Settings

Runtime-configurable settings passed via the settings constructor argument using XTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
Parameter   Type             Default   Description
model       str              None      Model identifier. (Inherited.)
voice       str              None      Voice identifier. (Inherited.)
language    Language | str   None      Language for synthesis. (Inherited.)

Usage

Basic Setup

import aiohttp
from pipecat.services.xtts import XTTSService

async with aiohttp.ClientSession() as session:
    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )

With Language Configuration

import aiohttp
from pipecat.services.xtts import XTTSService
from pipecat.transcriptions.language import Language

async with aiohttp.ClientSession() as session:
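    tts = XTTSService(
        settings=XTTSService.Settings(
            voice="Ana Florence",
            # Set the language through Settings; the language= constructor
            # argument is deprecated as of v0.0.105.
            language=Language.ES,
        ),
        base_url="http://localhost:8000",
        aiohttp_session=session,
    )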
The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.

Notes

  • Local server required: XTTS requires a locally running streaming server (via Docker). The service connects to this server over HTTP.
  • Studio speakers: On startup, the service fetches available “studio speakers” from the server’s /studio_speakers endpoint. The configured voice (Settings(voice=...), or the deprecated voice_id) must match one of these speakers.
  • Audio resampling: XTTS natively outputs audio at 24kHz. The service automatically resamples to match the pipeline’s configured sample rate.
  • GPU recommended: The XTTS server performs best with CUDA-enabled GPU acceleration. CPU inference is significantly slower.
  • No API key required: XTTS runs locally, so no external API credentials are needed.
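
To see which studio speakers a running server offers before choosing a voice, the /studio_speakers endpoint mentioned above can be queried directly. The sketch below uses only the standard library and assumes the endpoint returns a JSON object keyed by speaker name. The helper names (`speaker_names`, `fetch_studio_speakers`, `check_voice`) are hypothetical and not part of Pipecat.

```python
import json
import urllib.request


def speaker_names(payload: dict) -> list[str]:
    """Return sorted speaker names, assuming the payload is keyed by speaker name."""
    return sorted(payload)


def fetch_studio_speakers(base_url: str = "http://localhost:8000") -> list[str]:
    """Query the server's /studio_speakers endpoint and return the speaker names."""
    url = base_url.rstrip("/") + "/studio_speakers"
    with urllib.request.urlopen(url, timeout=10) as response:
        return speaker_names(json.load(response))


def check_voice(voice: str, available: list[str]) -> None:
    """Raise ValueError if the requested voice is not among the studio speakers."""
    if voice not in available:
        raise ValueError(f"Unknown voice {voice!r}; available: {', '.join(available)}")


if __name__ == "__main__":
    names = fetch_studio_speakers()  # requires a running XTTS server
    check_voice("Ana Florence", names)
    print(names)
```

Checking the voice name up front gives a clearer error than discovering a mismatch during synthesis.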