Chatterbox TTS API Reference

Overview

The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.

Base URL: http://127.0.0.1:8000
API Version: 0.1.0
Framework: FastAPI with automatic OpenAPI documentation

Quick Start

  • Interactive API Documentation: http://127.0.0.1:8000/docs (Swagger UI)
  • Alternative Documentation: http://127.0.0.1:8000/redoc (ReDoc)
  • OpenAPI Schema: http://127.0.0.1:8000/openapi.json

Authentication

The API currently requires no authentication. CORS is configured to allow browser requests from http://localhost:8001 and http://127.0.0.1:8001 (see CORS Settings under Configuration).


Endpoints

🏠 Root Endpoint

GET /

Welcome message and API status check.

Response:

{
  "message": "Welcome to the Chatterbox TTS API!"
}

👥 Speaker Management

GET /api/speakers/

Retrieve all available speakers.

Response Model: List[Speaker]

Example Response:

[
  {
    "id": "speaker_001",
    "name": "John Doe",
    "sample_path": "/path/to/speaker_samples/john_doe.wav"
  },
  {
    "id": "speaker_002", 
    "name": "Jane Smith",
    "sample_path": "/path/to/speaker_samples/jane_smith.wav"
  }
]

Status Codes:

  • 200: Success
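
Later requests refer to speakers by id, so a client often needs to turn a display name into an id. A minimal sketch, assuming the response has the shape shown above; `find_speaker_id` is a hypothetical client-side helper, not part of the API:

```python
from typing import Optional


def find_speaker_id(speakers: list[dict], name: str) -> Optional[str]:
    """Return the id of the first speaker whose name matches (case-insensitive)."""
    wanted = name.strip().lower()
    for speaker in speakers:
        if speaker.get("name", "").strip().lower() == wanted:
            return speaker["id"]
    return None


# Sample data copied from the example response above.
speakers = [
    {"id": "speaker_001", "name": "John Doe",
     "sample_path": "/path/to/speaker_samples/john_doe.wav"},
    {"id": "speaker_002", "name": "Jane Smith",
     "sample_path": "/path/to/speaker_samples/jane_smith.wav"},
]
print(find_speaker_id(speakers, "jane smith"))
```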

POST /api/speakers/

Create a new speaker from an audio sample.

Request Type: multipart/form-data

Parameters:

  • name (form field, required): Speaker name
  • audio_file (file upload, required): Audio sample file (WAV, MP3, etc.)

Response Model: SpeakerResponse

Example Response:

{
  "id": "speaker_003",
  "name": "Alex Johnson",
  "message": "Speaker added successfully."
}

Status Codes:

  • 201: Speaker created successfully
  • 400: Invalid file type or missing file
  • 500: Server error during speaker creation

Example cURL:

curl -X POST "http://127.0.0.1:8000/api/speakers/" \
  -F "name=Alex Johnson" \
  -F "audio_file=@/path/to/sample.wav"
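
The same upload can be sketched in Python with only the standard library. The field names `name` and `audio_file` come from the parameter list above; the helper names, the guessed MIME type, and the simple multipart encoding are assumptions made for illustration, and the server may apply stricter file-type checks than this sketch anticipates.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_speaker_upload(name: str, filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode the `name` field and the `audio_file` upload as multipart/form-data."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="name"\r\n\r\n'
        f"{name}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def create_speaker(name: str, sample_path: str,
                   base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST the sample to /api/speakers/ and return the decoded JSON response."""
    with open(sample_path, "rb") as sample:
        audio = sample.read()
    body, content_type = build_speaker_upload(name, os.path.basename(sample_path), audio)
    request = urllib.request.Request(
        f"{base_url}/api/speakers/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `create_speaker("Alex Johnson", "/path/to/sample.wav")` against a running server should return a body like the example response above.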

GET /api/speakers/{speaker_id}

Get details for a specific speaker.

Path Parameters:

  • speaker_id (string, required): Unique speaker identifier

Response Model: Speaker

Example Response:

{
  "id": "speaker_001",
  "name": "John Doe", 
  "sample_path": "/path/to/speaker_samples/john_doe.wav"
}

Status Codes:

  • 200: Success
  • 404: Speaker not found

DELETE /api/speakers/{speaker_id}

Delete a speaker by ID.

Path Parameters:

  • speaker_id (string, required): Unique speaker identifier

Example Response:

{
  "message": "Speaker deleted successfully"
}

Status Codes:

  • 200: Speaker deleted successfully
  • 404: Speaker not found

🎭 Dialog Generation

POST /api/dialog/generate_line

Generate audio for a single dialog line (speech or silence).

Request Body: Raw JSON object representing either a SpeechItem or SilenceItem

Speech Item Example:

{
  "type": "speech",
  "speaker_id": "speaker_001",
  "text": "Hello, this is a test message.",
  "exaggeration": 0.7,
  "cfg_weight": 0.6,
  "temperature": 0.8,
  "use_existing_audio": false,
  "audio_url": null
}

Silence Item Example:

{
  "type": "silence",
  "duration": 2.0,
  "use_existing_audio": false,
  "audio_url": null
}

Response:

{
  "audio_url": "/generated_audio/line_abc123def456.wav",
  "type": "speech",
  "text": "Hello, this is a test message."
}

Status Codes:

  • 200: Audio generated successfully
  • 400: Invalid request format or unknown dialog item type
  • 404: Speaker not found
  • 500: Server error during generation
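
Because generation can be slow, it may be worth catching obviously malformed items before sending them. The sketch below mirrors the rules this reference states (a known `type`, non-empty `text`, `duration` greater than 0); `check_line_item` is a hypothetical client-side helper, and the server's own validation remains authoritative.

```python
def check_line_item(item: dict) -> list[str]:
    """Return the problems found in a generate_line payload (empty list if none)."""
    problems = []
    kind = item.get("type")
    if kind == "speech":
        if not item.get("speaker_id"):
            problems.append("speech item needs a speaker_id")
        if not str(item.get("text") or "").strip():
            problems.append("speech item needs non-empty text")
    elif kind == "silence":
        duration = item.get("duration")
        if not isinstance(duration, (int, float)) or duration <= 0:
            problems.append("silence item needs a duration > 0")
    else:
        problems.append(f"unknown dialog item type: {kind!r}")
    return problems


print(check_line_item({"type": "silence", "duration": 0}))
```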

POST /api/dialog/generate

Generate a complete dialog from multiple speech and silence items.

Request Model: DialogRequest

Request Body:

{
  "dialog_items": [
    {
      "type": "speech",
      "speaker_id": "speaker_001", 
      "text": "Welcome to our podcast!",
      "exaggeration": 0.5,
      "cfg_weight": 0.5,
      "temperature": 0.8
    },
    {
      "type": "silence",
      "duration": 1.0
    },
    {
      "type": "speech",
      "speaker_id": "speaker_002",
      "text": "Thank you for having me!",
      "exaggeration": 0.6,
      "cfg_weight": 0.7,
      "temperature": 0.9
    }
  ],
  "output_base_name": "podcast_episode_01"
}

Response Model: DialogResponse

Example Response:

{
  "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
  "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
  "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip", 
  "temp_dir_path": "/path/to/temp/directory",
  "error_message": null
}

Status Codes:

  • 200: Dialog generated successfully
  • 400: Invalid request format or validation errors
  • 404: Speaker or file not found
  • 500: Server error during generation
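
The URLs in a DialogResponse are paths relative to the API host, and `error_message` may carry failure details. The following sketch shows one way a client might turn a response into absolute download URLs; `dialog_outputs` is a hypothetical helper, and it assumes that a non-null `error_message` should be treated as a failure even when the HTTP status is 200.

```python
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:8000"


def dialog_outputs(result: dict, base_url: str = BASE_URL) -> dict:
    """Raise if the response reports an error; otherwise return absolute URLs."""
    if result.get("error_message"):
        raise RuntimeError(result["error_message"])
    urls = {}
    for key in ("concatenated_audio_url", "zip_archive_url"):
        if result.get(key):  # either URL may be null
            urls[key] = urljoin(base_url, result[key])
    return urls


# Abbreviated copy of the example response above.
example = {
    "log": "Processing dialog with 3 items...",
    "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
    "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
    "temp_dir_path": "/path/to/temp/directory",
    "error_message": None,
}
print(dialog_outputs(example))
```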

📁 Static File Serving

GET /generated_audio/{filename}

Serve generated audio files and ZIP archives.

Path Parameters:

  • filename (string, required): Name of the generated file

Response: Binary audio file or ZIP archive

Example URLs:

  • http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav
  • http://127.0.0.1:8000/generated_audio/dialog_archive.zip
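
Fetching a generated file from a script might look like the sketch below. It uses only the standard library; the helper names and the choice to save the file under its own name in a local directory are assumptions for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"


def generated_file_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build the static URL for a generated file, percent-encoding the name."""
    return f"{base_url}/generated_audio/{quote(filename)}"


def download_generated_file(filename: str, dest_dir: str = ".",
                            base_url: str = BASE_URL) -> Path:
    """Stream a generated WAV or ZIP file to dest_dir and return the local path."""
    target = Path(dest_dir) / Path(filename).name
    url = generated_file_url(filename, base_url)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


print(generated_file_url("dialog_concatenated.wav"))
```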

📋 Data Models

Speaker Models

Speaker

{
  "id": "string",
  "name": "string", 
  "sample_path": "string|null"
}

SpeakerResponse

{
  "id": "string",
  "name": "string",
  "message": "string|null"
}

Dialog Models

SpeechItem

{
  "type": "speech",
  "speaker_id": "string",
  "text": "string",
  "exaggeration": 0.5,        // 0.0-2.0, controls expressiveness
  "cfg_weight": 0.5,          // 0.0-2.0, alignment with speaker characteristics  
  "temperature": 0.8,         // 0.0-2.0, randomness in generation
  "use_existing_audio": false,
  "audio_url": "string|null"
}

SilenceItem

{
  "type": "silence",
  "duration": 1.0,            // seconds, must be > 0
  "use_existing_audio": false,
  "audio_url": "string|null"
}

DialogRequest

{
  "dialog_items": [
    // Array of SpeechItem and/or SilenceItem objects
  ],
  "output_base_name": "string"  // Base name for output files
}

DialogResponse

{
  "log": "string",                        // Processing log
  "concatenated_audio_url": "string|null", // URL to final audio
  "zip_archive_url": "string|null",       // URL to ZIP archive
  "temp_dir_path": "string|null",         // Server temp directory
  "error_message": "string|null"          // Error details if failed
}
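
On the client side, the request models can be mirrored with plain dataclasses so that payloads are built with the documented defaults. This is a sketch based only on the field lists above; the backend's actual model definitions are not shown in this reference and may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class SpeechItem:
    speaker_id: str
    text: str
    exaggeration: float = 0.5
    cfg_weight: float = 0.5
    temperature: float = 0.8
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "speech"


@dataclass
class SilenceItem:
    duration: float  # seconds, must be > 0
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "silence"


@dataclass
class DialogRequest:
    dialog_items: list[Union[SpeechItem, SilenceItem]]
    output_base_name: str

    def to_payload(self) -> dict:
        """Convert to a JSON-serializable dict (nested items included)."""
        return asdict(self)


request = DialogRequest(
    dialog_items=[SpeechItem("speaker_001", "Welcome!"), SilenceItem(1.0)],
    output_base_name="demo",
)
payload = request.to_payload()
print(payload["dialog_items"][1])
```

The resulting `payload` can be passed as the JSON body of POST /api/dialog/generate.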

🎛️ TTS Parameters

Exaggeration (exaggeration)

  • Range: 0.0 - 2.0
  • Default: 0.5
  • Description: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.

CFG Weight (cfg_weight)

  • Range: 0.0 - 2.0
  • Default: 0.5
  • Description: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.

Temperature (temperature)

  • Range: 0.0 - 2.0
  • Default: 0.8
  • Description: Controls randomness in generation. Lower values produce more deterministic speech, higher values add more variation.
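
The defaults and ranges above can be enforced on the client before a request is sent. A minimal sketch; `tts_params` is a hypothetical helper, and whether the server rejects or clamps out-of-range values is not stated in this reference.

```python
TTS_DEFAULTS = {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8}
TTS_MIN, TTS_MAX = 0.0, 2.0


def tts_params(**overrides: float) -> dict:
    """Merge overrides into the documented defaults, rejecting out-of-range values."""
    params = dict(TTS_DEFAULTS)
    for name, value in overrides.items():
        if name not in TTS_DEFAULTS:
            raise KeyError(f"unknown TTS parameter: {name}")
        if not TTS_MIN <= value <= TTS_MAX:
            raise ValueError(f"{name}={value} is outside {TTS_MIN}-{TTS_MAX}")
        params[name] = float(value)
    return params


print(tts_params(temperature=1.2))
```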

🔧 Configuration

Directory Structure

The API uses the following directory structure (configurable in app/config.py):

  • Speaker Samples: {PROJECT_ROOT}/speaker_data/speaker_samples/
  • Generated Audio: {PROJECT_ROOT}/backend/tts_generated_dialogs/
  • Temporary Files: {PROJECT_ROOT}/tts_temp_outputs/
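
To check a deployment, the three directories can be derived from the project root and created if missing. This is a sketch of the layout listed above, not a copy of app/config.py; the function names and dictionary keys are hypothetical.

```python
from pathlib import Path


def storage_dirs(project_root: str) -> dict:
    """Map each storage role to its path under the given project root."""
    root = Path(project_root)
    return {
        "speaker_samples": root / "speaker_data" / "speaker_samples",
        "generated_audio": root / "backend" / "tts_generated_dialogs",
        "temp_files": root / "tts_temp_outputs",
    }


def ensure_storage_dirs(project_root: str) -> dict:
    """Create any missing storage directories and return the full mapping."""
    dirs = storage_dirs(project_root)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```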

CORS Settings

  • Allowed Origins: http://localhost:8001, http://127.0.0.1:8001 (plus any FRONTEND_HOST:FRONTEND_PORT when using start_servers.py)
  • Allowed Methods: All
  • Allowed Headers: All
  • Credentials: Enabled
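
The origin list described above could be assembled roughly as follows. How start_servers.py actually passes FRONTEND_HOST and FRONTEND_PORT is not shown in this reference; the sketch assumes they are environment variables and that the frontend is served over plain http.

```python
import os
from typing import Mapping, Optional

DEFAULT_ORIGINS = ["http://localhost:8001", "http://127.0.0.1:8001"]


def allowed_origins(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the default origins plus FRONTEND_HOST:FRONTEND_PORT when both are set."""
    env = os.environ if env is None else env
    origins = list(DEFAULT_ORIGINS)
    host, port = env.get("FRONTEND_HOST"), env.get("FRONTEND_PORT")
    if host and port:
        extra = f"http://{host}:{port}"  # assumed scheme
        if extra not in origins:
            origins.append(extra)
    return origins


print(allowed_origins({"FRONTEND_HOST": "192.168.1.10", "FRONTEND_PORT": "3000"}))
```

A list like this is what FastAPI's CORSMiddleware accepts as its `allow_origins` argument.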

🚀 Usage Examples

Python Client Example

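import requests

# Base URL
BASE_URL = "http://127.0.0.1:8000"

# Get all speakers
speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
print("Available speakers:", speakers)
if not speakers:
    # The dialog below uses speakers[0], so at least one speaker must exist
    raise SystemExit("No speakers available; create one via POST /api/speakers/ first.")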

# Generate a simple dialog
dialog_request = {
    "dialog_items": [
        {
            "type": "speech",
            "speaker_id": speakers[0]["id"],
            "text": "Hello world!",
            "exaggeration": 0.7,
            "cfg_weight": 0.6,
            "temperature": 0.9
        },
        {
            "type": "silence", 
            "duration": 1.0
        }
    ],
    "output_base_name": "test_dialog"
}

response = requests.post(
    f"{BASE_URL}/api/dialog/generate",
    json=dialog_request
)

if response.status_code == 200:
    result = response.json()
    print("Dialog generated!")
    print("Audio URL:", result["concatenated_audio_url"])
    print("ZIP URL:", result["zip_archive_url"])
else:
    print("Error:", response.text)

JavaScript/Frontend Example

// Generate dialog
const dialogRequest = {
  dialog_items: [
    {
      type: "speech",
      speaker_id: "speaker_001",
      text: "Welcome to our show!",
      exaggeration: 0.6,
      cfg_weight: 0.5,
      temperature: 0.8
    }
  ],
  output_base_name: "intro"
};

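const API_BASE = 'http://127.0.0.1:8000';

fetch(`${API_BASE}/api/dialog/generate`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(dialogRequest)
})
.then(response => {
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
})
.then(data => {
  console.log('Dialog generated:', data);
  // The returned URL is relative to the API host, so resolve it against API_BASE
  const audio = new Audio(new URL(data.concatenated_audio_url, API_BASE).href);
  return audio.play();
})
.catch(error => console.error('Dialog generation failed:', error));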

⚠️ Error Handling

Common Error Responses

400 Bad Request

{
  "detail": "Invalid value or configuration: Text cannot be empty"
}

404 Not Found

{
  "detail": "Speaker sample for ID 'invalid_speaker' not found."
}

500 Internal Server Error

{
  "detail": "Runtime error during dialog generation: CUDA out of memory"
}

Error Categories

  • Validation Errors: Invalid input format, missing required fields
  • Resource Errors: Speaker not found, file not accessible
  • Processing Errors: TTS model failures, audio processing issues
  • System Errors: Memory issues, disk space, model loading failures
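
A client can fold the status code and the `detail` field into one readable message. The mapping from status codes to the categories above is an assumption made for this sketch, as is the helper name `describe_error`.

```python
import json

# Assumed mapping from HTTP status to the categories listed above.
CATEGORY_BY_STATUS = {
    400: "validation",
    404: "resource",
    500: "processing or system",
}


def describe_error(status: int, body: str) -> str:
    """Summarize an error response, falling back to the raw body if it is not JSON."""
    try:
        detail = json.loads(body).get("detail", body)
    except (ValueError, AttributeError):
        detail = body  # not JSON, or JSON that is not an object
    category = CATEGORY_BY_STATUS.get(status, "unknown")
    return f"{status} ({category}): {detail}"


print(describe_error(400, '{"detail": "Invalid value or configuration: Text cannot be empty"}'))
```

If the backend relies on FastAPI's default request validation, those failures arrive as 422 with a list in `detail` rather than 400; the helper prints that list unchanged and labels the category as unknown.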

🔍 Development & Debugging

Running the Server

# From project root
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000

API Documentation

  • Swagger UI: http://127.0.0.1:8000/docs
  • ReDoc: http://127.0.0.1:8000/redoc

Logging

The API provides detailed logging in the DialogResponse.log field for dialog generation operations.

File Management

  • Generated files are stored in backend/tts_generated_dialogs/
  • Temporary processing files are kept for inspection (not auto-deleted)
  • ZIP archives contain individual audio segments plus concatenated result
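
After downloading an archive, the segments it contains can be listed with the standard `zipfile` module. The sketch assumes the segments are WAV files, as in the examples in this reference; the internal file naming of the archive is not documented here.

```python
import zipfile


def list_archive_audio(zip_source) -> list[str]:
    """Return the sorted names of the WAV files inside a dialog ZIP archive.

    zip_source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".wav")
        )
```

For example, `list_archive_audio("podcast_episode_01_archive.zip")` would list the individual lines and the concatenated result for a downloaded archive of that name.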

📝 Notes

  • The API automatically loads and unloads TTS models to manage memory usage
  • Speaker audio samples should be clear, single-speaker recordings for best results
  • Large dialogs may take significant time to process depending on hardware
  • Generated files are served statically and persist until manually cleaned up

Generated on: 2025-06-06