# Chatterbox TTS API Reference

## Overview

The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.

**Base URL**: `http://127.0.0.1:8000`

**API Version**: 0.1.0

**Framework**: FastAPI with automatic OpenAPI documentation

## Quick Start

- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI)
- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc)
- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json`

## Authentication

Currently, the API does not require authentication. CORS is configured to allow requests from `localhost:8001` and `127.0.0.1:8001`.

---

## Endpoints

### 🏠 Root Endpoint

#### `GET /`

Welcome message and API status check.

**Response:**

```json
{
  "message": "Welcome to the Chatterbox TTS API!"
}
```

---

## 👥 Speaker Management

### `GET /api/speakers/`

Retrieve all available speakers.

**Response Model:** `List[Speaker]`

**Example Response:**

```json
[
  {
    "id": "speaker_001",
    "name": "John Doe",
    "sample_path": "/path/to/speaker_samples/john_doe.wav"
  },
  {
    "id": "speaker_002",
    "name": "Jane Smith",
    "sample_path": "/path/to/speaker_samples/jane_smith.wav"
  }
]
```

**Status Codes:**

- `200`: Success

---

### `POST /api/speakers/`

Create a new speaker from an audio sample.

**Request Type:** `multipart/form-data`

**Parameters:**

- `name` (form field, required): Speaker name
- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.)

**Response Model:** `SpeakerResponse`

**Example Response:**

```json
{
  "id": "speaker_003",
  "name": "Alex Johnson",
  "message": "Speaker added successfully."
}
```

**Status Codes:**

- `201`: Speaker created successfully
- `400`: Invalid file type or missing file
- `500`: Server error during speaker creation

**Example cURL:**

```bash
curl -X POST "http://127.0.0.1:8000/api/speakers/" \
  -F "name=Alex Johnson" \
  -F "audio_file=@/path/to/sample.wav"
```

---

### `GET /api/speakers/{speaker_id}`

Get details for a specific speaker.

**Path Parameters:**

- `speaker_id` (string, required): Unique speaker identifier

**Response Model:** `Speaker`

**Example Response:**

```json
{
  "id": "speaker_001",
  "name": "John Doe",
  "sample_path": "/path/to/speaker_samples/john_doe.wav"
}
```

**Status Codes:**

- `200`: Success
- `404`: Speaker not found

---

### `DELETE /api/speakers/{speaker_id}`

Delete a speaker by ID.

**Path Parameters:**

- `speaker_id` (string, required): Unique speaker identifier

**Example Response:**

```json
{
  "message": "Speaker deleted successfully"
}
```

**Status Codes:**

- `200`: Speaker deleted successfully
- `404`: Speaker not found

---

## 🎭 Dialog Generation

### `POST /api/dialog/generate_line`

Generate audio for a single dialog line (speech or silence).

**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem`

#### Speech Item Example:

```json
{
  "type": "speech",
  "speaker_id": "speaker_001",
  "text": "Hello, this is a test message.",
  "exaggeration": 0.7,
  "cfg_weight": 0.6,
  "temperature": 0.8,
  "use_existing_audio": false,
  "audio_url": null
}
```

#### Silence Item Example:

```json
{
  "type": "silence",
  "duration": 2.0,
  "use_existing_audio": false,
  "audio_url": null
}
```

**Response:**

```json
{
  "audio_url": "/generated_audio/line_abc123def456.wav",
  "type": "speech",
  "text": "Hello, this is a test message."
}
```

**Status Codes:**

- `200`: Audio generated successfully
- `400`: Invalid request format or unknown dialog item type
- `404`: Speaker not found
- `500`: Server error during generation

---

### `POST /api/dialog/generate`

Generate a complete dialog from multiple speech and silence items.
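
Payloads for this endpoint can also be assembled programmatically. The following is a minimal sketch of client-side helpers; the function names (`speech_item`, `silence_item`, `dialog_request`) are hypothetical and not part of the API. The field names follow the request body documented for this endpoint, and the 0.0-2.0 bounds and positive-duration rule are taken from the `SpeechItem` and `SilenceItem` models in the Data Models section. Whether the server enforces exactly these checks is an assumption.

```python
from typing import Any, Dict, List

# Bounds taken from the SpeechItem model in this reference (0.0-2.0 for all three).
PARAM_MIN, PARAM_MAX = 0.0, 2.0


def speech_item(speaker_id: str, text: str, exaggeration: float = 0.5,
                cfg_weight: float = 0.5, temperature: float = 0.8) -> Dict[str, Any]:
    """Build a speech item after a client-side check of the documented ranges."""
    if not text.strip():
        raise ValueError("text cannot be empty")
    for name, value in (("exaggeration", exaggeration),
                        ("cfg_weight", cfg_weight),
                        ("temperature", temperature)):
        if not PARAM_MIN <= value <= PARAM_MAX:
            raise ValueError(f"{name} must be between {PARAM_MIN} and {PARAM_MAX}")
    return {
        "type": "speech",
        "speaker_id": speaker_id,
        "text": text,
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
    }


def silence_item(duration: float) -> Dict[str, Any]:
    """Build a silence item; the reference states duration must be > 0 seconds."""
    if duration <= 0:
        raise ValueError("duration must be > 0")
    return {"type": "silence", "duration": duration}


def dialog_request(items: List[Dict[str, Any]], output_base_name: str) -> Dict[str, Any]:
    """Wrap dialog items in the DialogRequest shape."""
    return {"dialog_items": list(items), "output_base_name": output_base_name}
```

The resulting dictionary can be passed as the JSON body of the request, for example `dialog_request([speech_item("speaker_001", "Hi"), silence_item(1.0)], "demo")`.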

**Request Model:** `DialogRequest`

**Request Body:**

```json
{
  "dialog_items": [
    {
      "type": "speech",
      "speaker_id": "speaker_001",
      "text": "Welcome to our podcast!",
      "exaggeration": 0.5,
      "cfg_weight": 0.5,
      "temperature": 0.8
    },
    {
      "type": "silence",
      "duration": 1.0
    },
    {
      "type": "speech",
      "speaker_id": "speaker_002",
      "text": "Thank you for having me!",
      "exaggeration": 0.6,
      "cfg_weight": 0.7,
      "temperature": 0.9
    }
  ],
  "output_base_name": "podcast_episode_01"
}
```

**Response Model:** `DialogResponse`

**Example Response:**

```json
{
  "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
  "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
  "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
  "temp_dir_path": "/path/to/temp/directory",
  "error_message": null
}
```

**Status Codes:**

- `200`: Dialog generated successfully
- `400`: Invalid request format or validation errors
- `404`: Speaker or file not found
- `500`: Server error during generation

---

## 📁 Static File Serving

### `GET /generated_audio/{filename}`

Serve generated audio files and ZIP archives.
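
The URLs returned by the dialog endpoints (for example `concatenated_audio_url`) are paths on the API server, so a client typically joins them with the base URL before downloading. The following is a minimal sketch using only the Python standard library; the helper names `absolute_url` and `download_generated_file` are illustrative and not part of the API.

```python
import shutil
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def absolute_url(relative_url: str, base_url: str = BASE_URL) -> str:
    """Join a server-relative path such as /generated_audio/x.wav with the base URL."""
    return urljoin(base_url + "/", relative_url)


def download_generated_file(relative_url: str, dest_path: str) -> str:
    """Stream a generated WAV or ZIP file to dest_path and return that path."""
    with urllib.request.urlopen(absolute_url(relative_url)) as response:
        with open(dest_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
    return dest_path
```

With the server running, `download_generated_file(result["zip_archive_url"], "dialog.zip")` would save the archive locally, where `result` is the parsed `DialogResponse`.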

**Path Parameters:**

- `filename` (string, required): Name of the generated file

**Response:** Binary audio file or ZIP archive

**Example URLs:**

- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav`
- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip`

---

## 📋 Data Models

### Speaker Models

#### `Speaker`

```json
{
  "id": "string",
  "name": "string",
  "sample_path": "string|null"
}
```

#### `SpeakerResponse`

```json
{
  "id": "string",
  "name": "string",
  "message": "string|null"
}
```

### Dialog Models

#### `SpeechItem`

```json
{
  "type": "speech",
  "speaker_id": "string",
  "text": "string",
  "exaggeration": 0.5,        // 0.0-2.0, controls expressiveness
  "cfg_weight": 0.5,          // 0.0-2.0, alignment with speaker characteristics
  "temperature": 0.8,         // 0.0-2.0, randomness in generation
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `SilenceItem`

```json
{
  "type": "silence",
  "duration": 1.0,            // seconds, must be > 0
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `DialogRequest`

```json
{
  "dialog_items": [
    // Array of SpeechItem and/or SilenceItem objects
  ],
  "output_base_name": "string"  // Base name for output files
}
```

#### `DialogResponse`

```json
{
  "log": "string",                          // Processing log
  "concatenated_audio_url": "string|null",  // URL to final audio
  "zip_archive_url": "string|null",         // URL to ZIP archive
  "temp_dir_path": "string|null",           // Server temp directory
  "error_message": "string|null"            // Error details if failed
}
```

---

## 🎛️ TTS Parameters

### Exaggeration (`exaggeration`)

- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.

### CFG Weight (`cfg_weight`)

- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.

### Temperature (`temperature`)

- **Range**: 0.0 - 2.0
- **Default**: 0.8
- **Description**: Controls randomness in generation. Lower values produce more deterministic speech; higher values add more variation.

---

## 🔧 Configuration

### Environment Variables

The API uses the following directory structure (configurable in `app/config.py`):

- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/`
- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/`
- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/`

### CORS Settings

- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001` (plus any `FRONTEND_HOST:FRONTEND_PORT` when using `start_servers.py`)
- Allowed Methods: All
- Allowed Headers: All
- Credentials: Enabled

---

## 🚀 Usage Examples

### Python Client Example

```python
import requests

# Base URL
BASE_URL = "http://127.0.0.1:8000"

# Get all speakers
speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
print("Available speakers:", speakers)

# Generate a simple dialog (assumes at least one speaker already exists)
dialog_request = {
    "dialog_items": [
        {
            "type": "speech",
            "speaker_id": speakers[0]["id"],
            "text": "Hello world!",
            "exaggeration": 0.7,
            "cfg_weight": 0.6,
            "temperature": 0.9
        },
        {
            "type": "silence",
            "duration": 1.0
        }
    ],
    "output_base_name": "test_dialog"
}

response = requests.post(
    f"{BASE_URL}/api/dialog/generate",
    json=dialog_request
)

if response.status_code == 200:
    result = response.json()
    print("Dialog generated!")
    print("Audio URL:", result["concatenated_audio_url"])
    print("ZIP URL:", result["zip_archive_url"])
else:
    print("Error:", response.text)
```

### JavaScript/Frontend Example

```javascript
const API_BASE = 'http://127.0.0.1:8000';

// Generate dialog
const dialogRequest = {
  dialog_items: [
    {
      type: "speech",
      speaker_id: "speaker_001",
      text: "Welcome to our show!",
      exaggeration: 0.6,
      cfg_weight: 0.5,
      temperature: 0.8
    }
  ],
  output_base_name: "intro"
};

fetch(`${API_BASE}/api/dialog/generate`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(dialogRequest)
})
  .then(response => response.json())
  .then(data => {
    console.log('Dialog generated:', data);
    // The returned URL is a path on the API server, so resolve it against
    // the API base before playing (needed when the page is served from
    // another origin, such as port 8001).
    const audio = new Audio(new URL(data.concatenated_audio_url, API_BASE).href);
    audio.play();
  });
```

---

## ⚠️ Error Handling

### Common Error Responses

#### 400 Bad Request

```json
{
  "detail": "Invalid value or configuration: Text cannot be empty"
}
```

#### 404 Not Found

```json
{
  "detail": "Speaker sample for ID 'invalid_speaker' not found."
}
```

#### 500 Internal Server Error

```json
{
  "detail": "Runtime error during dialog generation: CUDA out of memory"
}
```

### Error Categories

- **Validation Errors**: Invalid input format, missing required fields
- **Resource Errors**: Speaker not found, file not accessible
- **Processing Errors**: TTS model failures, audio processing issues
- **System Errors**: Memory issues, disk space, model loading failures

---

## 🔍 Development & Debugging

### Running the Server

```bash
# From project root
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```

### API Documentation

- **Swagger UI**: `http://127.0.0.1:8000/docs`
- **ReDoc**: `http://127.0.0.1:8000/redoc`

### Logging

The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations.

### File Management

- Generated files are stored in `backend/tts_generated_dialogs/`
- Temporary processing files are kept for inspection (not auto-deleted)
- ZIP archives contain the individual audio segments plus the concatenated result

---

## 📝 Notes

- The API automatically loads and unloads TTS models to manage memory usage
- Speaker audio samples should be clear, single-speaker recordings for best results
- Large dialogs may take significant time to process depending on hardware
- Generated files are served statically and persist until manually cleaned up

---

*Generated on: 2025-06-06*

*API Version: 0.1.0*