diff --git a/API_REFERENCE.md b/API_REFERENCE.md new file mode 100644 index 0000000..ef5d022 --- /dev/null +++ b/API_REFERENCE.md @@ -0,0 +1,518 @@ +# Chatterbox TTS API Reference + +## Overview + +The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters. + +**Base URL**: `http://127.0.0.1:8000` +**API Version**: 0.1.0 +**Framework**: FastAPI with automatic OpenAPI documentation + +## Quick Start + +- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI) +- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc) +- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json` + +## Authentication + +Currently, the API does not require authentication. CORS is configured to allow requests from `localhost:8001` and `127.0.0.1:8001`. + +--- + +## Endpoints + +### 🏠 Root Endpoint + +#### `GET /` +Welcome message and API status check. + +**Response:** +```json +{ + "message": "Welcome to the Chatterbox TTS API!" +} +``` + +--- + +## 👥 Speaker Management + +### `GET /api/speakers/` +Retrieve all available speakers. + +**Response Model:** `List[Speaker]` + +**Example Response:** +```json +[ + { + "id": "speaker_001", + "name": "John Doe", + "sample_path": "/path/to/speaker_samples/john_doe.wav" + }, + { + "id": "speaker_002", + "name": "Jane Smith", + "sample_path": "/path/to/speaker_samples/jane_smith.wav" + } +] +``` + +**Status Codes:** +- `200`: Success + +--- + +### `POST /api/speakers/` +Create a new speaker from an audio sample. + +**Request Type:** `multipart/form-data` + +**Parameters:** +- `name` (form field, required): Speaker name +- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.) 
+ +**Response Model:** `SpeakerResponse` + +**Example Response:** +```json +{ + "id": "speaker_003", + "name": "Alex Johnson", + "message": "Speaker added successfully." +} +``` + +**Status Codes:** +- `201`: Speaker created successfully +- `400`: Invalid file type or missing file +- `500`: Server error during speaker creation + +**Example cURL:** +```bash +curl -X POST "http://127.0.0.1:8000/api/speakers/" \ + -F "name=Alex Johnson" \ + -F "audio_file=@/path/to/sample.wav" +``` + +--- + +### `GET /api/speakers/{speaker_id}` +Get details for a specific speaker. + +**Path Parameters:** +- `speaker_id` (string, required): Unique speaker identifier + +**Response Model:** `Speaker` + +**Example Response:** +```json +{ + "id": "speaker_001", + "name": "John Doe", + "sample_path": "/path/to/speaker_samples/john_doe.wav" +} +``` + +**Status Codes:** +- `200`: Success +- `404`: Speaker not found + +--- + +### `DELETE /api/speakers/{speaker_id}` +Delete a speaker by ID. + +**Path Parameters:** +- `speaker_id` (string, required): Unique speaker identifier + +**Example Response:** +```json +{ + "message": "Speaker deleted successfully" +} +``` + +**Status Codes:** +- `200`: Speaker deleted successfully +- `404`: Speaker not found + +--- + +## 🎭 Dialog Generation + +### `POST /api/dialog/generate_line` +Generate audio for a single dialog line (speech or silence). 
+ +**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem` + +#### Speech Item Example: +```json +{ + "type": "speech", + "speaker_id": "speaker_001", + "text": "Hello, this is a test message.", + "exaggeration": 0.7, + "cfg_weight": 0.6, + "temperature": 0.8, + "use_existing_audio": false, + "audio_url": null +} +``` + +#### Silence Item Example: +```json +{ + "type": "silence", + "duration": 2.0, + "use_existing_audio": false, + "audio_url": null +} +``` + +**Response:** +```json +{ + "audio_url": "/generated_audio/line_abc123def456.wav", + "type": "speech", + "text": "Hello, this is a test message." +} +``` + +**Status Codes:** +- `200`: Audio generated successfully +- `400`: Invalid request format or unknown dialog item type +- `404`: Speaker not found +- `500`: Server error during generation + +--- + +### `POST /api/dialog/generate` +Generate a complete dialog from multiple speech and silence items. + +**Request Model:** `DialogRequest` + +**Request Body:** +```json +{ + "dialog_items": [ + { + "type": "speech", + "speaker_id": "speaker_001", + "text": "Welcome to our podcast!", + "exaggeration": 0.5, + "cfg_weight": 0.5, + "temperature": 0.8 + }, + { + "type": "silence", + "duration": 1.0 + }, + { + "type": "speech", + "speaker_id": "speaker_002", + "text": "Thank you for having me!", + "exaggeration": 0.6, + "cfg_weight": 0.7, + "temperature": 0.9 + } + ], + "output_base_name": "podcast_episode_01" +} +``` + +**Response Model:** `DialogResponse` + +**Example Response:** +```json +{ + "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip", + "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav", + "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip", + "temp_dir_path": "/path/to/temp/directory", + "error_message": null +} 
+``` + +**Status Codes:** +- `200`: Dialog generated successfully +- `400`: Invalid request format or validation errors +- `404`: Speaker or file not found +- `500`: Server error during generation + +--- + +## 📁 Static File Serving + +### `GET /generated_audio/{filename}` +Serve generated audio files and ZIP archives. + +**Path Parameters:** +- `filename` (string, required): Name of the generated file + +**Response:** Binary audio file or ZIP archive + +**Example URLs:** +- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav` +- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip` + +--- + +## 📋 Data Models + +### Speaker Models + +#### `Speaker` +```json +{ + "id": "string", + "name": "string", + "sample_path": "string|null" +} +``` + +#### `SpeakerResponse` +```json +{ + "id": "string", + "name": "string", + "message": "string|null" +} +``` + +### Dialog Models + +#### `SpeechItem` +```jsonc +{ + "type": "speech", + "speaker_id": "string", + "text": "string", + "exaggeration": 0.5, // 0.0-2.0, controls expressiveness + "cfg_weight": 0.5, // 0.0-2.0, alignment with speaker characteristics + "temperature": 0.8, // 0.0-2.0, randomness in generation + "use_existing_audio": false, + "audio_url": "string|null" +} +``` + +#### `SilenceItem` +```jsonc +{ + "type": "silence", + "duration": 1.0, // seconds, must be > 0 + "use_existing_audio": false, + "audio_url": "string|null" +} +``` + +#### `DialogRequest` +```jsonc +{ + "dialog_items": [ + // Array of SpeechItem and/or SilenceItem objects + ], + "output_base_name": "string" // Base name for output files +} +``` + +#### `DialogResponse` +```jsonc +{ + "log": "string", // Processing log + "concatenated_audio_url": "string|null", // URL to final audio + "zip_archive_url": "string|null", // URL to ZIP archive + "temp_dir_path": "string|null", // Server temp directory + "error_message": "string|null" // Error details if failed +} +``` + +--- + +## 🎛️ TTS Parameters + +### Exaggeration (`exaggeration`) +- 
**Range**: 0.0 - 2.0 +- **Default**: 0.5 +- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech. + +### CFG Weight (`cfg_weight`) +- **Range**: 0.0 - 2.0 +- **Default**: 0.5 +- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics. + +### Temperature (`temperature`) +- **Range**: 0.0 - 2.0 +- **Default**: 0.8 +- **Description**: Controls randomness in generation. Lower values produce more deterministic speech; higher values add more variation. + +--- + +## 🔧 Configuration + +### Directory Layout +The API uses the following directory structure (configurable in `app/config.py`): + +- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/` +- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/` +- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/` + +### CORS Settings +- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001` +- Allowed Methods: All +- Allowed Headers: All +- Credentials: Enabled + +--- + +## 🚀 Usage Examples + +### Python Client Example + +```python +import requests + +# Base URL +BASE_URL = "http://127.0.0.1:8000" + +# Get all speakers +speakers = requests.get(f"{BASE_URL}/api/speakers/").json() +print("Available speakers:", speakers) + +# The dialog below uses the first speaker, so stop early if none exist +if not speakers: + raise SystemExit("No speakers found; create one with POST /api/speakers/ first") + +# Generate a simple dialog +dialog_request = { + "dialog_items": [ + { + "type": "speech", + "speaker_id": speakers[0]["id"], + "text": "Hello world!", + "exaggeration": 0.7, + "cfg_weight": 0.6, + "temperature": 0.9 + }, + { + "type": "silence", + "duration": 1.0 + } + ], + "output_base_name": "test_dialog" +} + +response = requests.post( + f"{BASE_URL}/api/dialog/generate", + json=dialog_request +) + +if response.status_code == 200: + result = response.json() + print("Dialog generated!") + print("Audio URL:", result["concatenated_audio_url"]) + print("ZIP URL:", result["zip_archive_url"]) +else: + 
print("Error:", response.text) +``` + +### JavaScript/Frontend Example + +```javascript +// Generate dialog +const dialogRequest = { + dialog_items: [ + { + type: "speech", + speaker_id: "speaker_001", + text: "Welcome to our show!", + exaggeration: 0.6, + cfg_weight: 0.5, + temperature: 0.8 + } + ], + output_base_name: "intro" +}; + +fetch('http://127.0.0.1:8000/api/dialog/generate', { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + }, + body: JSON.stringify(dialogRequest) +}) +.then(response => response.json()) +.then(data => { + console.log('Dialog generated:', data); + // Play the audio; the returned URL is server-relative, so resolve it against the API origin + const audio = new Audio(new URL(data.concatenated_audio_url, 'http://127.0.0.1:8000').href); + audio.play(); +}); +``` + +--- + +## ⚠️ Error Handling + +### Common Error Responses + +#### 400 Bad Request +```json +{ + "detail": "Invalid value or configuration: Text cannot be empty" +} +``` + +#### 404 Not Found +```json +{ + "detail": "Speaker sample for ID 'invalid_speaker' not found." +} +``` + +#### 500 Internal Server Error +```json +{ + "detail": "Runtime error during dialog generation: CUDA out of memory" +} +``` + +### Error Categories +- **Validation Errors**: Invalid input format, missing required fields +- **Resource Errors**: Speaker not found, file not accessible +- **Processing Errors**: TTS model failures, audio processing issues +- **System Errors**: Memory issues, disk space, model loading failures + +--- + +## 🔍 Development & Debugging + +### Running the Server +```bash +# From project root +uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000 +``` + +### API Documentation +- **Swagger UI**: `http://127.0.0.1:8000/docs` +- **ReDoc**: `http://127.0.0.1:8000/redoc` + +### Logging +The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations. 
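Because the log arrives as one newline-separated string, a client may want to split it into lines before displaying it. The following is a minimal Python sketch that assumes only the documented `log` and `error_message` fields; `summarize_dialog_response` is a hypothetical helper name, not part of the API, and the sample payload is illustrative rather than real server output.

```python
from typing import Any


def summarize_dialog_response(response: dict[str, Any]) -> list[str]:
    """Return the processing log as non-empty lines, with any error appended."""
    # `log` is documented as a single string; steps are separated by newlines.
    raw_log = response.get("log") or ""
    lines = [line for line in raw_log.split("\n") if line.strip()]
    # `error_message` is null (None in Python) when generation succeeded.
    error = response.get("error_message")
    if error:
        lines.append(f"ERROR: {error}")
    return lines


# Illustrative payload shaped like the DialogResponse model in this reference.
sample = {
    "log": (
        "Processing dialog with 3 items...\n"
        "Generating speech for item 1...\n"
        "Concatenating audio segments..."
    ),
    "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
    "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
    "temp_dir_path": "/path/to/temp/directory",
    "error_message": None,
}

for number, line in enumerate(summarize_dialog_response(sample), start=1):
    print(f"{number:02d}  {line}")
```

In a real client the dictionary would come from `response.json()` after calling `POST /api/dialog/generate`.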
+ +### File Management +- Generated files are stored in `backend/tts_generated_dialogs/` +- Temporary processing files are kept for inspection (not auto-deleted) +- ZIP archives contain the individual audio segments plus the concatenated result + +--- + +## 📝 Notes + +- The API automatically loads and unloads TTS models to manage memory usage +- Speaker audio samples should be clear, single-speaker recordings for best results +- Large dialogs may take significant time to process, depending on hardware +- Generated files are served statically and persist until manually cleaned up + +--- + +*Generated on: 2025-06-06* +*API Version: 0.1.0* diff --git a/speaker_data/speakers.yaml b/speaker_data/speakers.yaml index f92a7c0..3331196 100644 --- a/speaker_data/speakers.yaml +++ b/speaker_data/speakers.yaml @@ -16,3 +16,6 @@ fb84ce1c-f32d-4df9-9673-2c64e9603133: a6387c23-4ca4-42b5-8aaf-5699dbabbdf0: name: Mike sample_path: speaker_samples/a6387c23-4ca4-42b5-8aaf-5699dbabbdf0.wav +6cf4d171-667d-4bc8-adbb-6d9b7c620cb8: + name: Minnie + sample_path: speaker_samples/6cf4d171-667d-4bc8-adbb-6d9b7c620cb8.wav