2025-08-14 15:44:25 +00:00
2 changed files with 521 additions and 0 deletions
--- a/API_REFERENCE.md
+++ b/API_REFERENCE.md
@@ -0,0 +1,518 @@
+# Chatterbox TTS API Reference
+
+## Overview
+
+The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.
+
+**Base URL**: `http://127.0.0.1:8000`  
+**API Version**: 0.1.0  
+**Framework**: FastAPI with automatic OpenAPI documentation
+
+## Quick Start
+
+- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI)
+- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc)
+- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json`
+
+## Authentication
+
+The API currently requires no authentication. CORS is configured to allow requests from the origins `http://localhost:8001` and `http://127.0.0.1:8001`.
+
+---
+
+## Endpoints
+
+### 🏠 Root Endpoint
+
+#### `GET /`
+Welcome message and API status check.
+
+**Response:**
+```json
+{
+  "message": "Welcome to the Chatterbox TTS API!"
+}
+```
+
+---
+
+## 👥 Speaker Management
+
+### `GET /api/speakers/`
+Retrieve all available speakers.
+
+**Response Model:** `List[Speaker]`
+
+**Example Response:**
+```json
+[
+  {
+    "id": "speaker_001",
+    "name": "John Doe",
+    "sample_path": "/path/to/speaker_samples/john_doe.wav"
+  },
+  {
+    "id": "speaker_002", 
+    "name": "Jane Smith",
+    "sample_path": "/path/to/speaker_samples/jane_smith.wav"
+  }
+]
+```
+
+**Status Codes:**
+- `200`: Success
+
+---
+
+### `POST /api/speakers/`
+Create a new speaker from an audio sample.
+
+**Request Type:** `multipart/form-data`
+
+**Parameters:**
+- `name` (form field, required): Speaker name
+- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.)
+
+**Response Model:** `SpeakerResponse`
+
+**Example Response:**
+```json
+{
+  "id": "speaker_003",
+  "name": "Alex Johnson",
+  "message": "Speaker added successfully."
+}
+```
+
+**Status Codes:**
+- `201`: Speaker created successfully
+- `400`: Invalid file type or missing file
+- `500`: Server error during speaker creation
+
+**Example cURL:**
+```bash
+curl -X POST "http://127.0.0.1:8000/api/speakers/" \
+  -F "name=Alex Johnson" \
+  -F "audio_file=@/path/to/sample.wav"
+```
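+
+The same upload can be sketched in Python using only the standard library. The helper names `encode_multipart` and `create_speaker` are illustrative and not part of the API; the field names `name` and `audio_file` are taken from the parameter list above.
+
+```python
+import json
+import mimetypes
+import urllib.request
+import uuid
+from pathlib import Path
+
+BASE_URL = "http://127.0.0.1:8000"
+
+
+def encode_multipart(fields, file_field, filename, file_bytes):
+    """Encode form fields plus one file as multipart/form-data.
+
+    Returns (body_bytes, content_type_header).
+    """
+    boundary = uuid.uuid4().hex
+    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
+    parts = []
+    for key, value in fields.items():
+        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"\r\n\r\n'
+        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
+    file_header = (
+        f'--{boundary}\r\n'
+        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
+        f'Content-Type: {mime}\r\n\r\n'
+    )
+    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
+    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
+    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
+
+
+def create_speaker(name, sample_path):
+    """POST the sample to /api/speakers/ and return the parsed JSON response."""
+    path = Path(sample_path)
+    body, content_type = encode_multipart(
+        {"name": name}, "audio_file", path.name, path.read_bytes()
+    )
+    request = urllib.request.Request(
+        f"{BASE_URL}/api/speakers/",
+        data=body,
+        headers={"Content-Type": content_type},
+        method="POST",
+    )
+    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
+    with urllib.request.urlopen(request) as response:
+        return json.loads(response.read())
+```
+
+With the server running, `create_speaker("Alex Johnson", "/path/to/sample.wav")` should return a dict shaped like the example response above.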
+
+---
+
+### `GET /api/speakers/{speaker_id}`
+Get details for a specific speaker.
+
+**Path Parameters:**
+- `speaker_id` (string, required): Unique speaker identifier
+
+**Response Model:** `Speaker`
+
+**Example Response:**
+```json
+{
+  "id": "speaker_001",
+  "name": "John Doe", 
+  "sample_path": "/path/to/speaker_samples/john_doe.wav"
+}
+```
+
+**Status Codes:**
+- `200`: Success
+- `404`: Speaker not found
+
+---
+
+### `DELETE /api/speakers/{speaker_id}`
+Delete a speaker by ID.
+
+**Path Parameters:**
+- `speaker_id` (string, required): Unique speaker identifier
+
+**Example Response:**
+```json
+{
+  "message": "Speaker deleted successfully"
+}
+```
+
+**Status Codes:**
+- `200`: Speaker deleted successfully
+- `404`: Speaker not found
+
+---
+
+## 🎭 Dialog Generation
+
+### `POST /api/dialog/generate_line`
+Generate audio for a single dialog line (speech or silence).
+
+**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem`
+
+#### Speech Item Example:
+```json
+{
+  "type": "speech",
+  "speaker_id": "speaker_001",
+  "text": "Hello, this is a test message.",
+  "exaggeration": 0.7,
+  "cfg_weight": 0.6,
+  "temperature": 0.8,
+  "use_existing_audio": false,
+  "audio_url": null
+}
+```
+
+#### Silence Item Example:
+```json
+{
+  "type": "silence",
+  "duration": 2.0,
+  "use_existing_audio": false,
+  "audio_url": null
+}
+```
+
+**Response:**
+```json
+{
+  "audio_url": "/generated_audio/line_abc123def456.wav",
+  "type": "speech",
+  "text": "Hello, this is a test message."
+}
+```
+
+**Status Codes:**
+- `200`: Audio generated successfully
+- `400`: Invalid request format or unknown dialog item type
+- `404`: Speaker not found
+- `500`: Server error during generation
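+
+A minimal standard-library sketch of calling this endpoint follows. `build_line_request` and `generate_line` are hypothetical helper names; the payload mirrors the silence example above.
+
+```python
+import json
+import urllib.error
+import urllib.request
+
+BASE_URL = "http://127.0.0.1:8000"
+
+
+def build_line_request(item):
+    """Build (but do not send) the POST request for one dialog item."""
+    return urllib.request.Request(
+        f"{BASE_URL}/api/dialog/generate_line",
+        data=json.dumps(item).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def generate_line(item):
+    """Send one item and return the response dict.
+
+    Raises RuntimeError carrying the HTTP status and raw error body.
+    """
+    try:
+        with urllib.request.urlopen(build_line_request(item)) as response:
+            return json.loads(response.read())
+    except urllib.error.HTTPError as err:
+        raw = err.read().decode("utf-8", "replace")
+        raise RuntimeError(f"HTTP {err.code}: {raw}") from err
+
+
+silence_item = {
+    "type": "silence",
+    "duration": 2.0,
+    "use_existing_audio": False,
+    "audio_url": None,
+}
+```
+
+Calling `generate_line(silence_item)` against a running server should return a dict containing `audio_url`, which is a path relative to the API host.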
+
+---
+
+### `POST /api/dialog/generate`
+Generate a complete dialog from multiple speech and silence items.
+
+**Request Model:** `DialogRequest`
+
+**Request Body:**
+```json
+{
+  "dialog_items": [
+    {
+      "type": "speech",
+      "speaker_id": "speaker_001", 
+      "text": "Welcome to our podcast!",
+      "exaggeration": 0.5,
+      "cfg_weight": 0.5,
+      "temperature": 0.8
+    },
+    {
+      "type": "silence",
+      "duration": 1.0
+    },
+    {
+      "type": "speech",
+      "speaker_id": "speaker_002",
+      "text": "Thank you for having me!",
+      "exaggeration": 0.6,
+      "cfg_weight": 0.7,
+      "temperature": 0.9
+    }
+  ],
+  "output_base_name": "podcast_episode_01"
+}
+```
+
+**Response Model:** `DialogResponse`
+
+**Example Response:**
+```json
+{
+  "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
+  "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
+  "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip", 
+  "temp_dir_path": "/path/to/temp/directory",
+  "error_message": null
+}
+```
+
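+**Status Codes:**
+- `200`: Dialog generated successfully
+- `400`: Invalid request format or validation errors
+- `404`: Speaker or file not found
+- `422`: Request body does not match the `DialogRequest` schema (FastAPI's default for request validation errors, unless the app overrides it)
+- `500`: Server error during generation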
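+
+For longer scripts it can help to build the request body programmatically. The sketch below assumes a script given as `(speaker_id, text)` pairs and inserts a pause between consecutive lines; `build_dialog_request` is an illustrative name, not an API function.
+
+```python
+def build_dialog_request(lines, output_base_name, pause_seconds=1.0,
+                         exaggeration=0.5, cfg_weight=0.5, temperature=0.8):
+    """Turn (speaker_id, text) pairs into a DialogRequest-shaped dict.
+
+    A silence item of pause_seconds is placed between consecutive lines.
+    The TTS parameters default to the documented default values.
+    """
+    items = []
+    for index, (speaker_id, text) in enumerate(lines):
+        if index > 0 and pause_seconds > 0:
+            items.append({"type": "silence", "duration": pause_seconds})
+        items.append({
+            "type": "speech",
+            "speaker_id": speaker_id,
+            "text": text,
+            "exaggeration": exaggeration,
+            "cfg_weight": cfg_weight,
+            "temperature": temperature,
+        })
+    return {"dialog_items": items, "output_base_name": output_base_name}
+
+
+podcast = build_dialog_request(
+    [("speaker_001", "Welcome to our podcast!"),
+     ("speaker_002", "Thank you for having me!")],
+    "podcast_episode_01",
+)
+```
+
+The resulting `podcast` dict has the same structure as the request body above (speech, silence, speech) and can be sent as the JSON payload.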
+
+---
+
+## 📁 Static File Serving
+
+### `GET /generated_audio/{filename}`
+Serve generated audio files and ZIP archives.
+
+**Path Parameters:**
+- `filename` (string, required): Name of the generated file
+
+**Response:** Binary audio file or ZIP archive
+
+**Example URLs:**
+- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav`
+- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip`
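+
+The `audio_url` and `zip_archive_url` values returned by the dialog endpoints are paths relative to the API host, so a client has to resolve them before downloading. A small sketch, with `absolute_url` and `download` as hypothetical helper names:
+
+```python
+import urllib.request
+from urllib.parse import urljoin
+
+BASE_URL = "http://127.0.0.1:8000"
+
+
+def absolute_url(relative_url):
+    """Resolve a path such as /generated_audio/x.wav against the API host."""
+    return urljoin(BASE_URL, relative_url)
+
+
+def download(relative_url, destination):
+    """Fetch a generated file, write it to destination, return bytes written."""
+    with urllib.request.urlopen(absolute_url(relative_url)) as response:
+        data = response.read()
+    with open(destination, "wb") as handle:
+        handle.write(data)
+    return len(data)
+```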
+
+---
+
+## 📋 Data Models
+
+### Speaker Models
+
+#### `Speaker`
+```json
+{
+  "id": "string",
+  "name": "string", 
+  "sample_path": "string|null"
+}
+```
+
+#### `SpeakerResponse`
+```json
+{
+  "id": "string",
+  "name": "string",
+  "message": "string|null"
+}
+```
+
+### Dialog Models
+
+#### `SpeechItem`
+```jsonc
+{
+  "type": "speech",
+  "speaker_id": "string",
+  "text": "string",
+  "exaggeration": 0.5,        // 0.0-2.0, controls expressiveness
+  "cfg_weight": 0.5,          // 0.0-2.0, alignment with speaker characteristics  
+  "temperature": 0.8,         // 0.0-2.0, randomness in generation
+  "use_existing_audio": false,
+  "audio_url": "string|null"
+}
+```
+
+#### `SilenceItem`
+```jsonc
+{
+  "type": "silence",
+  "duration": 1.0,            // seconds, must be > 0
+  "use_existing_audio": false,
+  "audio_url": "string|null"
+}
+```
+
+#### `DialogRequest`
+```jsonc
+{
+  "dialog_items": [
+    // Array of SpeechItem and/or SilenceItem objects
+  ],
+  "output_base_name": "string"  // Base name for output files
+}
+```
+
+#### `DialogResponse`
+```jsonc
+{
+  "log": "string",                        // Processing log
+  "concatenated_audio_url": "string|null", // URL to final audio
+  "zip_archive_url": "string|null",       // URL to ZIP archive
+  "temp_dir_path": "string|null",         // Server temp directory
+  "error_message": "string|null"          // Error details if failed
+}
+```
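+
+On the client side, the dialog models can be mirrored with plain dataclasses. This is only a sketch of the shapes listed above; the server's own model classes are not shown in this document and may differ in detail. `to_dialog_request` is an illustrative helper.
+
+```python
+from dataclasses import asdict, dataclass
+from typing import Optional
+
+
+@dataclass
+class SpeechItem:
+    speaker_id: str
+    text: str
+    exaggeration: float = 0.5
+    cfg_weight: float = 0.5
+    temperature: float = 0.8
+    use_existing_audio: bool = False
+    audio_url: Optional[str] = None
+    type: str = "speech"
+
+
+@dataclass
+class SilenceItem:
+    duration: float  # seconds, must be > 0
+    use_existing_audio: bool = False
+    audio_url: Optional[str] = None
+    type: str = "silence"
+
+
+def to_dialog_request(items, output_base_name):
+    """Serialize dataclass items into a DialogRequest-shaped dict."""
+    return {
+        "dialog_items": [asdict(item) for item in items],
+        "output_base_name": output_base_name,
+    }
+
+
+request_body = to_dialog_request(
+    [SpeechItem("speaker_001", "Hello"), SilenceItem(1.0)], "demo"
+)
+```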
+
+---
+
+## 🎛️ TTS Parameters
+
+### Exaggeration (`exaggeration`)
+- **Range**: 0.0 - 2.0
+- **Default**: 0.5
+- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.
+
+### CFG Weight (`cfg_weight`)
+- **Range**: 0.0 - 2.0  
+- **Default**: 0.5
+- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.
+
+### Temperature (`temperature`)
+- **Range**: 0.0 - 2.0
+- **Default**: 0.8
+- **Description**: Controls randomness in generation. Lower values produce more deterministic speech, higher values add more variation.
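+
+A client may want to check these ranges before sending a request. The sketch below assumes the documented bounds are inclusive, which this reference does not state explicitly; `checked_parameters` is a hypothetical helper.
+
+```python
+PARAMETER_RANGES = {
+    "exaggeration": (0.0, 2.0),
+    "cfg_weight": (0.0, 2.0),
+    "temperature": (0.0, 2.0),
+}
+PARAMETER_DEFAULTS = {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8}
+
+
+def checked_parameters(**overrides):
+    """Merge overrides into the documented defaults, rejecting out-of-range values."""
+    params = dict(PARAMETER_DEFAULTS)
+    for name, value in overrides.items():
+        if name not in PARAMETER_RANGES:
+            raise ValueError(f"Unknown TTS parameter: {name}")
+        low, high = PARAMETER_RANGES[name]
+        if not low <= value <= high:  # assumption: bounds are inclusive
+            raise ValueError(f"{name}={value} is outside {low}-{high}")
+        params[name] = value
+    return params
+```
+
+For example, `checked_parameters(temperature=0.9)` keeps the other two defaults, while `checked_parameters(exaggeration=2.5)` raises `ValueError`.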
+
+---
+
+## 🔧 Configuration
+
+### Directory Layout
+The API uses the following directories (configurable in `app/config.py`):
+
+- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/`
+- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/`
+- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/`
+
+### CORS Settings
+- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001`
+- Allowed Methods: All
+- Allowed Headers: All
+- Credentials: Enabled
+
+---
+
+## 🚀 Usage Examples
+
+### Python Client Example
+
+```python
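+import requests
+
+# Base URL
+BASE_URL = "http://127.0.0.1:8000"
+
+# Get all speakers
+speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
+print("Available speakers:", speakers)
+if not speakers:
+    # speakers[0] below would raise IndexError on an empty list
+    raise SystemExit("No speakers found; create one with POST /api/speakers/ first.")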
+
+# Generate a simple dialog
+dialog_request = {
+    "dialog_items": [
+        {
+            "type": "speech",
+            "speaker_id": speakers[0]["id"],
+            "text": "Hello world!",
+            "exaggeration": 0.7,
+            "cfg_weight": 0.6,
+            "temperature": 0.9
+        },
+        {
+            "type": "silence", 
+            "duration": 1.0
+        }
+    ],
+    "output_base_name": "test_dialog"
+}
+
+response = requests.post(
+    f"{BASE_URL}/api/dialog/generate",
+    json=dialog_request
+)
+
+if response.status_code == 200:
+    result = response.json()
+    print("Dialog generated!")
+    print("Audio URL:", result["concatenated_audio_url"])
+    print("ZIP URL:", result["zip_archive_url"])
+else:
+    print("Error:", response.text)
+```
+
+### JavaScript/Frontend Example
+
+```javascript
+// Generate dialog
+const dialogRequest = {
+  dialog_items: [
+    {
+      type: "speech",
+      speaker_id: "speaker_001",
+      text: "Welcome to our show!",
+      exaggeration: 0.6,
+      cfg_weight: 0.5,
+      temperature: 0.8
+    }
+  ],
+  output_base_name: "intro"
+};
+
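+const BASE_URL = 'http://127.0.0.1:8000';
+
+fetch(`${BASE_URL}/api/dialog/generate`, {
+  method: 'POST',
+  headers: {
+    'Content-Type': 'application/json',
+  },
+  body: JSON.stringify(dialogRequest)
+})
+.then(response => {
+  if (!response.ok) throw new Error(`HTTP ${response.status}`);
+  return response.json();
+})
+.then(data => {
+  console.log('Dialog generated:', data);
+  // The returned URL is relative to the API host, not the page origin
+  const audio = new Audio(new URL(data.concatenated_audio_url, BASE_URL).href);
+  audio.play();
+})
+.catch(error => console.error('Dialog generation failed:', error));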
+```
+
+---
+
+## ⚠️ Error Handling
+
+### Common Error Responses
+
+#### 400 Bad Request
+```json
+{
+  "detail": "Invalid value or configuration: Text cannot be empty"
+}
+```
+
+#### 404 Not Found
+```json
+{
+  "detail": "Speaker sample for ID 'invalid_speaker' not found."
+}
+```
+
+#### 500 Internal Server Error
+```json
+{
+  "detail": "Runtime error during dialog generation: CUDA out of memory"
+}
+```
+
+### Error Categories
+- **Validation Errors**: Invalid input format, missing required fields
+- **Resource Errors**: Speaker not found, file not accessible
+- **Processing Errors**: TTS model failures, audio processing issues
+- **System Errors**: Memory issues, disk space, model loading failures
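+
+Since the error bodies shown above all carry a `detail` field, a client can surface that message directly. In this sketch the mapping from status code to category is an interpretation of the examples above, not something the API reports, and `describe_error` is a hypothetical helper.
+
+```python
+import json
+
+
+def describe_error(status_code, body):
+    """Return a category label plus the API's detail message for a failed response."""
+    labels = {400: "validation", 404: "resource", 500: "processing/system"}
+    try:
+        detail = json.loads(body).get("detail", body)
+    except (ValueError, AttributeError):
+        detail = body  # not JSON, or not a JSON object: fall back to the raw text
+    return f"{labels.get(status_code, 'unexpected')} error ({status_code}): {detail}"
+```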
+
+---
+
+## 🔍 Development & Debugging
+
+### Running the Server
+```bash
+# From project root
+uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
+```
+
+### API Documentation
+- **Swagger UI**: `http://127.0.0.1:8000/docs`
+- **ReDoc**: `http://127.0.0.1:8000/redoc`
+
+### Logging
+The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations.
+
+### File Management
+- Generated files are stored in `backend/tts_generated_dialogs/`
+- Temporary processing files are kept for inspection (not auto-deleted)
+- ZIP archives contain individual audio segments plus concatenated result
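+
+A downloaded archive can be inspected with the standard `zipfile` module. The sketch below assumes the concatenated file has `concatenated` in its name, as in the example URLs; the naming of the individual segments inside the archive is not documented here, so any other `.wav` entry is treated as a segment. `split_archive` is an illustrative name.
+
+```python
+import io
+import zipfile
+
+
+def split_archive(zip_bytes):
+    """Return (segment_names, other_names) for a downloaded dialog archive."""
+    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
+        names = archive.namelist()
+    segments = [
+        name for name in names
+        if name.lower().endswith(".wav") and "concatenated" not in name
+    ]
+    others = [name for name in names if name not in segments]
+    return segments, others
+```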
+
+---
+
+## 📝 Notes
+
+- The API automatically loads and unloads TTS models to manage memory usage
+- Speaker audio samples should be clear, single-speaker recordings for best results
+- Large dialogs may take significant time to process depending on hardware
+- Generated files are served statically and persist until manually cleaned up
+
+---
+
+*Generated on: 2025-06-06*  
+*API Version: 0.1.0*
--- a/speaker_data/speakers.yaml
+++ b/speaker_data/speakers.yaml
@@ -16,3 +16,6 @@ fb84ce1c-f32d-4df9-9673-2c64e9603133:
 a6387c23-4ca4-42b5-8aaf-5699dbabbdf0:
  name: Mike
  sample_path: speaker_samples/a6387c23-4ca4-42b5-8aaf-5699dbabbdf0.wav
+6cf4d171-667d-4bc8-adbb-6d9b7c620cb8:
+  name: Minnie
+  sample_path: speaker_samples/6cf4d171-667d-4bc8-adbb-6d9b7c620cb8.wav