
# Chatterbox TTS API Reference

## Overview

The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.

- **Base URL**: `http://127.0.0.1:8000`
- **API Version**: 0.1.0
- **Framework**: FastAPI with automatic OpenAPI documentation

## Quick Start

- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI)
- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc)
- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json`

## Authentication

Currently, the API does not require authentication. CORS is configured to allow browser requests from the origins `http://localhost:8001` and `http://127.0.0.1:8001`.

---

## Endpoints

### 🏠 Root Endpoint

#### `GET /`
Welcome message and API status check.

**Response:**
```json
{
  "message": "Welcome to the Chatterbox TTS API!"
}
```

---

## 👥 Speaker Management

### `GET /api/speakers/`
Retrieve all available speakers.

**Response Model:** `List[Speaker]`

**Example Response:**
```json
[
  {
    "id": "speaker_001",
    "name": "John Doe",
    "sample_path": "/path/to/speaker_samples/john_doe.wav"
  },
  {
    "id": "speaker_002",
    "name": "Jane Smith",
    "sample_path": "/path/to/speaker_samples/jane_smith.wav"
  }
]
```

**Status Codes:**
- `200`: Success

---

### `POST /api/speakers/`
Create a new speaker from an audio sample.

**Request Type:** `multipart/form-data`

**Parameters:**
- `name` (form field, required): Speaker name
- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.)

**Response Model:** `SpeakerResponse`

**Example Response:**
```json
{
  "id": "speaker_003",
  "name": "Alex Johnson",
  "message": "Speaker added successfully."
}
```

**Status Codes:**
- `201`: Speaker created successfully
- `400`: Invalid file type or missing file
- `500`: Server error during speaker creation

**Example cURL:**
```bash
curl -X POST "http://127.0.0.1:8000/api/speakers/" \
  -F "name=Alex Johnson" \
  -F "audio_file=@/path/to/sample.wav"
```
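
For clients that cannot shell out to cURL, the same upload can be sketched in Python. This version uses only the standard library and builds the `multipart/form-data` body by hand. The helper names `build_speaker_form` and `create_speaker` are illustrative, not part of the API, and the MIME type is guessed from the file extension, which may not match what the server checks.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

BASE_URL = "http://127.0.0.1:8000"


def build_speaker_form(name: str, filename: str, audio: bytes) -> Tuple[bytes, str]:
    """Encode the `name` field and `audio_file` upload as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    # Guess the MIME type from the extension; fall back to a generic binary type.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="name"\r\n\r\n'
        f"{name}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def create_speaker(name: str, sample_path: str) -> dict:
    """POST the sample to /api/speakers/ and return the parsed JSON response."""
    path = Path(sample_path)
    body, content_type = build_speaker_form(name, path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{BASE_URL}/api/speakers/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

If the `requests` library is available, the same upload is usually shorter: pass the name via `data={"name": ...}` and the open file via `files={"audio_file": ...}` to `requests.post`.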

---

### `GET /api/speakers/{speaker_id}`
Get details for a specific speaker.

**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier

**Response Model:** `Speaker`

**Example Response:**
```json
{
  "id": "speaker_001",
  "name": "John Doe",
  "sample_path": "/path/to/speaker_samples/john_doe.wav"
}
```

**Status Codes:**
- `200`: Success
- `404`: Speaker not found

---

### `DELETE /api/speakers/{speaker_id}`
Delete a speaker by ID.

**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier

**Example Response:**
```json
{
  "message": "Speaker deleted successfully"
}
```

**Status Codes:**
- `200`: Speaker deleted successfully
- `404`: Speaker not found
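
A minimal sketch of how a client might call the two `{speaker_id}` endpoints and treat the documented `404` as "no such speaker" rather than as a crash. The function names are hypothetical, and percent-encoding the ID is a defensive assumption; the IDs shown in this reference would pass through unchanged.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"


def speaker_url(speaker_id: str) -> str:
    """Build the URL for one speaker, percent-encoding the ID."""
    return f"{BASE_URL}/api/speakers/{urllib.parse.quote(speaker_id, safe='')}"


def call_speaker(speaker_id: str, method: str = "GET") -> Optional[dict]:
    """Send GET or DELETE for a speaker; return parsed JSON, or None on 404."""
    request = urllib.request.Request(speaker_url(speaker_id), method=method)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # the documented "Speaker not found" case
        raise  # anything else is unexpected and should surface
```

Calling `call_speaker("speaker_001")` fetches the speaker, and `call_speaker("speaker_001", method="DELETE")` removes it; both return `None` when the server answers `404`.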

---

## 🎭 Dialog Generation

### `POST /api/dialog/generate_line`
Generate audio for a single dialog line (speech or silence).

**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem`

#### Speech Item Example:
```json
{
  "type": "speech",
  "speaker_id": "speaker_001",
  "text": "Hello, this is a test message.",
  "exaggeration": 0.7,
  "cfg_weight": 0.6,
  "temperature": 0.8,
  "use_existing_audio": false,
  "audio_url": null
}
```

#### Silence Item Example:
```json
{
  "type": "silence",
  "duration": 2.0,
  "use_existing_audio": false,
  "audio_url": null
}
```

**Response:**
```json
{
  "audio_url": "/generated_audio/line_abc123def456.wav",
  "type": "speech",
  "text": "Hello, this is a test message."
}
```

**Status Codes:**
- `200`: Audio generated successfully
- `400`: Invalid request format or unknown dialog item type
- `404`: Speaker not found
- `500`: Server error during generation
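
The request above could be sent from Python roughly as follows. This is a sketch under assumptions: `encode_line_item` and `generate_line` are illustrative names, and the client-side type check only mirrors the documented `400` for unknown item types; the server remains the authority on what it accepts.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"
KNOWN_TYPES = {"speech", "silence"}


def encode_line_item(item: dict) -> bytes:
    """Serialize one dialog item, rejecting types the endpoint does not document."""
    if item.get("type") not in KNOWN_TYPES:
        raise ValueError(f"Unknown dialog item type: {item.get('type')!r}")
    return json.dumps(item).encode("utf-8")


def generate_line(item: dict) -> dict:
    """POST a single speech or silence item and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/dialog/generate_line",
        data=encode_line_item(item),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `generate_line({"type": "silence", "duration": 2.0})` would request two seconds of silence and return a dictionary containing `audio_url`.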

---

### `POST /api/dialog/generate`
Generate a complete dialog from multiple speech and silence items.

**Request Model:** `DialogRequest`

**Request Body:**
```json
{
  "dialog_items": [
    {
      "type": "speech",
      "speaker_id": "speaker_001",
      "text": "Welcome to our podcast!",
      "exaggeration": 0.5,
      "cfg_weight": 0.5,
      "temperature": 0.8
    },
    {
      "type": "silence",
      "duration": 1.0
    },
    {
      "type": "speech",
      "speaker_id": "speaker_002",
      "text": "Thank you for having me!",
      "exaggeration": 0.6,
      "cfg_weight": 0.7,
      "temperature": 0.9
    }
  ],
  "output_base_name": "podcast_episode_01"
}
```

**Response Model:** `DialogResponse`

**Example Response:**
```json
{
  "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
  "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
  "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
  "temp_dir_path": "/path/to/temp/directory",
  "error_message": null
}
```

**Status Codes:**
- `200`: Dialog generated successfully
- `400`: Invalid request format or validation errors
- `404`: Speaker or file not found
- `500`: Server error during generation
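
Because `DialogResponse` carries both an `error_message` and a multi-line `log`, a client may want to check them before using the URLs. The helper below is a hypothetical sketch; whether the server ever returns a non-null `error_message` alongside a `200` status is not stated here, so the check is purely defensive.

```python
def unpack_dialog_response(result: dict) -> dict:
    """Raise if the response reports an error; otherwise return the useful parts."""
    if result.get("error_message"):
        raise RuntimeError(f"Dialog generation failed: {result['error_message']}")
    return {
        "audio_url": result.get("concatenated_audio_url"),
        "zip_url": result.get("zip_archive_url"),
        # The log is one newline-separated string; split it for easier display.
        "log_lines": (result.get("log") or "").splitlines(),
    }
```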

---

## 📁 Static File Serving

### `GET /generated_audio/{filename}`
Serve generated audio files and ZIP archives.

**Path Parameters:**
- `filename` (string, required): Name of the generated file

**Response:** Binary audio file or ZIP archive

**Example URLs:**
- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav`
- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip`
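
The `audio_url` and `*_url` fields in the example responses are paths relative to the API server, so a client has to resolve them against the base URL before downloading. A minimal sketch, with `absolute_url` and `download` as illustrative names:

```python
import shutil
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:8000"


def absolute_url(path_or_url: str) -> str:
    """Resolve a server-relative path against the API base URL.

    Values that are already absolute URLs are returned unchanged.
    """
    return urljoin(BASE_URL + "/", path_or_url)


def download(path_or_url: str, destination: str) -> str:
    """Stream a generated file to disk and return the destination path."""
    with urllib.request.urlopen(absolute_url(path_or_url)) as response:
        with open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
    return destination
```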

---

## 📋 Data Models

### Speaker Models

#### `Speaker`
```json
{
  "id": "string",
  "name": "string",
  "sample_path": "string|null"
}
```

#### `SpeakerResponse`
```json
{
  "id": "string",
  "name": "string",
  "message": "string|null"
}
```

### Dialog Models

#### `SpeechItem`
```json
{
  "type": "speech",
  "speaker_id": "string",
  "text": "string",
  "exaggeration": 0.5,        // 0.0-2.0, controls expressiveness
  "cfg_weight": 0.5,          // 0.0-2.0, alignment with speaker characteristics
  "temperature": 0.8,         // 0.0-2.0, randomness in generation
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `SilenceItem`
```json
{
  "type": "silence",
  "duration": 1.0,            // seconds, must be > 0
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `DialogRequest`
```json
{
  "dialog_items": [
    // Array of SpeechItem and/or SilenceItem objects
  ],
  "output_base_name": "string"  // Base name for output files
}
```

#### `DialogResponse`
```json
{
  "log": "string",                        // Processing log
  "concatenated_audio_url": "string|null", // URL to final audio
  "zip_archive_url": "string|null",       // URL to ZIP archive
  "temp_dir_path": "string|null",         // Server temp directory
  "error_message": "string|null"          // Error details if failed
}
```
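
On the client side, the request models can be mirrored with plain dataclasses so that payloads are built from typed objects rather than hand-written dictionaries. The sketch below is an assumption about how a client might do this; it reuses the documented field names and defaults, but the server's own model definitions may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Union


@dataclass
class SpeechItem:
    speaker_id: str
    text: str
    exaggeration: float = 0.5
    cfg_weight: float = 0.5
    temperature: float = 0.8
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "speech"  # discriminator expected by the API


@dataclass
class SilenceItem:
    duration: float  # seconds, must be > 0
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "silence"


@dataclass
class DialogRequest:
    output_base_name: str
    dialog_items: List[Union[SpeechItem, SilenceItem]] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Convert the request, including nested items, to a JSON-ready dict."""
        return asdict(self)
```

A request built as `DialogRequest("intro", [SpeechItem("speaker_001", "Hi"), SilenceItem(1.0)]).to_payload()` can be passed directly to `json.dumps`.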

---

## 🎛️ TTS Parameters

### Exaggeration (`exaggeration`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.

### CFG Weight (`cfg_weight`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.

### Temperature (`temperature`)
- **Range**: 0.0 - 2.0
- **Default**: 0.8
- **Description**: Controls randomness in generation. Lower values produce more deterministic speech, higher values add more variation.
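
The ranges and defaults above can be checked on the client before a request is sent. This is a sketch, not the server's validation logic: `resolve_tts_params` is a hypothetical helper, and it assumes both ends of each range are inclusive.

```python
# (minimum, maximum, default) for each documented TTS parameter
TTS_PARAM_SPECS = {
    "exaggeration": (0.0, 2.0, 0.5),
    "cfg_weight": (0.0, 2.0, 0.5),
    "temperature": (0.0, 2.0, 0.8),
}


def resolve_tts_params(**overrides: float) -> dict:
    """Fill in documented defaults and reject values outside the documented range."""
    params = {}
    for name, (low, high, default) in TTS_PARAM_SPECS.items():
        value = overrides.pop(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        params[name] = value
    if overrides:
        # Anything left over is not a documented parameter.
        raise TypeError(f"Unknown TTS parameter(s): {sorted(overrides)}")
    return params
```

The returned dictionary can be merged into a speech item, for example `{"type": "speech", "speaker_id": "speaker_001", "text": "Hi", **resolve_tts_params(temperature=0.9)}`.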

---

## 🔧 Configuration

### Directory Layout
The API reads from and writes to the following directories (configurable in `app/config.py`):

- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/`
- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/`
- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/`

### CORS Settings
- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001`
- Allowed Methods: All
- Allowed Headers: All
- Credentials: Enabled

---

## 🚀 Usage Examples

### Python Client Example

```python
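import requests

# Base URL
BASE_URL = "http://127.0.0.1:8000"

# Get all speakers
speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
print("Available speakers:", speakers)

# The dialog below uses the first speaker, so stop early if none exist
if not speakers:
    raise SystemExit("No speakers found; create one via POST /api/speakers/ first.")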

# Generate a simple dialog
dialog_request = {
    "dialog_items": [
        {
            "type": "speech",
            "speaker_id": speakers[0]["id"],
            "text": "Hello world!",
            "exaggeration": 0.7,
            "cfg_weight": 0.6,
            "temperature": 0.9
        },
        {
            "type": "silence",
            "duration": 1.0
        }
    ],
    "output_base_name": "test_dialog"
}

response = requests.post(
    f"{BASE_URL}/api/dialog/generate",
    json=dialog_request
)

if response.status_code == 200:
    result = response.json()
    print("Dialog generated!")
    print("Audio URL:", result["concatenated_audio_url"])
    print("ZIP URL:", result["zip_archive_url"])
else:
    print("Error:", response.text)
```

### JavaScript/Frontend Example

```javascript
// Generate dialog
const dialogRequest = {
  dialog_items: [
    {
      type: "speech",
      speaker_id: "speaker_001",
      text: "Welcome to our show!",
      exaggeration: 0.6,
      cfg_weight: 0.5,
      temperature: 0.8
    }
  ],
  output_base_name: "intro"
};

fetch('http://127.0.0.1:8000/api/dialog/generate', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(dialogRequest)
})
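.then(response => {
  // Surface HTTP errors instead of trying to play a missing file
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
})
.then(data => {
  console.log('Dialog generated:', data);
  // The returned URL is relative to the API server, so resolve it against the API origin
  const audio = new Audio(new URL(data.concatenated_audio_url, 'http://127.0.0.1:8000').href);
  audio.play();
})
.catch(error => console.error('Dialog generation failed:', error));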
```

---

## ⚠️ Error Handling

### Common Error Responses

#### 400 Bad Request
```json
{
  "detail": "Invalid value or configuration: Text cannot be empty"
}
```

#### 404 Not Found
```json
{
  "detail": "Speaker sample for ID 'invalid_speaker' not found."
}
```

#### 500 Internal Server Error
```json
{
  "detail": "Runtime error during dialog generation: CUDA out of memory"
}
```

### Error Categories
- **Validation Errors**: Invalid input format, missing required fields
- **Resource Errors**: Speaker not found, file not accessible
- **Processing Errors**: TTS model failures, audio processing issues
- **System Errors**: Memory issues, disk space, model loading failures
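
Since every error example above carries its message in a `detail` field, a client can turn any failed response into a single readable line. The helper below is an illustrative sketch; it also tolerates a non-string `detail` (FastAPI's built-in request validation typically returns a list there) and bodies that are not JSON at all.

```python
import json

STATUS_LABELS = {400: "Bad Request", 404: "Not Found", 500: "Internal Server Error"}


def describe_error(status_code: int, body: bytes) -> str:
    """Format an error response as '<code> <label>: <detail>'."""
    label = STATUS_LABELS.get(status_code, "Error")
    try:
        detail = json.loads(body)["detail"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or JSON without a `detail` key: fall back to the raw text.
        detail = body.decode("utf-8", errors="replace")
    if not isinstance(detail, str):
        detail = json.dumps(detail)
    return f"{status_code} {label}: {detail}"
```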

---

## 🔍 Development & Debugging

### Running the Server
```bash
# From project root
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```

### API Documentation
- **Swagger UI**: `http://127.0.0.1:8000/docs`
- **ReDoc**: `http://127.0.0.1:8000/redoc`

### Logging
The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations.

### File Management
- Generated files are stored in `backend/tts_generated_dialogs/`
- Temporary processing files are kept for inspection (not auto-deleted)
- ZIP archives contain individual audio segments plus concatenated result
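
Because nothing is deleted automatically, old outputs accumulate. One way to prune them is a small script like the sketch below; `remove_old_outputs` is a hypothetical helper, not part of the project, and it defaults to a dry run so nothing is removed until `dry_run=False` is passed explicitly.

```python
import time
from pathlib import Path
from typing import List


def remove_old_outputs(directory: str, max_age_days: float, dry_run: bool = True) -> List[str]:
    """Delete files older than `max_age_days` from `directory`.

    Returns the names of the files that were (or, in a dry run, would be) removed.
    Subdirectories are left untouched.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path.name)
    return removed
```

For example, `remove_old_outputs("backend/tts_generated_dialogs", max_age_days=7)` lists generated files older than a week without deleting them.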

---

## 📝 Notes

- The API automatically loads and unloads TTS models to manage memory usage
- Speaker audio samples should be clear, single-speaker recordings for best results
- Large dialogs may take significant time to process depending on hardware
- Generated files are served statically and persist until manually cleaned up

---

*Generated on: 2025-06-06*
*API Version: 0.1.0*