# Chatterbox TTS API Reference

## Overview

The Chatterbox TTS API is a FastAPI-based backend service that provides text-to-speech capabilities with speaker management and dialog generation features. The API supports creating custom speakers from audio samples and generating complex dialogs with multiple speakers, silences, and fine-tuned TTS parameters.

- **Base URL**: `http://127.0.0.1:8000`
- **API Version**: 0.1.0
- **Framework**: FastAPI with automatic OpenAPI documentation

## Quick Start

- **Interactive API Documentation**: `http://127.0.0.1:8000/docs` (Swagger UI)
- **Alternative Documentation**: `http://127.0.0.1:8000/redoc` (ReDoc)
- **OpenAPI Schema**: `http://127.0.0.1:8000/openapi.json`

## Authentication

Currently, the API does not require authentication. CORS is configured to allow requests from `localhost:8001` and `127.0.0.1:8001`.

---

## Endpoints

### 🏠 Root Endpoint

#### `GET /`
Welcome message and API status check.

**Response:**
```json
{
  "message": "Welcome to the Chatterbox TTS API!"
}
```
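
A quick way to confirm that the server is reachable is to call this endpoint and read the `message` field. The following is a minimal sketch using only the Python standard library; the helper names `parse_welcome` and `check_api` are illustrative and not part of the API.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def parse_welcome(body: bytes) -> str:
    """Extract the welcome message from the raw JSON body of GET /."""
    return json.loads(body)["message"]


def check_api(base_url: str = BASE_URL, timeout: float = 5.0) -> str:
    """Call GET / and return the welcome message (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/", timeout=timeout) as resp:
        return parse_welcome(resp.read())
```

With the server running, `check_api()` should return the welcome string; a connection error raised by `urlopen` suggests the server is not listening on the expected host and port.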

---

## 👥 Speaker Management

### `GET /api/speakers/`
Retrieve all available speakers.

**Response Model:** `List[Speaker]`

**Example Response:**
```json
[
  {
    "id": "speaker_001",
    "name": "John Doe",
    "sample_path": "/path/to/speaker_samples/john_doe.wav"
  },
  {
    "id": "speaker_002",
    "name": "Jane Smith",
    "sample_path": "/path/to/speaker_samples/jane_smith.wav"
  }
]
```

**Status Codes:**
- `200`: Success
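
Clients usually need a speaker's `id` rather than its display name, so it can help to index the list response by name. This sketch assumes the response shape shown above; `fetch_speakers` and `speakers_by_name` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict, List

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def fetch_speakers(base_url: str = BASE_URL, timeout: float = 10.0) -> List[dict]:
    """Call GET /api/speakers/ and return the parsed list (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/speakers/", timeout=timeout) as resp:
        return json.loads(resp.read())


def speakers_by_name(speakers: List[dict]) -> Dict[str, str]:
    """Map each speaker's name to its id; later duplicates overwrite earlier ones."""
    return {speaker["name"]: speaker["id"] for speaker in speakers}
```

For example, `speakers_by_name(fetch_speakers())["John Doe"]` would look up the id to place in a `SpeechItem`.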

---

### `POST /api/speakers/`
Create a new speaker from an audio sample.

**Request Type:** `multipart/form-data`

**Parameters:**
- `name` (form field, required): Speaker name
- `audio_file` (file upload, required): Audio sample file (WAV, MP3, etc.)

**Response Model:** `SpeakerResponse`

**Example Response:**
```json
{
  "id": "speaker_003",
  "name": "Alex Johnson",
  "message": "Speaker added successfully."
}
```

**Status Codes:**
- `201`: Speaker created successfully
- `400`: Invalid file type or missing file
- `500`: Server error during speaker creation

**Example cURL:**
```bash
curl -X POST "http://127.0.0.1:8000/api/speakers/" \
  -F "name=Alex Johnson" \
  -F "audio_file=@/path/to/sample.wav"
```
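
The same upload can be sketched in Python without third-party packages by encoding the two form parts by hand. The field names `name` and `audio_file` come from the parameter list above; the function names, the media-type guess, and the timeout are assumptions made for illustration, and the server may accept or reject media types differently than shown here.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def encode_speaker_form(name: str, filename: str, audio: bytes,
                        boundary: str = "") -> Tuple[bytes, str]:
    """Build a multipart/form-data body holding the `name` and `audio_file` parts."""
    boundary = boundary or uuid.uuid4().hex
    # Guess the file part's media type from its extension; fall back to a generic type.
    media_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    lines = [
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="name"',
        b"",
        name.encode("utf-8"),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="audio_file"; filename="{filename}"'.encode(),
        f"Content-Type: {media_type}".encode(),
        b"",
        audio,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def create_speaker(name: str, sample_path: str, base_url: str = BASE_URL) -> dict:
    """POST the form to /api/speakers/ and return the parsed JSON (needs a running server)."""
    sample = Path(sample_path)
    body, content_type = encode_speaker_form(name, sample.name, sample.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/speakers/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read())
```

A call such as `create_speaker("Alex Johnson", "/path/to/sample.wav")` mirrors the cURL command above. Clients that already depend on an HTTP library with multipart support can use that instead of the manual encoder.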

---

### `GET /api/speakers/{speaker_id}`
Get details for a specific speaker.

**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier

**Response Model:** `Speaker`

**Example Response:**
```json
{
  "id": "speaker_001",
  "name": "John Doe",
  "sample_path": "/path/to/speaker_samples/john_doe.wav"
}
```

**Status Codes:**
- `200`: Success
- `404`: Speaker not found

---

### `DELETE /api/speakers/{speaker_id}`
Delete a speaker by ID.

**Path Parameters:**
- `speaker_id` (string, required): Unique speaker identifier

**Example Response:**
```json
{
  "message": "Speaker deleted successfully"
}
```

**Status Codes:**
- `200`: Speaker deleted successfully
- `404`: Speaker not found
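
Both per-speaker endpoints answer `404` for an unknown ID, which a client can turn into an ordinary return value instead of an exception. The sketch below uses the standard library only; `speaker_url`, `get_speaker`, and `delete_speaker` are hypothetical helper names, and percent-encoding the ID is a defensive assumption rather than a documented requirement.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def speaker_url(speaker_id: str, base_url: str = BASE_URL) -> str:
    """Build the per-speaker URL, percent-encoding the ID so it stays one path segment."""
    return f"{base_url}/api/speakers/{urllib.parse.quote(speaker_id, safe='')}"


def get_speaker(speaker_id: str, base_url: str = BASE_URL) -> Optional[dict]:
    """Return the Speaker object, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(speaker_url(speaker_id, base_url), timeout=10) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def delete_speaker(speaker_id: str, base_url: str = BASE_URL) -> bool:
    """Return True if the speaker was deleted, False if the server answers 404."""
    request = urllib.request.Request(speaker_url(speaker_id, base_url), method="DELETE")
    try:
        with urllib.request.urlopen(request, timeout=10):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Any other error status is re-raised so that server-side failures are not silently treated as "not found".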

---

## 🎭 Dialog Generation

### `POST /api/dialog/generate_line`
Generate audio for a single dialog line (speech or silence).

**Request Body:** Raw JSON object representing either a `SpeechItem` or `SilenceItem`

#### Speech Item Example:
```json
{
  "type": "speech",
  "speaker_id": "speaker_001",
  "text": "Hello, this is a test message.",
  "exaggeration": 0.7,
  "cfg_weight": 0.6,
  "temperature": 0.8,
  "use_existing_audio": false,
  "audio_url": null
}
```

#### Silence Item Example:
```json
{
  "type": "silence",
  "duration": 2.0,
  "use_existing_audio": false,
  "audio_url": null
}
```

**Response:**
```json
{
  "audio_url": "/generated_audio/line_abc123def456.wav",
  "type": "speech",
  "text": "Hello, this is a test message."
}
```

**Status Codes:**
- `200`: Audio generated successfully
- `400`: Invalid request format or unknown dialog item type
- `404`: Speaker not found
- `500`: Server error during generation
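
Because this endpoint takes a raw JSON object, a client can check the `type` field locally before sending anything. The following sketch builds the request with the standard library; `build_line_request` and `generate_line` are illustrative names, and the long timeout is an assumption about generation time rather than a documented value.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def build_line_request(item: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Wrap a speech or silence item in a POST request for /api/dialog/generate_line."""
    if item.get("type") not in ("speech", "silence"):
        # Mirrors the documented 400 case for unknown dialog item types.
        raise ValueError(f"unknown dialog item type: {item.get('type')!r}")
    return urllib.request.Request(
        f"{base_url}/api/dialog/generate_line",
        data=json.dumps(item).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_line(item: dict, base_url: str = BASE_URL) -> dict:
    """Send the item and return the parsed response (needs a running server)."""
    with urllib.request.urlopen(build_line_request(item, base_url), timeout=300) as resp:
        return json.loads(resp.read())
```

For instance, `generate_line({"type": "silence", "duration": 2.0})` would request a two-second silence, and the returned `audio_url` can then be fetched from the static file route.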

---

### `POST /api/dialog/generate`
Generate a complete dialog from multiple speech and silence items.

**Request Model:** `DialogRequest`

**Request Body:**
```json
{
  "dialog_items": [
    {
      "type": "speech",
      "speaker_id": "speaker_001",
      "text": "Welcome to our podcast!",
      "exaggeration": 0.5,
      "cfg_weight": 0.5,
      "temperature": 0.8
    },
    {
      "type": "silence",
      "duration": 1.0
    },
    {
      "type": "speech",
      "speaker_id": "speaker_002",
      "text": "Thank you for having me!",
      "exaggeration": 0.6,
      "cfg_weight": 0.7,
      "temperature": 0.9
    }
  ],
  "output_base_name": "podcast_episode_01"
}
```

**Response Model:** `DialogResponse`

**Example Response:**
```json
{
  "log": "Processing dialog with 3 items...\nGenerating speech for item 1...\nGenerating silence for item 2...\nGenerating speech for item 3...\nConcatenating audio segments...\nZIP archive created at: /path/to/output.zip",
  "concatenated_audio_url": "/generated_audio/podcast_episode_01_concatenated.wav",
  "zip_archive_url": "/generated_audio/podcast_episode_01_archive.zip",
  "temp_dir_path": "/path/to/temp/directory",
  "error_message": null
}
```

**Status Codes:**
- `200`: Dialog generated successfully
- `400`: Invalid request format or validation errors
- `404`: Speaker or file not found
- `500`: Server error during generation
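
The URLs in a `DialogResponse` are relative to the API server, and `error_message` may be set when generation fails. A client-side helper can cover both points. This is a sketch under the assumption that a non-null `error_message` means the output URLs should not be used; `dialog_outputs` is a hypothetical name.

```python
from typing import Dict
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def dialog_outputs(response: dict, base_url: str = BASE_URL) -> Dict[str, str]:
    """Return absolute download URLs from a DialogResponse, or raise if it reports an error."""
    if response.get("error_message"):
        raise RuntimeError(f"dialog generation failed: {response['error_message']}")
    outputs = {}
    for key in ("concatenated_audio_url", "zip_archive_url"):
        if response.get(key):  # both fields are documented as nullable
            outputs[key] = urljoin(base_url, response[key])
    return outputs
```

Applied to the example response above, the helper would yield `http://127.0.0.1:8000/generated_audio/podcast_episode_01_concatenated.wav` for the concatenated audio.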

---

## 📁 Static File Serving

### `GET /generated_audio/{filename}`
Serve generated audio files and ZIP archives.

**Path Parameters:**
- `filename` (string, required): Name of the generated file

**Response:** Binary audio file or ZIP archive

**Example URLs:**
- `http://127.0.0.1:8000/generated_audio/dialog_concatenated.wav`
- `http://127.0.0.1:8000/generated_audio/dialog_archive.zip`
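
Since the response is a binary file, a client typically streams it to disk. The sketch below accepts either a relative `/generated_audio/...` path or a full URL; `local_name` and `download_generated` are hypothetical helpers, and the destination directory is assumed to exist.

```python
import posixpath
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urljoin, urlparse

BASE_URL = "http://127.0.0.1:8000"  # base URL from this reference


def local_name(audio_url: str) -> str:
    """Return the file name component of a /generated_audio/... path or URL."""
    return posixpath.basename(urlparse(audio_url).path)


def download_generated(audio_url: str, dest_dir: str = ".",
                       base_url: str = BASE_URL) -> Path:
    """Stream a generated file to dest_dir and return its local path (needs a running server)."""
    target = Path(dest_dir) / local_name(audio_url)
    with urllib.request.urlopen(urljoin(base_url, audio_url), timeout=60) as resp, \
            open(target, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks rather than reading everything at once
    return target
```

For example, `download_generated("/generated_audio/dialog_archive.zip")` would save `dialog_archive.zip` in the current directory.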

---

## 📋 Data Models

### Speaker Models

#### `Speaker`
```json
{
  "id": "string",
  "name": "string",
  "sample_path": "string|null"
}
```

#### `SpeakerResponse`
```json
{
  "id": "string",
  "name": "string",
  "message": "string|null"
}
```

### Dialog Models

#### `SpeechItem`
```json
{
  "type": "speech",
  "speaker_id": "string",
  "text": "string",
  "exaggeration": 0.5,      // 0.0-2.0, controls expressiveness
  "cfg_weight": 0.5,        // 0.0-2.0, alignment with speaker characteristics
  "temperature": 0.8,       // 0.0-2.0, randomness in generation
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `SilenceItem`
```json
{
  "type": "silence",
  "duration": 1.0,          // seconds, must be > 0
  "use_existing_audio": false,
  "audio_url": "string|null"
}
```

#### `DialogRequest`
```json
{
  "dialog_items": [
    // Array of SpeechItem and/or SilenceItem objects
  ],
  "output_base_name": "string"  // Base name for output files
}
```

#### `DialogResponse`
```json
{
  "log": "string",                          // Processing log
  "concatenated_audio_url": "string|null",  // URL to final audio
  "zip_archive_url": "string|null",         // URL to ZIP archive
  "temp_dir_path": "string|null",           // Server temp directory
  "error_message": "string|null"            // Error details if failed
}
```
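
On the client side, the request models above can be mirrored with plain dataclasses so that payloads are built from typed objects rather than hand-written dictionaries. This is only a client-side sketch: the server's own model definitions are not shown in this reference, and the class layout and the `to_payload` method are assumptions. The default values follow the schemas above.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Union


@dataclass
class SpeechItem:
    speaker_id: str
    text: str
    exaggeration: float = 0.5
    cfg_weight: float = 0.5
    temperature: float = 0.8
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "speech"  # discriminator expected by the API


@dataclass
class SilenceItem:
    duration: float  # seconds, must be > 0
    use_existing_audio: bool = False
    audio_url: Optional[str] = None
    type: str = "silence"  # discriminator expected by the API


@dataclass
class DialogRequest:
    output_base_name: str
    dialog_items: List[Union[SpeechItem, SilenceItem]] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Convert the request, including nested items, into a JSON-ready dict."""
        return asdict(self)
```

A request built as `DialogRequest("intro", [SpeechItem("speaker_001", "Hi"), SilenceItem(1.0)])` can be serialized with `json.dumps(request.to_payload())` and posted to `/api/dialog/generate`.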

---

## 🎛️ TTS Parameters

### Exaggeration (`exaggeration`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Controls the expressiveness of speech. Higher values produce more exaggerated, emotional speech.

### CFG Weight (`cfg_weight`)
- **Range**: 0.0 - 2.0
- **Default**: 0.5
- **Description**: Classifier-Free Guidance weight. Higher values make speech more aligned with the prompt text and speaker characteristics.

### Temperature (`temperature`)
- **Range**: 0.0 - 2.0
- **Default**: 0.8
- **Description**: Controls randomness in generation. Lower values produce more deterministic speech, higher values add more variation.
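
The ranges and defaults above can be enforced on the client before a request is sent. The sketch below treats both range bounds as inclusive, which is an assumption, since the reference does not say whether 0.0 and 2.0 themselves are accepted; `with_tts_defaults` is a hypothetical helper.

```python
# Documented ranges and defaults for the three TTS parameters.
TTS_RANGES = {
    "exaggeration": (0.0, 2.0),
    "cfg_weight": (0.0, 2.0),
    "temperature": (0.0, 2.0),
}
TTS_DEFAULTS = {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8}


def with_tts_defaults(params: dict) -> dict:
    """Fill in missing TTS parameters and reject values outside the documented ranges."""
    merged = {**TTS_DEFAULTS, **params}
    for name, (low, high) in TTS_RANGES.items():
        if not low <= merged[name] <= high:
            raise ValueError(f"{name}={merged[name]} is outside {low}-{high}")
    return merged
```

For example, `with_tts_defaults({"temperature": 0.9})` keeps the two other parameters at their defaults, while `with_tts_defaults({"cfg_weight": 3.0})` raises `ValueError` without contacting the server.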

---

## 🔧 Configuration

### Environment Variables
The API uses the following directory structure (configurable in `app/config.py`):

- **Speaker Samples**: `{PROJECT_ROOT}/speaker_data/speaker_samples/`
- **Generated Audio**: `{PROJECT_ROOT}/backend/tts_generated_dialogs/`
- **Temporary Files**: `{PROJECT_ROOT}/tts_temp_outputs/`

### CORS Settings
- Allowed Origins: `http://localhost:8001`, `http://127.0.0.1:8001`
- Allowed Methods: All
- Allowed Headers: All
- Credentials: Enabled
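
To make the settings above concrete, here is one way such a configuration module could look. It is not the contents of the project's `app/config.py`, which this reference does not show: the constant names and the `PROJECT_ROOT` value are hypothetical. The keys of `CORS_SETTINGS` are chosen to match the keyword arguments of FastAPI's `CORSMiddleware`.

```python
from pathlib import Path

# Hypothetical install location; the real project root is not stated in this reference.
PROJECT_ROOT = Path("/srv/chatterbox")

# Directory layout as documented above.
SPEAKER_SAMPLES_DIR = PROJECT_ROOT / "speaker_data" / "speaker_samples"
GENERATED_AUDIO_DIR = PROJECT_ROOT / "backend" / "tts_generated_dialogs"
TEMP_OUTPUT_DIR = PROJECT_ROOT / "tts_temp_outputs"

# CORS settings as documented above; "*" stands for "all".
CORS_SETTINGS = {
    "allow_origins": ["http://localhost:8001", "http://127.0.0.1:8001"],
    "allow_methods": ["*"],
    "allow_headers": ["*"],
    "allow_credentials": True,
}
```

In a FastAPI application these settings would typically be applied with `app.add_middleware(CORSMiddleware, **CORS_SETTINGS)`.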

---

## 🚀 Usage Examples

### Python Client Example

```python
import requests

# Base URL of the local API server
BASE_URL = "http://127.0.0.1:8000"

# Get all speakers
speakers = requests.get(f"{BASE_URL}/api/speakers/").json()
print("Available speakers:", speakers)

# The dialog below uses the first speaker, so stop early if none exist
if not speakers:
    raise SystemExit("No speakers found; create one with POST /api/speakers/ first.")

# Generate a simple dialog
dialog_request = {
    "dialog_items": [
        {
            "type": "speech",
            "speaker_id": speakers[0]["id"],
            "text": "Hello world!",
            "exaggeration": 0.7,
            "cfg_weight": 0.6,
            "temperature": 0.9,
        },
        {
            "type": "silence",
            "duration": 1.0,
        },
    ],
    "output_base_name": "test_dialog",
}

response = requests.post(
    f"{BASE_URL}/api/dialog/generate",
    json=dialog_request,
)

if response.status_code == 200:
    result = response.json()
    print("Dialog generated!")
    print("Audio URL:", result["concatenated_audio_url"])
    print("ZIP URL:", result["zip_archive_url"])
else:
    print("Error:", response.text)
```

### JavaScript/Frontend Example

```javascript
const API_BASE = 'http://127.0.0.1:8000';

// Generate dialog
const dialogRequest = {
  dialog_items: [
    {
      type: "speech",
      speaker_id: "speaker_001",
      text: "Welcome to our show!",
      exaggeration: 0.6,
      cfg_weight: 0.5,
      temperature: 0.8
    }
  ],
  output_base_name: "intro"
};

fetch(`${API_BASE}/api/dialog/generate`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(dialogRequest)
})
  .then(response => {
    // Surface HTTP errors instead of trying to play a missing file
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
  })
  .then(data => {
    console.log('Dialog generated:', data);
    // The returned URL is relative to the API server, so resolve it against API_BASE
    const audio = new Audio(new URL(data.concatenated_audio_url, API_BASE).href);
    audio.play();
  })
  .catch(error => console.error('Dialog generation failed:', error));
```

---

## ⚠️ Error Handling

### Common Error Responses

#### 400 Bad Request
```json
{
  "detail": "Invalid value or configuration: Text cannot be empty"
}
```

#### 404 Not Found
```json
{
  "detail": "Speaker sample for ID 'invalid_speaker' not found."
}
```

#### 500 Internal Server Error
```json
{
  "detail": "Runtime error during dialog generation: CUDA out of memory"
}
```

### Error Categories
- **Validation Errors**: Invalid input format, missing required fields
- **Resource Errors**: Speaker not found, file not accessible
- **Processing Errors**: TTS model failures, audio processing issues
- **System Errors**: Memory issues, disk space, model loading failures
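
All of the error bodies shown above carry a `detail` field, so a client can extract one readable message regardless of status code. The helper below is a sketch that also tolerates non-JSON bodies and non-string `detail` values, which is a defensive assumption rather than documented behavior; `error_detail` is a hypothetical name.

```python
import json


def error_detail(body: bytes) -> str:
    """Return a readable message from an error response body."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON (or not valid UTF-8): fall back to the raw text.
        return body.decode("utf-8", errors="replace")
    detail = payload.get("detail") if isinstance(payload, dict) else None
    if isinstance(detail, str):
        return detail
    # Structured or missing detail: return it re-serialized so nothing is lost.
    return json.dumps(detail if detail is not None else payload)
```

With `urllib`, this would typically be called as `error_detail(err.read())` inside an `except urllib.error.HTTPError as err:` block, alongside `err.code` for the status.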

---

## 🔍 Development & Debugging

### Running the Server
```bash
# From project root
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000
```

### API Documentation
- **Swagger UI**: `http://127.0.0.1:8000/docs`
- **ReDoc**: `http://127.0.0.1:8000/redoc`

### Logging
The API provides detailed logging in the `DialogResponse.log` field for dialog generation operations.

### File Management
- Generated files are stored in `backend/tts_generated_dialogs/`
- Temporary processing files are kept for inspection (not auto-deleted)
- ZIP archives contain individual audio segments plus concatenated result
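
When debugging, it can be useful to check which files a downloaded archive actually contains. The sketch below lists the members of a ZIP held in memory using the standard library; `archive_members` is a hypothetical helper, and the naming of the files inside the archive is not specified in this reference.

```python
import io
import zipfile
from typing import List


def archive_members(zip_bytes: bytes) -> List[str]:
    """Return the sorted member names of a ZIP archive given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())
```

The bytes can come from a response to `GET /generated_audio/{filename}` or from a file on disk read with `Path(...).read_bytes()`.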

---

## 📝 Notes

- The API automatically loads and unloads TTS models to manage memory usage
- Speaker audio samples should be clear, single-speaker recordings for best results
- Large dialogs may take significant time to process depending on hardware
- Generated files are served statically and persist until manually cleaned up

---

*Generated on: 2025-06-06*
*API Version: 0.1.0*