Chatterbox TTS Application

A text-to-speech application built on the Chatterbox TTS model, with web, API, Gradio, and command-line interfaces. It supports single-utterance generation, multi-speaker dialogs, and long-form audiobook generation.

Features

Multiple Interfaces: Web UI, FastAPI backend, Gradio interface, and CLI tools
Single Utterance Generation: Generate speech from text using a selected speaker
Dialog Generation: Create multi-speaker conversations with configurable silence gaps
Audiobook Generation: Convert long-form text into narrated audiobooks
Speaker Management: Add/remove speakers with custom audio samples
Paste Script (JSONL) Import: Paste a dialog script as JSONL directly into the editor via a modal
Memory Optimization: Automatic model cleanup after generation
Output Organization: Files saved in organized directories with ZIP packaging

Getting Started

Quick Setup

Clone the repository and install dependencies:

git clone https://github.com/your-username/chatterbox-ui.git
cd chatterbox-ui
pip install -r requirements.txt
npm install

Run automated setup:
```
python setup.py
```
Prepare speaker samples:
- Add audio samples (WAV format) to speaker_data/speaker_samples/
- Configure speakers in speaker_data/speakers.yaml

Windows Quick Start

On Windows, a PowerShell setup script is provided to automate environment setup and startup.

# From the repository root in PowerShell
./setup-windows.ps1

# First time only, if scripts are blocked:
# Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

What it does:

Creates/uses .venv
Upgrades pip and installs deps from backend/requirements.txt and root requirements.txt
Creates a default .env with sensible ports if missing
Starts both servers via start_servers.py

Running the Application

Full-Stack Web Application:

# Start both backend and frontend servers
python start_servers.py

On Windows, the PowerShell setup script also starts both servers once the environment is ready:

./setup-windows.ps1

Individual Components:

# Backend only (FastAPI)
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000

# Frontend only 
cd frontend && python start_dev_server.py

# Gradio interface
python gradio_app.py

Usage

Web Interface

Open the web UI at http://localhost:8001 to create dialogs interactively.

Paste Script (JSONL) in Dialog Editor

Quickly load a dialog by pasting JSONL (one JSON object per line):

Click Paste Script in the Dialog Editor.
Paste JSONL content, for example:

{"type":"speech","speaker_id":"dummy_speaker","text":"Hello there!"}
{"type":"silence","duration":0.5}
{"type":"speech","speaker_id":"dummy_speaker","text":"This is the second line."}

Click Load and confirm replacement if prompted.

Notes:

Input is validated per line; errors report line numbers.
The dialog is saved to localStorage, so it persists across refreshes.
Lines with an unknown speaker_id still load; add the missing speakers later if needed.
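
To check a script before pasting it, the per-line validation described above can be approximated outside the browser. The following is a minimal Python sketch, not the editor's own code: the function name `parse_dialog_jsonl` is hypothetical, and the required fields are inferred only from the example lines above (`speaker_id` and `text` for speech, a non-negative `duration` for silence).

```python
import json


def parse_dialog_jsonl(text):
    """Parse a JSONL dialog script and return (items, errors).

    Field requirements are assumptions based on the README example,
    not the application's actual schema.
    """
    items, errors = [], []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between entries
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        kind = obj.get("type")
        if kind == "speech":
            if not isinstance(obj.get("speaker_id"), str) or not isinstance(obj.get("text"), str):
                errors.append(f"line {line_no}: speech needs string 'speaker_id' and 'text'")
                continue
        elif kind == "silence":
            duration = obj.get("duration")
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration < 0:
                errors.append(f"line {line_no}: silence needs a non-negative 'duration'")
                continue
        else:
            errors.append(f"line {line_no}: unknown type {kind!r}")
            continue
        items.append(obj)
    return items, errors
```

Calling `parse_dialog_jsonl(script)` on the three example lines above yields three items and an empty error list. Like the editor, the sketch does not check whether a `speaker_id` is configured.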

CLI Tools

Single utterance generation:

python cbx-generate.py --sample speaker_samples/voice.wav --output output.wav --text "Hello world"

Dialog generation:

python cbx-dialog-generate.py --dialog dialog.md --output dialog_output

Audiobook generation:

python cbx-audiobook.py --input book.txt --output audiobook --speaker speaker_name

Gradio Interface

Single Utterance Tab: Select speaker, enter text, adjust parameters, generate
Dialog Generation Tab: Configure speakers and create multi-speaker conversations

Dialog format:

Speaker1: "Hello, how are you?"
Speaker2: "I'm doing well!"
Silence: 0.5
Speaker1: "What are your plans for today?"
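
The text format above maps naturally onto the JSONL items used by the web editor. As a rough sketch of that conversion, assuming each line is `Name: content`, that `Silence` is the only reserved name, and that the speaker name can be used directly as `speaker_id` (none of which is confirmed by the application code), a parser might look like this. The name `parse_dialog_text` is hypothetical.

```python
import re

# "Name: content", split at the first colon only.
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_dialog_text(text):
    """Convert 'Speaker: "text"' / 'Silence: seconds' lines into item dicts."""
    items = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # skip blank lines
        match = LINE_RE.match(raw)
        if match is None:
            raise ValueError(f"line {line_no}: expected 'Name: content'")
        name, content = match.groups()
        if name.lower() == "silence":
            try:
                duration = float(content)
            except ValueError:
                raise ValueError(f"line {line_no}: silence needs a number") from None
            items.append({"type": "silence", "duration": duration})
        else:
            # Remove one pair of surrounding double quotes, if present.
            if len(content) >= 2 and content[0] == '"' and content[-1] == '"':
                content = content[1:-1]
            items.append({"type": "speech", "speaker_id": name, "text": content})
    return items
```

Serializing each returned dict with `json.dumps` gives one JSONL line per item, which could then be pasted into the Dialog Editor.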

Architecture Overview

Application Structure

Frontend: Modern vanilla JavaScript web UI (frontend/)
Backend: FastAPI REST API (backend/)
CLI Tools: Command-line utilities (cbx-*.py)
Gradio Interface: Alternative web UI (gradio_app.py)

New Files and Features

cbx-audiobook.py: Generate long-form audiobooks from text files
import_helper.py: Utility for managing imports and dependencies
Backend Services: Enhanced dialog processing, speaker management, and TTS services
Web Frontend: Interactive dialog editor with drag-and-drop functionality

File Organization

single_output/ - Single utterance generations
dialog_output/ - Multi-speaker dialog files
tts_outputs/ - Raw TTS generation files
speaker_data/ - Speaker configurations and audio samples
Generated files packaged in ZIP archives for download
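
For readers who want to bundle an output directory by hand, the ZIP packaging step can be sketched with the standard library. This is an illustration only: the helper name `zip_directory` and the archive layout (paths relative to the source directory) are assumptions, not the application's actual packaging code.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into a ZIP archive at zip_path."""
    source = Path(source_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, arcname=path.relative_to(source).as_posix())
    return zip_path
```

For example, `zip_directory("dialog_output", "dialog_output.zip")` would collect everything under dialog_output/ into a single archive.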

API Endpoints

/api/speakers/ - Speaker CRUD operations
/api/dialog/generate/ - Full dialog generation
/api/dialog/generate_line/ - Single line generation
/generated_audio/ - Static audio file serving
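
The endpoints can be exercised from a script as well as from the web UI. The sketch below uses only the standard library and assumes the backend is running on port 8000 as shown elsewhere in this README. The helper names are hypothetical, and it assumes the endpoint returns JSON; HTTP methods, request bodies, and response shapes are not documented here, so check http://localhost:8000/docs for the real schema.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen

API_BASE = "http://localhost:8000"  # backend port used in this README


def api_url(path, base=API_BASE):
    """Join the API base URL and an endpoint path with exactly one slash."""
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


def get_json(path, base=API_BASE, timeout=10.0):
    """Send a GET request and decode the JSON body (requires a running backend)."""
    request = Request(api_url(path, base), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend started, `get_json("/api/speakers/")` would be the natural first call, on the assumption that a GET on the speakers endpoint lists the configured speakers.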

Configuration

Environment Setup

Key configuration files:

.env - Global settings
backend/.env - Backend-specific settings
frontend/.env - Frontend-specific settings
speaker_data/speakers.yaml - Speaker configuration

Development Commands

# Run tests
python backend/run_api_test.py
npm test

# Backend development
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000

# Access points
# Web UI: http://localhost:8001
# API: http://localhost:8000
# API Docs: http://localhost:8000/docs

Memory Management

The application automatically:

Cleans up the TTS model after each generation
Manages GPU memory (CUDA/MPS devices)
Optimizes memory usage for long-form content
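
The cleanup pattern behind these points is usually to drop the model reference, run the garbage collector, and then clear the device cache. The following is a generic sketch of that pattern, assuming a PyTorch-based model held in a dictionary; the function name `release_model` and the holder layout are hypothetical, and the application's own cleanup code may differ.

```python
import gc

try:
    import torch  # optional: only needed to clear GPU caches
except ImportError:
    torch = None


def release_model(holder, key="model"):
    """Drop holder[key], collect garbage, and clear the device cache.

    Returns the name of the device whose cache was cleared ("cpu" if none).
    """
    holder.pop(key, None)  # remove the reference so the model can be freed
    gc.collect()
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"
    return "cpu"
```

Clearing the cache only helps once no other variable still references the model, which is why the sketch removes the reference before calling the collector.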

Troubleshooting

"Skipping unknown speaker": Configure speaker in speaker_data/speakers.yaml
"Sample file not found": Verify audio files exist in speaker_data/speaker_samples/
Memory issues: Use model reinitialization options for long content
CORS errors: Check frontend/backend port configuration (frontend origin is auto-included when using start_servers.py)
Import errors: Run python import_helper.py to check dependencies

Windows-specific

If PowerShell blocks script execution, run once:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

If Windows Firewall prompts the first time you run servers, allow access on your private network.
