Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from gradio_app.py and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

1. Backend (FastAPI) Development

Objective: Create a robust API to handle TTS generation, speaker management, and file delivery.

Key Modules/Components:

  • API Endpoints:
    • POST /api/dialog/generate:
      • Input: Structured list ([{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]) plus output_base_name: str (request/response models are sketched after this list).
      • Output: JSON with log: str, concatenated_audio_url: str, zip_archive_url: str.
    • GET /api/speakers: Returns list of available speakers ([{id: "str", name: "str", sample_path: "str"}]).
    • POST /api/speakers: Adds a new speaker. Input: name: str, audio_sample_file: UploadFile. Output: {id: "str", name: "str", message: "str"}.
    • DELETE /api/speakers/{speaker_id}: Removes a speaker.
  • Core Logic & Services:
    • TTSService:
      • Manages ChatterboxTTS model instance(s) (loading, inference, memory cleanup).
      • Handles ChatterboxTTS.generate() calls, incorporating parameters like exaggeration, cfg_weight, temperature (decision needed on exposure vs. defaults).
      • Implements rigorous memory management (inspired by generate_audio and process_dialog's reinit_each_line concept).
    • DialogProcessorService:
      • Orchestrates dialog generation using TTSService.
      • Implements split_text_at_sentence_boundaries logic for long text inputs.
      • Manages generation of individual audio segments.
    • AudioManipulationService:
      • Concatenates audio segments using torch and torchaudio, inserting specified silences.
      • Creates ZIP archives of all generated audio files using zipfile.
    • SpeakerManagementService:
      • Manages speakers.yaml (or alternative storage) for speaker metadata (an illustrative format and add-speaker flow are sketched after this list).
      • Handles storage and retrieval of speaker audio samples (e.g., in speaker_samples/).
  • File Handling:
    • Strategy for storing and serving generated .wav and .zip files (e.g., FastAPI StaticFiles, temporary directories, or cloud storage).
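
A minimal sketch of the request/response contract for POST /api/dialog/generate described above. The Pydantic model names, field names, StaticFiles mount path, and service wiring are illustrative assumptions, not final decisions:

```python
# Illustrative sketch only: model names, mount path, and service wiring are assumptions.
from typing import List, Literal, Union

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

class SpeechItem(BaseModel):
    type: Literal["speech"]
    speaker_id: str
    text: str

class SilenceItem(BaseModel):
    type: Literal["silence"]
    duration: float  # seconds of silence to insert

class DialogRequest(BaseModel):
    dialog_items: List[Union[SpeechItem, SilenceItem]]
    output_base_name: str

class DialogResponse(BaseModel):
    log: str
    concatenated_audio_url: str
    zip_archive_url: str

app = FastAPI()
# One option for file delivery: serve generated .wav/.zip files as static assets.
app.mount("/generated_audio", StaticFiles(directory="generated_audio"), name="generated_audio")

@app.post("/api/dialog/generate", response_model=DialogResponse)
async def generate_dialog(request: DialogRequest) -> DialogResponse:
    # Placeholder wiring: DialogProcessorService would drive TTSService per speech item,
    # then AudioManipulationService would concatenate segments and build the ZIP archive.
    raise NotImplementedError
```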
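
One possible shape for speakers.yaml and the add-speaker flow behind POST /api/speakers. The YAML layout, directory names, and method names are assumptions for illustration:

```python
# Illustrative sketch: YAML layout, paths, and method names are assumptions.
import uuid
from pathlib import Path

import yaml
from fastapi import UploadFile

SPEAKERS_FILE = Path("speakers.yaml")   # e.g. {speaker_id: {name: ..., sample_path: ...}}
SAMPLES_DIR = Path("speaker_samples")

class SpeakerManagementService:
    def load_speakers(self) -> dict:
        if SPEAKERS_FILE.exists():
            return yaml.safe_load(SPEAKERS_FILE.read_text()) or {}
        return {}

    def save_speakers(self, speakers: dict) -> None:
        SPEAKERS_FILE.write_text(yaml.safe_dump(speakers))

    async def add_speaker(self, name: str, audio_sample_file: UploadFile) -> dict:
        SAMPLES_DIR.mkdir(exist_ok=True)
        speaker_id = uuid.uuid4().hex
        sample_path = SAMPLES_DIR / f"{speaker_id}.wav"
        sample_path.write_bytes(await audio_sample_file.read())

        speakers = self.load_speakers()
        speakers[speaker_id] = {"name": name, "sample_path": str(sample_path)}
        self.save_speakers(speakers)
        return {"id": speaker_id, "name": name, "message": "Speaker added."}
```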

Implementation Steps (Phase 1):

  1. Project Setup: Initialize FastAPI project, define dependencies (fastapi, uvicorn, python-multipart, pyyaml, torch, torchaudio, chatterbox-tts).
  2. Speaker Management: Implement SpeakerManagementService and the /api/speakers endpoints.
  3. TTS Core: Develop TTSService, focusing on model loading, inference, and critical memory management (a cleanup sketch follows this list).
  4. Dialog Processing: Implement DialogProcessorService including text splitting.
  5. Audio Utilities: Create AudioManipulationService for concatenation and zipping (a concatenation/zipping sketch follows this list).
  6. Main Endpoint: Implement POST /api/dialog/generate orchestrating the services.
  7. Configuration: Manage paths (speakers.yaml, sample storage, output directories) and TTS settings.
  8. Testing: Thoroughly test all API endpoints using tools like Postman or curl.
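
A sketch of the kind of cleanup TTSService (step 3) could perform between generations, in the spirit of the explicit cleanup in gradio_app.py's generate_audio and the reinit_each_line idea; the method name and reload policy here are assumptions:

```python
# Illustrative sketch: method names and the reload policy are assumptions.
import gc
import torch

class TTSService:
    def __init__(self):
        self.model = None  # ChatterboxTTS instance, loaded lazily

    def unload_model(self) -> None:
        # Drop references and force collection so accelerator memory is actually released.
        if self.model is not None:
            del self.model
            self.model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        # torch.mps.empty_cache() is only available on newer torch builds with MPS support.
        if hasattr(torch, "mps") and torch.backends.mps.is_available():
            torch.mps.empty_cache()
```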
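
A sketch of the concatenation and zipping in AudioManipulationService (step 5), assuming mono segments brought to a common sample rate; the function names are placeholders:

```python
# Illustrative sketch: assumes mono segments at a shared sample rate; names are placeholders.
import zipfile
from pathlib import Path
from typing import List, Union

import torch
import torchaudio

def concatenate_segments(items: List[Union[Path, float]],
                         output_path: Path, sample_rate: int) -> None:
    pieces = []
    for item in items:
        if isinstance(item, float):
            # A silence entry: generate `item` seconds of zeros.
            pieces.append(torch.zeros(1, int(item * sample_rate)))
        else:
            waveform, sr = torchaudio.load(str(item))
            # Resample any segment that does not match the target rate.
            if sr != sample_rate:
                waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
            pieces.append(waveform)
    torchaudio.save(str(output_path), torch.cat(pieces, dim=1), sample_rate)

def zip_outputs(files: List[Path], zip_path: Path) -> None:
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            archive.write(file, arcname=file.name)
```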

2. Frontend (Vanilla JavaScript) Development

Objective: Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

Key Modules/Components:

  • HTML (index.html): Structure for dialog editor, speaker controls, results display.
  • CSS (style.css): Styling for a clean and usable interface.
  • JavaScript (app.js, api.js, ui.js):
    • api.js: Functions for all backend API communications (fetch).
    • ui.js: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
    • app.js: Main application logic, event handling, state management (for dialog lines, speaker data).

Implementation Steps (Phase 2):

  1. Basic Layout: Create index.html and style.css.
  2. API Client: Develop api.js to interface with all backend endpoints.
  3. Speaker UI:
    • Fetch and display speakers using ui.js and api.js.
    • Implement forms and logic for adding (with file upload) and removing speakers.
  4. Dialog Editor UI:
    • Dynamically add/remove/reorder dialog lines (speech/silence).
    • Inputs for speaker selection (populated from API), text, and silence duration.
    • Input for output_base_name.
  5. Interaction & Results:
    • "Generate Dialog" button to submit data via api.js.
    • Display generation log, audio player for concatenated output, and download link for ZIP file.

3. Integration & Testing (Phase 3)

  1. Full System Connection: Ensure seamless frontend-backend communication.
  2. End-to-End Testing: Test various dialog scenarios, speaker configurations, and error conditions.
  3. Performance & Memory: Profile backend memory usage during generation; refine TTSService memory strategies if needed.
  4. UX Refinement: Iterate on UI/UX based on testing feedback.

4. Advanced Features & Deployment (Phase 4)

  • (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
  • Real-time Updates: Consider WebSockets for live progress during generation.
  • Deployment Strategy: Plan for deploying the FastAPI application and serving the static frontend assets.

Key Considerations from gradio_app.py Analysis:

  • Memory Management for TTS Model: This is critical. The reinit_each_line option and explicit cleanup in generate_audio highlight this. The FastAPI backend must handle this robustly.
  • Text Chunking: The split_text_at_sentence_boundaries logic (max 300 characters per chunk) is essential and must be replicated (a chunking sketch follows this list).
  • Dialog Parsing: The Speaker: "Text" and Silence: duration format should be the basis for the frontend data structure sent to the backend.
  • TTS Parameters: Decide whether to expose advanced TTS parameters (exaggeration, cfg_weight, temperature) for dialog lines in the new API.
  • File Output: The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
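
A sketch of sentence-boundary chunking with the 300-character cap noted above; the exact splitting rules in gradio_app.py (e.g. handling of abbreviations) may differ, so treat this as illustrative:

```python
# Illustrative sketch: a simple regex-based splitter; the original's rules may differ.
import re
from typing import List

MAX_CHUNK_CHARS = 300

def split_text_at_sentence_boundaries(text: str, max_chars: int = MAX_CHUNK_CHARS) -> List[str]:
    # Split after sentence-ending punctuation, then pack sentences into chunks <= max_chars.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole in this sketch.
    return chunks
```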