Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan
This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from gradio_app.py and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).
1. Backend (FastAPI) Development
Objective: Create a robust API to handle TTS generation, speaker management, and file delivery.
Key Modules/Components:
- API Endpoints:
  - POST /api/dialog/generate
    - Input: Structured list: [{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}], output_base_name: str.
    - Output: JSON with log: str, concatenated_audio_url: str, zip_archive_url: str.
  - GET /api/speakers: Returns list of available speakers ([{id: "str", name: "str", sample_path: "str"}]).
  - POST /api/speakers: Adds a new speaker. Input: name: str, audio_sample_file: UploadFile. Output: {id: "str", name: "str", message: "str"}.
  - DELETE /api/speakers/{speaker_id}: Removes a speaker.
- Core Logic & Services:
  - TTSService:
    - Manages ChatterboxTTS model instance(s) (loading, inference, memory cleanup).
    - Handles ChatterboxTTS.generate() calls, incorporating parameters like exaggeration, cfg_weight, temperature (decision needed on exposure vs. defaults).
    - Implements rigorous memory management (inspired by generate_audio and process_dialog's reinit_each_line concept).
  - DialogProcessorService:
    - Orchestrates dialog generation using TTSService.
    - Implements split_text_at_sentence_boundaries logic for long text inputs.
    - Manages generation of individual audio segments.
  - AudioManipulationService:
    - Concatenates audio segments using torch and torchaudio, inserting specified silences.
    - Creates ZIP archives of all generated audio files using zipfile.
  - SpeakerManagementService:
    - Manages speakers.yaml (or alternative storage) for speaker metadata.
    - Handles storage and retrieval of speaker audio samples (e.g., in speaker_samples/).
- File Handling:
  - Strategy for storing and serving generated .wav and .zip files (e.g., FastAPI StaticFiles, temporary directories, or cloud storage).
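The input shape listed for POST /api/dialog/generate can be sketched as typed items plus a validation step. This is a minimal, dependency-free illustration using standard-library dataclasses; in the actual FastAPI app these would more likely be Pydantic models. The names SpeechItem, SilenceItem, and parse_dialog_items are hypothetical, and both the unit of duration (seconds) and the rule that it must be positive are assumptions not stated in the plan.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class SpeechItem:
    speaker_id: str
    text: str


@dataclass
class SilenceItem:
    duration: float  # assumed to be seconds; the plan does not state the unit


DialogItem = Union[SpeechItem, SilenceItem]


def parse_dialog_items(raw_items: list[dict[str, Any]]) -> list[DialogItem]:
    """Validate the raw JSON list and convert it into typed dialog items."""
    items: list[DialogItem] = []
    for index, raw in enumerate(raw_items):
        kind = raw.get("type")
        if kind == "speech":
            speaker_id = raw.get("speaker_id")
            text = raw.get("text")
            if not isinstance(speaker_id, str) or not speaker_id:
                raise ValueError(f"item {index}: speech needs a speaker_id")
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"item {index}: speech needs non-empty text")
            items.append(SpeechItem(speaker_id=speaker_id, text=text))
        elif kind == "silence":
            duration = raw.get("duration")
            # bool is a subclass of int, so it is rejected explicitly.
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration <= 0:  # positivity is an assumed rule
                raise ValueError(f"item {index}: silence needs a positive duration")
            items.append(SilenceItem(duration=float(duration)))
        else:
            raise ValueError(f"item {index}: unknown type {kind!r}")
    return items
```

Rejecting malformed items before any model call keeps a bad request from consuming TTS time; the endpoint could translate the ValueError into an HTTP 4xx response.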
Implementation Steps (Phase 1):
- Project Setup: Initialize FastAPI project, define dependencies (fastapi, uvicorn, python-multipart, pyyaml, torch, torchaudio, chatterbox-tts).
- Speaker Management: Implement SpeakerManagementService and the /api/speakers endpoints.
- TTS Core: Develop TTSService, focusing on model loading, inference, and critical memory management.
- Dialog Processing: Implement DialogProcessorService including text splitting.
- Audio Utilities: Create AudioManipulationService for concatenation and zipping.
- Main Endpoint: Implement POST /api/dialog/generate orchestrating the services.
- Configuration: Manage paths (speakers.yaml, sample storage, output directories) and TTS settings.
- Testing: Thoroughly test all API endpoints using tools like Postman or curl.
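As a rough illustration of the text splitting step in DialogProcessorService, the sketch below packs whole sentences into chunks under a character limit. It is not the implementation from gradio_app.py, which is not reproduced in this plan; the sentence-detection regex, the max_chars parameter name, and the hard split of over-long sentences are all assumptions.

```python
import re


def split_text_at_sentence_boundaries(text: str, max_chars: int = 300) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Assumed fallback: a single sentence over the limit is cut hard.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be passed to the TTS service as its own segment, so no single generation call receives more than max_chars characters.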
2. Frontend (Vanilla JavaScript) Development
Objective: Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.
Key Modules/Components:
- HTML (index.html): Structure for dialog editor, speaker controls, results display.
- CSS (style.css): Styling for a clean and usable interface.
- JavaScript (app.js, api.js, ui.js):
  - api.js: Functions for all backend API communications (fetch).
  - ui.js: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
  - app.js: Main application logic, event handling, state management (for dialog lines, speaker data).
Implementation Steps (Phase 2):
- Basic Layout: Create index.html and style.css.
- API Client: Develop api.js to interface with all backend endpoints.
- Speaker UI:
  - Fetch and display speakers using ui.js and api.js.
  - Implement forms and logic for adding (with file upload) and removing speakers.
- Dialog Editor UI:
  - Dynamically add/remove/reorder dialog lines (speech/silence).
  - Inputs for speaker selection (populated from API), text, and silence duration.
  - Input for output_base_name.
- Interaction & Results:
  - "Generate Dialog" button to submit data via api.js.
  - Display generation log, audio player for concatenated output, and download link for ZIP file.
3. Integration & Testing (Phase 3)
- Full System Connection: Ensure seamless frontend-backend communication.
- End-to-End Testing: Test various dialog scenarios, speaker configurations, and error conditions.
- Performance & Memory: Profile backend memory usage during generation; refine TTSService memory strategies if needed.
- UX Refinement: Iterate on UI/UX based on testing feedback.
4. Advanced Features & Deployment (Phase 4)
- (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
- Real-time Updates: Consider WebSockets for live progress during generation.
- Deployment Strategy: Plan for deploying the FastAPI application and serving the static frontend assets.
Key Considerations from gradio_app.py Analysis:
- Memory Management for TTS Model: This is critical. The reinit_each_line option and explicit cleanup in generate_audio highlight this. The FastAPI backend must handle this robustly.
- Text Chunking: The split_text_at_sentence_boundaries (max 300 chars) logic is essential and must be replicated.
- Dialog Parsing: The Speaker: "Text" and Silence: duration format should be the basis for the frontend data structure sent to the backend.
- TTS Parameters: Decide whether to expose advanced TTS parameters (exaggeration, cfg_weight, temperature) for dialog lines in the new API.
- File Output: The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
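To make the dialog parsing consideration concrete, the following sketch converts a script in the Speaker: "Text" / Silence: duration format into the structured list expected by POST /api/dialog/generate. The exact grammar accepted by gradio_app.py is not shown in this plan, so the details here are assumptions: one entry per line, blank lines ignored, a case-insensitive Silence keyword, and the speaker name used directly as speaker_id. The function name parse_dialog_script is hypothetical.

```python
import re

# Assumed grammar: one entry per line, e.g.  Alice: "Hello."  or  Silence: 0.5
SPEECH_RE = re.compile(r'^(?P<speaker>[^:]+):\s*"(?P<text>.*)"$')
SILENCE_RE = re.compile(r"^silence:\s*(?P<duration>\d+(?:\.\d+)?)$", re.IGNORECASE)


def parse_dialog_script(script: str) -> list[dict]:
    """Convert a text dialog script into the structured item list."""
    items: list[dict] = []
    for line_number, raw_line in enumerate(script.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # blank lines are skipped (assumption)
        silence = SILENCE_RE.match(line)
        if silence:
            items.append({"type": "silence", "duration": float(silence.group("duration"))})
            continue
        speech = SPEECH_RE.match(line)
        if speech:
            items.append({
                "type": "speech",
                # Assumption: the speaker name doubles as the speaker_id.
                "speaker_id": speech.group("speaker").strip(),
                "text": speech.group("text"),
            })
            continue
        raise ValueError(f"line {line_number}: unrecognized dialog line: {line!r}")
    return items
```

In the new architecture this conversion could live in the frontend instead, since the dialog editor already holds speech and silence lines as separate entries; the Python version is shown only to pin down the mapping between the two formats.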