# Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

### 1. Backend (FastAPI) Development

**Objective:** Create a robust API to handle TTS generation, speaker management, and file delivery.

**Key Modules/Components:**

* **API Endpoints:**
    * `POST /api/dialog/generate` (a request/response model sketch follows at the end of this section):
        * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
        * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
    * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
    * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
    * `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
* **Core Logic & Services:**
    * `TTSService`:
        * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
        * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
        * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
    * `DialogProcessorService`:
        * Orchestrates dialog generation using `TTSService`.
        * Implements `split_text_at_sentence_boundaries` logic for long text inputs.
        * Manages generation of individual audio segments.
    * `AudioManipulationService`:
        * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
        * Creates ZIP archives of all generated audio files using `zipfile`.
    * `SpeakerManagementService`:
        * Manages `speakers.yaml` (or alternative storage) for speaker metadata.
        * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
* **File Handling:**
    * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).

**Implementation Steps (Phase 1):**

1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
4. **Dialog Processing:** Implement `DialogProcessorService`, including text splitting.
5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
6. **Main Endpoint:** Implement `POST /api/dialog/generate`, orchestrating the services.
7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
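To make the `POST /api/dialog/generate` contract above concrete, here is a minimal FastAPI/Pydantic sketch of the request and response models. The field name `dialog_items` and the `/generated_audio/...` URL prefix are illustrative assumptions not fixed by this plan, and the route body is a placeholder for the service orchestration described above, not a working implementation.

```python
from typing import List, Literal, Union

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class SpeechItem(BaseModel):
    """One spoken line: maps to {type: "speech", speaker_id, text} in the plan."""
    type: Literal["speech"] = "speech"
    speaker_id: str
    text: str


class SilenceItem(BaseModel):
    """One pause between lines: maps to {type: "silence", duration}."""
    type: Literal["silence"] = "silence"
    duration: float  # seconds of silence to insert


class DialogRequest(BaseModel):
    # `dialog_items` is an assumed field name; the plan only fixes the item shape.
    dialog_items: List[Union[SpeechItem, SilenceItem]]
    output_base_name: str


class DialogResponse(BaseModel):
    log: str
    concatenated_audio_url: str
    zip_archive_url: str


@app.post("/api/dialog/generate", response_model=DialogResponse)
async def generate_dialog(request: DialogRequest) -> DialogResponse:
    # Placeholder: a real implementation would call DialogProcessorService,
    # then AudioManipulationService, and return URLs of the produced files.
    return DialogResponse(
        log=f"Received {len(request.dialog_items)} dialog items.",
        concatenated_audio_url=f"/generated_audio/{request.output_base_name}_concatenated.wav",
        zip_archive_url=f"/generated_audio/{request.output_base_name}.zip",
    )
```

Using a `type` literal on each item mirrors the structured list in the input spec and lets Pydantic validate speech and silence entries against different schemas within one request body.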
### 2. Frontend (Vanilla JavaScript) Development

**Objective:** Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

**Key Modules/Components:**

* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
* **CSS (`style.css`):** Styling for a clean and usable interface.
* **JavaScript (`app.js`, `api.js`, `ui.js`):**
    * `api.js`: Functions for all backend API communications (`fetch`).
    * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
    * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).

**Implementation Steps (Phase 2):**

1. **Basic Layout:** Create `index.html` and `style.css`.
2. **API Client:** Develop `api.js` to interface with all backend endpoints.
3. **Speaker UI:**
    * Fetch and display speakers using `ui.js` and `api.js`.
    * Implement forms and logic for adding (with file upload) and removing speakers.
4. **Dialog Editor UI:**
    * Dynamically add/remove/reorder dialog lines (speech/silence).
    * Inputs for speaker selection (populated from API), text, and silence duration.
    * Input for `output_base_name`.
5. **Interaction & Results:**
    * "Generate Dialog" button to submit data via `api.js`.
    * Display generation log, audio player for concatenated output, and download link for ZIP file.

### 3. Integration & Testing (Phase 3)

1. **Full System Connection:** Ensure seamless frontend-backend communication.
2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
4. **UX Refinement:** Iterate on UI/UX based on testing feedback.

### 4. Advanced Features & Deployment (Phase 4)

* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
* **Real-time Updates:** Consider WebSockets for live progress during generation.
* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.

### Key Considerations from `gradio_app.py` Analysis

* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated (an illustrative sketch follows this list).
* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
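Because the chunking behavior is called out as essential, here is a hedged sketch of how `split_text_at_sentence_boundaries` might be re-implemented inside `DialogProcessorService`. It is not the original `gradio_app.py` code: it assumes naive `.`/`!`/`?` sentence detection and the 300-character cap noted above.

```python
import re
from typing import List

MAX_CHUNK_LENGTH = 300  # mirrors the ~300-char limit noted in the analysis


def split_text_at_sentence_boundaries(text: str, max_length: int = MAX_CHUNK_LENGTH) -> List[str]:
    """Split long text into chunks that end on sentence boundaries where
    possible and never exceed max_length characters (illustrative sketch)."""
    # Naive sentence segmentation: split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: hard-split a single sentence that is longer than max_length.
        while len(sentence) > max_length:
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_length:
            current = f"{current} {sentence}"
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Any real port should be validated against the Gradio implementation's output on identical inputs before it replaces the original logic.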