# Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

### 1. Backend (FastAPI) Development

**Objective:** Create a robust API to handle TTS generation, speaker management, and file delivery.

**Key Modules/Components:**

* **API Endpoints:**
    * `POST /api/dialog/generate` (a request/response model sketch follows at the end of this section):
        * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
        * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
    * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
    * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
    * `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
* **Core Logic & Services:**
    * `TTSService`:
        * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
        * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
        * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
    * `DialogProcessorService`:
        * Orchestrates dialog generation using `TTSService`.
        * Implements `split_text_at_sentence_boundaries` logic for long text inputs.
        * Manages generation of individual audio segments.
    * `AudioManipulationService`:
        * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
        * Creates ZIP archives of all generated audio files using `zipfile`.
    * `SpeakerManagementService`:
        * Manages `speakers.yaml` (or alternative storage) for speaker metadata.
        * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
* **File Handling:**
    * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).

**Implementation Steps (Phase 1):**

1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
4. **Dialog Processing:** Implement `DialogProcessorService`, including text splitting.
5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
6. **Main Endpoint:** Implement `POST /api/dialog/generate`, orchestrating the services.
7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
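To make the `POST /api/dialog/generate` contract above concrete, here is a minimal FastAPI/Pydantic sketch of the request and response models. The field name `dialog_items` and the `/generated_audio/...` URL prefix are illustrative assumptions not fixed by this plan, and the route body is a placeholder for the service orchestration described above, not a working implementation.

```python
from typing import List, Literal, Union

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class SpeechItem(BaseModel):
    """One spoken line: maps to {type: "speech", speaker_id, text} in the plan."""
    type: Literal["speech"] = "speech"
    speaker_id: str
    text: str


class SilenceItem(BaseModel):
    """One pause between lines: maps to {type: "silence", duration}."""
    type: Literal["silence"] = "silence"
    duration: float  # seconds of silence to insert


class DialogRequest(BaseModel):
    # `dialog_items` is an assumed field name; the plan only fixes the item shape.
    dialog_items: List[Union[SpeechItem, SilenceItem]]
    output_base_name: str


class DialogResponse(BaseModel):
    log: str
    concatenated_audio_url: str
    zip_archive_url: str


@app.post("/api/dialog/generate", response_model=DialogResponse)
async def generate_dialog(request: DialogRequest) -> DialogResponse:
    # Placeholder: a real implementation would call DialogProcessorService,
    # then AudioManipulationService, and return URLs of the produced files.
    return DialogResponse(
        log=f"Received {len(request.dialog_items)} dialog items.",
        concatenated_audio_url=f"/generated_audio/{request.output_base_name}_concatenated.wav",
        zip_archive_url=f"/generated_audio/{request.output_base_name}.zip",
    )
```

Using a `type` literal on each item mirrors the structured list in the input spec and lets Pydantic validate speech and silence entries against different schemas within one request body.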
### 2. Frontend (Vanilla JavaScript) Development

**Objective:** Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

**Key Modules/Components:**

* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
* **CSS (`style.css`):** Styling for a clean and usable interface.
* **JavaScript (`app.js`, `api.js`, `ui.js`):**
    * `api.js`: Functions for all backend API communications (`fetch`).
    * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
    * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).

**Implementation Steps (Phase 2):**

1. **Basic Layout:** Create `index.html` and `style.css`.
2. **API Client:** Develop `api.js` to interface with all backend endpoints.
3. **Speaker UI:**
    * Fetch and display speakers using `ui.js` and `api.js`.
    * Implement forms and logic for adding (with file upload) and removing speakers.
4. **Dialog Editor UI:**
    * Dynamically add/remove/reorder dialog lines (speech/silence).
    * Inputs for speaker selection (populated from API), text, and silence duration.
    * Input for `output_base_name`.
5. **Interaction & Results:**
    * "Generate Dialog" button to submit data via `api.js`.
    * Display generation log, audio player for concatenated output, and download link for ZIP file.

### 3. Integration & Testing (Phase 3)

1. **Full System Connection:** Ensure seamless frontend-backend communication.
2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
4. **UX Refinement:** Iterate on UI/UX based on testing feedback.

### 4. Advanced Features & Deployment (Phase 4)

* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
* **Real-time Updates:** Consider WebSockets for live progress during generation.
* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.

### Key Considerations from `gradio_app.py` Analysis

* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated (an illustrative sketch follows this list).
* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
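Because the chunking behavior is called out as essential, here is a hedged sketch of how `split_text_at_sentence_boundaries` might be re-implemented inside `DialogProcessorService`. It is not the original `gradio_app.py` code: it assumes naive `.`/`!`/`?` sentence detection and the 300-character cap noted above.

```python
import re
from typing import List

MAX_CHUNK_LENGTH = 300  # mirrors the ~300-char limit noted in the analysis


def split_text_at_sentence_boundaries(text: str, max_length: int = MAX_CHUNK_LENGTH) -> List[str]:
    """Split long text into chunks that end on sentence boundaries where
    possible and never exceed max_length characters (illustrative sketch)."""
    # Naive sentence segmentation: split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: hard-split a single sentence that is longer than max_length.
        while len(sentence) > max_length:
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_length:
            current = f"{current} {sentence}"
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Any real port should be validated against the Gradio implementation's output on identical inputs before it replaces the original logic.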