From 9d1dc330eab311e6d21ca14478fcfc1b0a9c8462 Mon Sep 17 00:00:00 2001
From: Steve White
Date: Thu, 5 Jun 2025 09:22:54 -0500
Subject: [PATCH] Update docs in .note

---
 .note/current_focus.md           |   1 +
 .note/detailed_migration_plan.md | 134 +++++++++++++++----------------
 2 files changed, 68 insertions(+), 67 deletions(-)

diff --git a/.note/current_focus.md b/.note/current_focus.md
index 20c3b58..7a8bb46 100644
--- a/.note/current_focus.md
+++ b/.note/current_focus.md
@@ -15,5 +15,6 @@
 - Awaiting your feedback on the detailed migration plan (see `.note/detailed_migration_plan.md`).
 
 **Next Steps (pending your approval of plan):**
+
 - Begin Phase 1: Backend API Development (FastAPI).
   - Task 1.1: Project Setup (FastAPI project structure, `requirements.txt`).
diff --git a/.note/detailed_migration_plan.md b/.note/detailed_migration_plan.md
index 170639b..d4b4aba 100644
--- a/.note/detailed_migration_plan.md
+++ b/.note/detailed_migration_plan.md
@@ -2,93 +2,93 @@
 
 This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).
 
-### 1. Backend (FastAPI) Development
+## 1. Backend (FastAPI) Development
 
 **Objective:** Create a robust API to handle TTS generation, speaker management, and file delivery.
 
 **Key Modules/Components:**
 
-* **API Endpoints:**
-  * `POST /api/dialog/generate`:
-    * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
-    * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
-  * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
-  * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
-  * `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
-* **Core Logic & Services:**
-  * `TTSService`:
-    * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
-    * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
-    * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
-  * `DialogProcessorService`:
-    * Orchestrates dialog generation using `TTSService`.
-    * Implements `split_text_at_sentence_boundaries` logic for long text inputs.
-    * Manages generation of individual audio segments.
-  * `AudioManipulationService`:
-    * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
-    * Creates ZIP archives of all generated audio files using `zipfile`.
-  * `SpeakerManagementService`:
-    * Manages `speakers.yaml` (or alternative storage) for speaker metadata.
-    * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
-* **File Handling:**
-  * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).
+* **API Endpoints:**
+  * `POST /api/dialog/generate`:
+    * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
+    * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
+  * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
+  * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
+  * `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
+* **Core Logic & Services:**
+  * `TTSService`:
+    * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
+    * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
+    * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
+  * `DialogProcessorService`:
+    * Orchestrates dialog generation using `TTSService`.
+    * Implements `split_text_at_sentence_boundaries` logic for long text inputs.
+    * Manages generation of individual audio segments.
+  * `AudioManipulationService`:
+    * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
+    * Creates ZIP archives of all generated audio files using `zipfile`.
+  * `SpeakerManagementService`:
+    * Manages `speakers.yaml` (or alternative storage) for speaker metadata.
+    * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
+* **File Handling:**
+  * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).
 
 **Implementation Steps (Phase 1):**
 
-1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
-2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
-3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
-4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting.
-5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
-6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services.
-7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
-8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
+1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
+2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
+3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
+4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting.
+5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
+6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services.
+7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
+8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
 
-### 2. Frontend (Vanilla JavaScript) Development
+## 2. Frontend (Vanilla JavaScript) Development
 
 **Objective:** Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.
 
 **Key Modules/Components:**
 
-* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
-* **CSS (`style.css`):** Styling for a clean and usable interface.
-* **JavaScript (`app.js`, `api.js`, `ui.js`):
-  * `api.js`: Functions for all backend API communications (`fetch`).
-  * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
-  * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).
+* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
+* **CSS (`style.css`):** Styling for a clean and usable interface.
+* **JavaScript (`app.js`, `api.js`, `ui.js`):**
+  * `api.js`: Functions for all backend API communications (`fetch`).
+  * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
+  * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).
 
 **Implementation Steps (Phase 2):**
 
-1. **Basic Layout:** Create `index.html` and `style.css`.
-2. **API Client:** Develop `api.js` to interface with all backend endpoints.
-3. **Speaker UI:**
-   * Fetch and display speakers using `ui.js` and `api.js`.
-   * Implement forms and logic for adding (with file upload) and removing speakers.
-4. **Dialog Editor UI:**
-   * Dynamically add/remove/reorder dialog lines (speech/silence).
-   * Inputs for speaker selection (populated from API), text, and silence duration.
-   * Input for `output_base_name`.
-5. **Interaction & Results:**
-   * "Generate Dialog" button to submit data via `api.js`.
-   * Display generation log, audio player for concatenated output, and download link for ZIP file.
+1. **Basic Layout:** Create `index.html` and `style.css`.
+2. **API Client:** Develop `api.js` to interface with all backend endpoints.
+3. **Speaker UI:**
+   * Fetch and display speakers using `ui.js` and `api.js`.
+   * Implement forms and logic for adding (with file upload) and removing speakers.
+4. **Dialog Editor UI:**
+   * Dynamically add/remove/reorder dialog lines (speech/silence).
+   * Inputs for speaker selection (populated from API), text, and silence duration.
+   * Input for `output_base_name`.
+5. **Interaction & Results:**
+   * "Generate Dialog" button to submit data via `api.js`.
+   * Display generation log, audio player for concatenated output, and download link for ZIP file.
 
-### 3. Integration & Testing (Phase 3)
+## 3. Integration & Testing (Phase 3)
 
-1. **Full System Connection:** Ensure seamless frontend-backend communication.
-2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
-3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
-4. **UX Refinement:** Iterate on UI/UX based on testing feedback.
+1. **Full System Connection:** Ensure seamless frontend-backend communication.
+2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
+3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
+4. **UX Refinement:** Iterate on UI/UX based on testing feedback.
 
-### 4. Advanced Features & Deployment (Phase 4)
+## 4. Advanced Features & Deployment (Phase 4)
 
-* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
-* **Real-time Updates:** Consider WebSockets for live progress during generation.
-* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.
+* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
+* **Real-time Updates:** Consider WebSockets for live progress during generation.
+* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.
 
-### Key Considerations from `gradio_app.py` Analysis:
+## Key Considerations from `gradio_app.py` Analysis
 
-* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
-* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated.
-* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
-* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
-* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
+* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
+* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated.
+* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
+* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
+* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
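
Review note on the "Text Chunking" and "Dialog Parsing" considerations above: the patch does not include the `gradio_app.py` implementation, so the following is only a minimal sketch of how the two might fit together. The function body of `split_text_at_sentence_boundaries`, the helper name `parse_dialog_script`, and both regular expressions are assumptions for illustration; only the function name, the 300-character limit, the `Speaker: "Text"` / `Silence: duration` line formats, and the output item shapes come from the plan.

```python
import re
from typing import Dict, List, Union

DialogItem = Dict[str, Union[str, float]]

# Assumed line formats, taken from the plan's description:
#   Speaker: "Text"   -> speech item
#   Silence: 0.5      -> silence item (duration in seconds)
_SPEECH_RE = re.compile(r'^\s*([^:]+?)\s*:\s*"(.*)"\s*$')
_SILENCE_RE = re.compile(r'^\s*silence\s*:\s*([0-9]*\.?[0-9]+)\s*$', re.IGNORECASE)


def split_text_at_sentence_boundaries(text: str, max_len: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_len characters.

    Hypothetical re-implementation; the original in gradio_app.py may differ.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Fallback: a single sentence longer than max_len is cut at max_len.
        while len(sentence) > max_len:
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:].lstrip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks


def parse_dialog_script(script: str) -> List[DialogItem]:
    """Turn a text script into the structured list the generate endpoint expects."""
    items: List[DialogItem] = []
    for line_no, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        silence = _SILENCE_RE.match(line)
        if silence:
            items.append({"type": "silence", "duration": float(silence.group(1))})
            continue
        speech = _SPEECH_RE.match(line)
        if not speech:
            raise ValueError(f"line {line_no}: unrecognized dialog line: {line!r}")
        speaker_id, text = speech.group(1), speech.group(2)
        # Long speech lines become several consecutive items for the same speaker.
        for chunk in split_text_at_sentence_boundaries(text):
            items.append({"type": "speech", "speaker_id": speaker_id, "text": chunk})
    return items
```

In the planned architecture the chunking would more likely live in `DialogProcessorService` on the backend, with the frontend sending unsplit text; the two steps are combined here only to keep the example self-contained. Calling `parse_dialog_script('Alice: "Hello there."\nSilence: 0.5')` yields one speech item followed by one silence item.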