Update docs in .note
parent b781d8abcf
commit 9d1dc330ea
@@ -15,5 +15,6 @@
- Awaiting your feedback on the detailed migration plan (see `.note/detailed_migration_plan.md`).

**Next Steps (pending your approval of plan):**

- Begin Phase 1: Backend API Development (FastAPI).
  - Task 1.1: Project Setup (FastAPI project structure, `requirements.txt`).

@@ -2,93 +2,93 @@

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from `gradio_app.py` and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

## 1. Backend (FastAPI) Development

**Objective:** Create a robust API to handle TTS generation, speaker management, and file delivery.

**Key Modules/Components:**

* **API Endpoints:**
  * `POST /api/dialog/generate`:
    * **Input**: Structured list: `[{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]`, `output_base_name: str`.
    * **Output**: JSON with `log: str`, `concatenated_audio_url: str`, `zip_archive_url: str`.
  * `GET /api/speakers`: Returns list of available speakers (`[{id: "str", name: "str", sample_path: "str"}]`).
  * `POST /api/speakers`: Adds a new speaker. Input: `name: str`, `audio_sample_file: UploadFile`. Output: `{id: "str", name: "str", message: "str"}`.
  * `DELETE /api/speakers/{speaker_id}`: Removes a speaker.
* **Core Logic & Services:**
  * `TTSService`:
    * Manages `ChatterboxTTS` model instance(s) (loading, inference, memory cleanup).
    * Handles `ChatterboxTTS.generate()` calls, incorporating parameters like `exaggeration`, `cfg_weight`, `temperature` (decision needed on exposure vs. defaults).
    * Implements rigorous memory management (inspired by `generate_audio` and `process_dialog`'s `reinit_each_line` concept).
  * `DialogProcessorService`:
    * Orchestrates dialog generation using `TTSService`.
    * Implements `split_text_at_sentence_boundaries` logic for long text inputs.
    * Manages generation of individual audio segments.
  * `AudioManipulationService`:
    * Concatenates audio segments using `torch` and `torchaudio`, inserting specified silences.
    * Creates ZIP archives of all generated audio files using `zipfile`.
  * `SpeakerManagementService`:
    * Manages `speakers.yaml` (or alternative storage) for speaker metadata.
    * Handles storage and retrieval of speaker audio samples (e.g., in `speaker_samples/`).
* **File Handling:**
  * Strategy for storing and serving generated `.wav` and `.zip` files (e.g., FastAPI `StaticFiles`, temporary directories, or cloud storage).
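
One way to pin down the `POST /api/dialog/generate` input is to validate the item list before any TTS work begins. The sketch here uses only the standard library. The names `SpeechItem`, `SilenceItem`, and `parse_dialog_items` are hypothetical, and treating `duration` as seconds is an assumption the plan does not confirm. In the FastAPI app this role would likely fall to request models, so read it as an illustration of the payload shape rather than the planned implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass(frozen=True)
class SpeechItem:
    """One spoken line (hypothetical type, mirrors the `speech` item in the plan)."""
    speaker_id: str
    text: str


@dataclass(frozen=True)
class SilenceItem:
    """One pause (hypothetical type); the unit is assumed to be seconds."""
    duration: float


DialogItem = Union[SpeechItem, SilenceItem]


def parse_dialog_items(raw_items: List[Dict[str, Any]]) -> List[DialogItem]:
    """Validate the decoded JSON list and convert it to typed items."""
    items: List[DialogItem] = []
    for index, raw in enumerate(raw_items):
        if not isinstance(raw, dict):
            raise ValueError(f"item {index}: expected an object")
        kind = raw.get("type")
        if kind == "speech":
            speaker_id = raw.get("speaker_id")
            text = raw.get("text")
            if not isinstance(speaker_id, str) or not speaker_id:
                raise ValueError(f"item {index}: speech needs a speaker_id")
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"item {index}: speech needs non-empty text")
            items.append(SpeechItem(speaker_id=speaker_id, text=text))
        elif kind == "silence":
            duration = raw.get("duration")
            # bool is a subclass of int, so reject it explicitly.
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration < 0:
                raise ValueError(f"item {index}: silence needs a non-negative duration")
            items.append(SilenceItem(duration=float(duration)))
        else:
            raise ValueError(f"item {index}: unknown type {kind!r}")
    return items
```

Rejecting bad items up front keeps malformed requests from reaching the model, which matters given how expensive each generation call is.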

**Implementation Steps (Phase 1):**

1. **Project Setup:** Initialize FastAPI project, define dependencies (`fastapi`, `uvicorn`, `python-multipart`, `pyyaml`, `torch`, `torchaudio`, `chatterbox-tts`).
2. **Speaker Management:** Implement `SpeakerManagementService` and the `/api/speakers` endpoints.
3. **TTS Core:** Develop `TTSService`, focusing on model loading, inference, and critical memory management.
4. **Dialog Processing:** Implement `DialogProcessorService` including text splitting.
5. **Audio Utilities:** Create `AudioManipulationService` for concatenation and zipping.
6. **Main Endpoint:** Implement `POST /api/dialog/generate` orchestrating the services.
7. **Configuration:** Manage paths (`speakers.yaml`, sample storage, output directories) and TTS settings.
8. **Testing:** Thoroughly test all API endpoints using tools like Postman or `curl`.
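
The zipping half of step 5 can be sketched with the standard `zipfile` module alone. The function name `build_zip_archive`, the flat archive layout (file names only, no directories), and the choice of DEFLATE compression are assumptions made for illustration; the plan does not specify them.

```python
import zipfile
from pathlib import Path
from typing import Iterable


def build_zip_archive(audio_files: Iterable[Path], zip_path: Path) -> Path:
    """Write the given files into a ZIP archive and return the archive path.

    Hypothetical helper: each file is stored under its bare file name,
    so callers must ensure the names are unique.
    """
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for audio_file in audio_files:
            # arcname drops the directory part so the archive stays flat.
            archive.write(audio_file, arcname=audio_file.name)
    return zip_path
```

The returned path could then be mapped to the `zip_archive_url` field of the response, depending on the file-serving strategy chosen.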

## 2. Frontend (Vanilla JavaScript) Development

**Objective:** Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

**Key Modules/Components:**

* **HTML (`index.html`):** Structure for dialog editor, speaker controls, results display.
* **CSS (`style.css`):** Styling for a clean and usable interface.
* **JavaScript (`app.js`, `api.js`, `ui.js`):**
  * `api.js`: Functions for all backend API communications (`fetch`).
  * `ui.js`: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
  * `app.js`: Main application logic, event handling, state management (for dialog lines, speaker data).

**Implementation Steps (Phase 2):**

1. **Basic Layout:** Create `index.html` and `style.css`.
2. **API Client:** Develop `api.js` to interface with all backend endpoints.
3. **Speaker UI:**
    * Fetch and display speakers using `ui.js` and `api.js`.
    * Implement forms and logic for adding (with file upload) and removing speakers.
4. **Dialog Editor UI:**
    * Dynamically add/remove/reorder dialog lines (speech/silence).
    * Inputs for speaker selection (populated from API), text, and silence duration.
    * Input for `output_base_name`.
5. **Interaction & Results:**
    * "Generate Dialog" button to submit data via `api.js`.
    * Display generation log, audio player for concatenated output, and download link for ZIP file.

## 3. Integration & Testing (Phase 3)

1. **Full System Connection:** Ensure seamless frontend-backend communication.
2. **End-to-End Testing:** Test various dialog scenarios, speaker configurations, and error conditions.
3. **Performance & Memory:** Profile backend memory usage during generation; refine `TTSService` memory strategies if needed.
4. **UX Refinement:** Iterate on UI/UX based on testing feedback.
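
For the memory profiling in step 3, a first pass could use the standard `tracemalloc` module. The helper name `measure_peak_memory` is hypothetical. Note the limitation: `tracemalloc` only traces memory allocated through Python's allocators, so tensor storage that `torch` allocates natively, and GPU memory in particular, would need separate tooling. Treat this as a rough sketch of the measurement pattern, not a complete profiler.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def measure_peak_memory(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, int]:
    """Run func and return (result, peak traced bytes).

    Hypothetical helper; only Python-level allocations are counted.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Wrapping a single generation call this way would give a per-request figure to compare before and after any change to the `TTSService` cleanup strategy.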

## 4. Advanced Features & Deployment (Phase 4)

* (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
* **Real-time Updates:** Consider WebSockets for live progress during generation.
* **Deployment Strategy:** Plan for deploying the FastAPI application and serving the static frontend assets.

## Key Considerations from `gradio_app.py` Analysis

* **Memory Management for TTS Model:** This is critical. The `reinit_each_line` option and explicit cleanup in `generate_audio` highlight this. The FastAPI backend must handle this robustly.
* **Text Chunking:** The `split_text_at_sentence_boundaries` (max 300 chars) logic is essential and must be replicated.
* **Dialog Parsing:** The `Speaker: "Text"` and `Silence: duration` format should be the basis for the frontend data structure sent to the backend.
* **TTS Parameters:** Decide whether to expose advanced TTS parameters (`exaggeration`, `cfg_weight`, `temperature`) for dialog lines in the new API.
* **File Output:** The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
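
The text chunking requirement can be illustrated with a greedy splitter. This is not the code from `gradio_app.py`, whose body is not shown here. It is a sketch under stated assumptions: sentences end at `.`, `!`, or `?` followed by whitespace, and a single sentence longer than the limit is hard-split at the character boundary. The name `chunk_text_by_sentence` is hypothetical; only the 300-character limit comes from the plan.

```python
import re
from typing import List


def chunk_text_by_sentence(text: str, max_len: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_len characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Assumed fallback: hard-split a sentence that exceeds the limit by itself.
        while len(sentence) > max_len:
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would become one TTS call, so the chunk order must be preserved when the resulting audio segments are concatenated.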