Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from gradio_app.py and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

1. Backend (FastAPI) Development

Objective: Create a robust API to handle TTS generation, speaker management, and file delivery.

Key Modules/Components:

  • API Endpoints:
    • POST /api/dialog/generate:
      • Input: Structured list ([{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]) plus output_base_name: str (request/response models are sketched after this list).
      • Output: JSON with log: str, concatenated_audio_url: str, zip_archive_url: str.
    • GET /api/speakers: Returns list of available speakers ([{id: "str", name: "str", sample_path: "str"}]).
    • POST /api/speakers: Adds a new speaker. Input: name: str, audio_sample_file: UploadFile. Output: {id: "str", name: "str", message: "str"}.
    • DELETE /api/speakers/{speaker_id}: Removes a speaker.
  • Core Logic & Services:
    • TTSService:
      • Manages ChatterboxTTS model instance(s) (loading, inference, memory cleanup).
      • Handles ChatterboxTTS.generate() calls, incorporating parameters like exaggeration, cfg_weight, temperature (decision needed on exposure vs. defaults).
      • Implements rigorous memory management (inspired by generate_audio and process_dialog's reinit_each_line concept).
    • DialogProcessorService:
      • Orchestrates dialog generation using TTSService.
      • Implements split_text_at_sentence_boundaries logic for long text inputs.
      • Manages generation of individual audio segments.
    • AudioManipulationService:
      • Concatenates audio segments using torch and torchaudio, inserting specified silences.
      • Creates ZIP archives of all generated audio files using zipfile.
    • SpeakerManagementService:
      • Manages speakers.yaml (or alternative storage) for speaker metadata (an illustrative format and add-speaker flow are sketched after this list).
      • Handles storage and retrieval of speaker audio samples (e.g., in speaker_samples/).
  • File Handling:
    • Strategy for storing and serving generated .wav and .zip files (e.g., FastAPI StaticFiles, temporary directories, or cloud storage).
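
A minimal sketch of the request/response contract for POST /api/dialog/generate described above. The Pydantic model names, field names, StaticFiles mount path, and service wiring are illustrative assumptions, not final decisions:

```python
# Illustrative sketch only: model names, mount path, and service wiring are assumptions.
from typing import List, Literal, Union

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

class SpeechItem(BaseModel):
    type: Literal["speech"]
    speaker_id: str
    text: str

class SilenceItem(BaseModel):
    type: Literal["silence"]
    duration: float  # seconds of silence to insert

class DialogRequest(BaseModel):
    dialog_items: List[Union[SpeechItem, SilenceItem]]
    output_base_name: str

class DialogResponse(BaseModel):
    log: str
    concatenated_audio_url: str
    zip_archive_url: str

app = FastAPI()
# One option for file delivery: serve generated .wav/.zip files as static assets.
app.mount("/generated_audio", StaticFiles(directory="generated_audio"), name="generated_audio")

@app.post("/api/dialog/generate", response_model=DialogResponse)
async def generate_dialog(request: DialogRequest) -> DialogResponse:
    # Placeholder wiring: DialogProcessorService would drive TTSService per speech item,
    # then AudioManipulationService would concatenate segments and build the ZIP archive.
    raise NotImplementedError
```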
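
One possible shape for speakers.yaml and the add-speaker flow behind POST /api/speakers. The YAML layout, directory names, and method names are assumptions for illustration:

```python
# Illustrative sketch: YAML layout, paths, and method names are assumptions.
import uuid
from pathlib import Path

import yaml
from fastapi import UploadFile

SPEAKERS_FILE = Path("speakers.yaml")   # e.g. {speaker_id: {name: ..., sample_path: ...}}
SAMPLES_DIR = Path("speaker_samples")

class SpeakerManagementService:
    def load_speakers(self) -> dict:
        if SPEAKERS_FILE.exists():
            return yaml.safe_load(SPEAKERS_FILE.read_text()) or {}
        return {}

    def save_speakers(self, speakers: dict) -> None:
        SPEAKERS_FILE.write_text(yaml.safe_dump(speakers))

    async def add_speaker(self, name: str, audio_sample_file: UploadFile) -> dict:
        SAMPLES_DIR.mkdir(exist_ok=True)
        speaker_id = uuid.uuid4().hex
        sample_path = SAMPLES_DIR / f"{speaker_id}.wav"
        sample_path.write_bytes(await audio_sample_file.read())

        speakers = self.load_speakers()
        speakers[speaker_id] = {"name": name, "sample_path": str(sample_path)}
        self.save_speakers(speakers)
        return {"id": speaker_id, "name": name, "message": "Speaker added."}
```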

Implementation Steps (Phase 1):

  1. Project Setup: Initialize FastAPI project, define dependencies (fastapi, uvicorn, python-multipart, pyyaml, torch, torchaudio, chatterbox-tts).
  2. Speaker Management: Implement SpeakerManagementService and the /api/speakers endpoints.
  3. TTS Core: Develop TTSService, focusing on model loading, inference, and critical memory management (a cleanup sketch follows this list).
  4. Dialog Processing: Implement DialogProcessorService including text splitting.
  5. Audio Utilities: Create AudioManipulationService for concatenation and zipping (a concatenation/zipping sketch follows this list).
  6. Main Endpoint: Implement POST /api/dialog/generate orchestrating the services.
  7. Configuration: Manage paths (speakers.yaml, sample storage, output directories) and TTS settings.
  8. Testing: Thoroughly test all API endpoints using tools like Postman or curl.
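
A sketch of the kind of cleanup TTSService (step 3) could perform between generations, in the spirit of the explicit cleanup in gradio_app.py's generate_audio and the reinit_each_line idea; the method name and reload policy here are assumptions:

```python
# Illustrative sketch: method names and the reload policy are assumptions.
import gc
import torch

class TTSService:
    def __init__(self):
        self.model = None  # ChatterboxTTS instance, loaded lazily

    def unload_model(self) -> None:
        # Drop references and force collection so accelerator memory is actually released.
        if self.model is not None:
            del self.model
            self.model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        # torch.mps.empty_cache() is only available on newer torch builds with MPS support.
        if hasattr(torch, "mps") and torch.backends.mps.is_available():
            torch.mps.empty_cache()
```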
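
A sketch of the concatenation and zipping in AudioManipulationService (step 5), assuming mono segments brought to a common sample rate; the function names are placeholders:

```python
# Illustrative sketch: assumes mono segments at a shared sample rate; names are placeholders.
import zipfile
from pathlib import Path
from typing import List, Union

import torch
import torchaudio

def concatenate_segments(items: List[Union[Path, float]],
                         output_path: Path, sample_rate: int) -> None:
    pieces = []
    for item in items:
        if isinstance(item, float):
            # A silence entry: generate `item` seconds of zeros.
            pieces.append(torch.zeros(1, int(item * sample_rate)))
        else:
            waveform, sr = torchaudio.load(str(item))
            # Resample any segment that does not match the target rate.
            if sr != sample_rate:
                waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
            pieces.append(waveform)
    torchaudio.save(str(output_path), torch.cat(pieces, dim=1), sample_rate)

def zip_outputs(files: List[Path], zip_path: Path) -> None:
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            archive.write(file, arcname=file.name)
```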

2. Frontend (Vanilla JavaScript) Development

Objective: Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

Key Modules/Components:

  • HTML (index.html): Structure for dialog editor, speaker controls, results display.
  • CSS (style.css): Styling for a clean and usable interface.
  • JavaScript (app.js, api.js, ui.js):
    • api.js: Functions for all backend API communications (fetch).
    • ui.js: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
    • app.js: Main application logic, event handling, state management (for dialog lines, speaker data).

Implementation Steps (Phase 2):

  1. Basic Layout: Create index.html and style.css.
  2. API Client: Develop api.js to interface with all backend endpoints.
  3. Speaker UI:
    • Fetch and display speakers using ui.js and api.js.
    • Implement forms and logic for adding (with file upload) and removing speakers.
  4. Dialog Editor UI:
    • Dynamically add/remove/reorder dialog lines (speech/silence).
    • Inputs for speaker selection (populated from API), text, and silence duration.
    • Input for output_base_name.
  5. Interaction & Results:
    • "Generate Dialog" button to submit data via api.js.
    • Display generation log, audio player for concatenated output, and download link for ZIP file.

3. Integration & Testing (Phase 3)

  1. Full System Connection: Ensure seamless frontend-backend communication.
  2. End-to-End Testing: Test various dialog scenarios, speaker configurations, and error conditions.
  3. Performance & Memory: Profile backend memory usage during generation; refine TTSService memory strategies if needed.
  4. UX Refinement: Iterate on UI/UX based on testing feedback.

4. Advanced Features & Deployment (Phase 4)

  • (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
  • Real-time Updates: Consider WebSockets for live progress during generation.
  • Deployment Strategy: Plan for deploying the FastAPI application and serving the static frontend assets.

Key Considerations from gradio_app.py Analysis:

  • Memory Management for TTS Model: This is critical. The reinit_each_line option and explicit cleanup in generate_audio highlight this. The FastAPI backend must handle this robustly.
  • Text Chunking: The split_text_at_sentence_boundaries logic (max 300 characters per chunk) is essential and must be replicated (a chunking sketch follows this list).
  • Dialog Parsing: The Speaker: "Text" and Silence: duration format should be the basis for the frontend data structure sent to the backend.
  • TTS Parameters: Decide whether to expose advanced TTS parameters (exaggeration, cfg_weight, temperature) for dialog lines in the new API.
  • File Output: The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
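
A sketch of sentence-boundary chunking with the 300-character cap noted above; the exact splitting rules in gradio_app.py (e.g. handling of abbreviations) may differ, so treat this as illustrative:

```python
# Illustrative sketch: a simple regex-based splitter; the original's rules may differ.
import re
from typing import List

MAX_CHUNK_CHARS = 300

def split_text_at_sentence_boundaries(text: str, max_chars: int = MAX_CHUNK_CHARS) -> List[str]:
    # Split after sentence-ending punctuation, then pack sentences into chunks <= max_chars.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole in this sketch.
    return chunks
```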