Chatterbox TTS: Gradio to FastAPI & Vanilla JS Migration Plan

This plan outlines the steps to re-implement the dialog generation features of the Chatterbox TTS application, moving from the current Gradio-based implementation to a FastAPI backend and a vanilla JavaScript frontend. It incorporates findings from gradio_app.py and aligns with the existing high-level strategy (MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f]).

1. Backend (FastAPI) Development

Objective

Create a robust API to handle TTS generation, speaker management, and file delivery.

Key Modules/Components

  • API Endpoints:
    • POST /api/dialog/generate:
      • Input: a structured list of dialog items ([{type: "speech", speaker_id: "str", text: "str"}, {type: "silence", duration: float}]) plus output_base_name: str.
      • Output: JSON with log: str, concatenated_audio_url: str, zip_archive_url: str (see the request/response model sketch after this list).
    • GET /api/speakers: Returns list of available speakers ([{id: "str", name: "str", sample_path: "str"}]).
    • POST /api/speakers: Adds a new speaker. Input: name: str, audio_sample_file: UploadFile. Output: {id: "str", name: "str", message: "str"}.
    • DELETE /api/speakers/{speaker_id}: Removes a speaker.
  • Core Logic & Services:
    • TTSService:
      • Manages ChatterboxTTS model instance(s) (loading, inference, memory cleanup).
      • Handles ChatterboxTTS.generate() calls, incorporating parameters like exaggeration, cfg_weight, temperature (decision needed on exposure vs. defaults).
      • Implements rigorous memory management (inspired by generate_audio and process_dialog's reinit_each_line concept).
    • DialogProcessorService:
      • Orchestrates dialog generation using TTSService.
      • Implements split_text_at_sentence_boundaries logic for long text inputs.
      • Manages generation of individual audio segments.
    • AudioManipulationService:
      • Concatenates audio segments using torch and torchaudio, inserting specified silences.
      • Creates ZIP archives of all generated audio files using zipfile.
    • SpeakerManagementService:
      • Manages speakers.yaml (or alternative storage) for speaker metadata.
      • Handles storage and retrieval of speaker audio samples (e.g., in speaker_samples/).
  • File Handling:
    • Strategy for storing and serving generated .wav and .zip files (e.g., FastAPI StaticFiles, temporary directories, or cloud storage); a minimal StaticFiles mount is sketched after this list.
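
As a rough illustration of the payload shapes above, the dialog generation endpoint could be expressed with Pydantic models and a FastAPI route as sketched below. Model names, field names, and the placeholder handler body are assumptions for illustration, not a finalized contract.

```python
# Hypothetical request/response models for POST /api/dialog/generate.
# Names and shapes are illustrative assumptions, not a committed contract.
from typing import List, Literal, Union

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class SpeechItem(BaseModel):
    type: Literal["speech"] = "speech"
    speaker_id: str
    text: str


class SilenceItem(BaseModel):
    type: Literal["silence"] = "silence"
    duration: float  # seconds of inserted silence


class DialogRequest(BaseModel):
    dialog_items: List[Union[SpeechItem, SilenceItem]]
    output_base_name: str


class DialogResponse(BaseModel):
    log: str
    concatenated_audio_url: str
    zip_archive_url: str


@app.post("/api/dialog/generate", response_model=DialogResponse)
async def generate_dialog(request: DialogRequest) -> DialogResponse:
    # Placeholder: the real handler would delegate to DialogProcessorService,
    # TTSService, and AudioManipulationService as described in this plan.
    return DialogResponse(
        log="not implemented",
        concatenated_audio_url="",
        zip_archive_url="",
    )
```

If stricter validation is desired, Pydantic's discriminated unions (Annotated[Union[...], Field(discriminator="type")]) can disambiguate speech and silence items by their type field.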
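
For file delivery, one straightforward option is to mount the output directory with FastAPI's StaticFiles; the "outputs" directory and "/generated" mount path below are placeholders, and the directory must exist at startup.

```python
# Serve generated .wav and .zip files directly from an output directory.
# Directory name and mount path are placeholders for illustration.
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()
app.mount("/generated", StaticFiles(directory="outputs"), name="generated")
```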

Implementation Steps (Phase 1)

  1. Project Setup: Initialize FastAPI project, define dependencies (fastapi, uvicorn, python-multipart, pyyaml, torch, torchaudio, chatterbox-tts).
  2. Speaker Management: Implement SpeakerManagementService and the /api/speakers endpoints (a service sketch follows this list).
  3. TTS Core: Develop TTSService, focusing on model loading, inference, and critical memory management.
  4. Dialog Processing: Implement DialogProcessorService, including text splitting (an orchestration sketch follows this list).
  5. Audio Utilities: Create AudioManipulationService for concatenation and zipping (a concatenation sketch follows this list).
  6. Main Endpoint: Implement POST /api/dialog/generate orchestrating the services.
  7. Configuration: Manage paths (speakers.yaml, sample storage, output directories) and TTS settings.
  8. Testing: Thoroughly test all API endpoints using tools like Postman or curl.
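
For step 2, a minimal sketch of SpeakerManagementService backed by speakers.yaml could look like the following; the file layout (id mapped to name and sample_path), directory names, and method names are assumptions.

```python
# Minimal sketch of a speakers.yaml-backed speaker registry.
# File layout, paths, and method names are assumptions for illustration.
import uuid
from pathlib import Path

import yaml


class SpeakerManagementService:
    def __init__(self, config_path: Path = Path("speakers.yaml"),
                 samples_dir: Path = Path("speaker_samples")):
        self.config_path = config_path
        self.samples_dir = samples_dir
        self.samples_dir.mkdir(parents=True, exist_ok=True)

    def _load(self) -> dict:
        if not self.config_path.exists():
            return {}
        with self.config_path.open("r") as f:
            return yaml.safe_load(f) or {}

    def _save(self, speakers: dict) -> None:
        with self.config_path.open("w") as f:
            yaml.safe_dump(speakers, f)

    def list_speakers(self) -> list[dict]:
        return [
            {"id": sid, "name": data["name"], "sample_path": data["sample_path"]}
            for sid, data in self._load().items()
        ]

    def add_speaker(self, name: str, sample_bytes: bytes) -> dict:
        speaker_id = uuid.uuid4().hex
        sample_path = self.samples_dir / f"{speaker_id}.wav"
        sample_path.write_bytes(sample_bytes)  # uploaded sample saved as-is
        speakers = self._load()
        speakers[speaker_id] = {"name": name, "sample_path": str(sample_path)}
        self._save(speakers)
        return {"id": speaker_id, "name": name, "sample_path": str(sample_path)}

    def delete_speaker(self, speaker_id: str) -> bool:
        speakers = self._load()
        entry = speakers.pop(speaker_id, None)
        if entry is None:
            return False
        Path(entry["sample_path"]).unlink(missing_ok=True)
        self._save(speakers)
        return True
```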
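
Step 4's orchestration could follow the pattern below. The injected service interfaces (generate_to_file, list_speakers) are assumed for illustration and would have to match whatever the real TTSService and SpeakerManagementService expose.

```python
# Sketch of DialogProcessorService tying the other services together.
# Injected interfaces (tts_service.generate_to_file, speaker_service.list_speakers)
# are assumptions; chunking of long texts is noted but omitted for brevity.
from pathlib import Path


class DialogProcessorService:
    def __init__(self, tts_service, speaker_service, output_dir: Path = Path("outputs")):
        self.tts = tts_service
        self.speakers = speaker_service
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def process(self, dialog_items: list[dict], output_base_name: str):
        segment_paths, silences_after, log_lines = [], [], []
        for index, item in enumerate(dialog_items):
            if item["type"] == "silence":
                # Fold the pause into the gap after the previous segment
                # (a leading silence before any speech is simply dropped here).
                if silences_after:
                    silences_after[-1] += float(item["duration"])
                log_lines.append(f"[{index}] silence: {item['duration']}s")
                continue
            speaker = next(s for s in self.speakers.list_speakers()
                           if s["id"] == item["speaker_id"])
            # Long texts would first be chunked via split_text_at_sentence_boundaries.
            wav_path = self.output_dir / f"{output_base_name}_{index:03d}.wav"
            self.tts.generate_to_file(item["text"], speaker["sample_path"], wav_path)
            segment_paths.append(wav_path)
            silences_after.append(0.0)
            log_lines.append(f"[{index}] {speaker['name']}: {item['text'][:40]}")
        return segment_paths, silences_after, "\n".join(log_lines)
```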
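
For step 5, the concatenation and ZIP packaging could look roughly like this, assuming all segments are mono and share one sample rate; function names and the 24 kHz default are placeholders.

```python
# Sketch of segment concatenation with inserted silences and ZIP packaging.
# Assumes segments share a sample rate and channel count; not production code.
import zipfile
from pathlib import Path

import torch
import torchaudio


def concatenate_segments(segment_paths: list[Path], silences_after: list[float],
                         output_path: Path, sample_rate: int = 24000) -> Path:
    # silences_after[i] is the pause (in seconds) inserted after segment i.
    pieces = []
    for path, silence_s in zip(segment_paths, silences_after):
        waveform, sr = torchaudio.load(str(path))
        if sr != sample_rate:
            waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
        pieces.append(waveform)
        if silence_s > 0:
            pieces.append(torch.zeros(waveform.shape[0], int(silence_s * sample_rate)))
    combined = torch.cat(pieces, dim=1)
    torchaudio.save(str(output_path), combined, sample_rate)
    return output_path


def zip_outputs(file_paths: list[Path], zip_path: Path) -> Path:
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            zf.write(path, arcname=path.name)
    return zip_path
```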

2. Frontend (Vanilla JavaScript) Development

Objective

Create an intuitive UI for dialog construction, speaker management, and interaction with the backend.

Key Modules/Components

  • HTML (index.html): Structure for dialog editor, speaker controls, results display.
  • CSS (style.css): Styling for a clean and usable interface.
  • JavaScript (app.js, api.js, ui.js):
    • api.js: Functions for all backend API communications (fetch).
    • ui.js: DOM manipulation for dynamic dialog lines, speaker lists, and results rendering.
    • app.js: Main application logic, event handling, state management (for dialog lines, speaker data).

Implementation Steps (Phase 2)

  1. Basic Layout: Create index.html and style.css.
  2. API Client: Develop api.js to interface with all backend endpoints.
  3. Speaker UI:
    • Fetch and display speakers using ui.js and api.js.
    • Implement forms and logic for adding (with file upload) and removing speakers.
  4. Dialog Editor UI:
    • Dynamically add/remove/reorder dialog lines (speech/silence).
    • Inputs for speaker selection (populated from API), text, and silence duration.
    • Input for output_base_name.
  5. Interaction & Results:
    • "Generate Dialog" button to submit data via api.js.
    • Display generation log, audio player for concatenated output, and download link for ZIP file.

3. Integration & Testing (Phase 3)

  1. Full System Connection: Ensure seamless frontend-backend communication.
  2. End-to-End Testing: Test various dialog scenarios, speaker configurations, and error conditions.
  3. Performance & Memory: Profile backend memory usage during generation; refine TTSService memory strategies if needed.
  4. UX Refinement: Iterate on UI/UX based on testing feedback.

4. Advanced Features & Deployment (Phase 4)

  • (As per MEMORY[c20c2cce-46d4-453f-9bc3-c18e05dbc66f])
  • Real-time Updates: Consider WebSockets for live progress during generation (a minimal sketch follows this list).
  • Deployment Strategy: Plan for deploying the FastAPI application and serving the static frontend assets.
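
If WebSockets are adopted for live progress, a minimal FastAPI channel could look like the sketch below; the endpoint path and message shape are assumptions, and real progress events would come from the running generation job rather than the hard-coded loop shown here.

```python
# Minimal sketch of a progress channel over WebSockets.
# Endpoint path and message format are assumptions, not a committed design.
from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws/dialog/progress")
async def dialog_progress(websocket: WebSocket):
    await websocket.accept()
    # In a real implementation, events would be pushed from the dialog
    # generation task (e.g. via an asyncio.Queue keyed by job id).
    for segment in range(1, 4):
        await websocket.send_json({"stage": "generating", "segment": segment, "total": 3})
    await websocket.send_json({"stage": "done"})
    await websocket.close()
```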

Key Considerations from gradio_app.py Analysis

  • Memory Management for TTS Model: This is critical; the reinit_each_line option and the explicit cleanup in generate_audio both underline it. The FastAPI backend must handle this robustly (see the cleanup sketch below).
  • Text Chunking: The split_text_at_sentence_boundaries logic (max 300 characters per chunk) is essential and must be replicated (a simplified sketch follows this list).
  • Dialog Parsing: The Speaker: "Text" and Silence: duration format should be the basis for the frontend data structure sent to the backend.
  • TTS Parameters: Decide whether to expose advanced TTS parameters (exaggeration, cfg_weight, temperature) for dialog lines in the new API.
  • File Output: The backend needs to replicate the generation of individual segment files, a concatenated file, and a ZIP archive.
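
On the memory-management point, the cleanup behaviour of generate_audio and the reinit_each_line option could translate into a TTSService along the lines below. The ChatterboxTTS from_pretrained/generate calls follow the library's documented usage; the service structure, method names, and reinit flag are assumptions.

```python
# Illustrative sketch of explicit model cleanup between generations,
# mirroring the Gradio app's reinit_each_line behaviour. Service structure
# and method names are assumptions, not a finalized design.
import gc

import torch


class TTSService:
    def __init__(self, device: str = "cuda"):
        self.device = device
        self.model = None

    def load_model(self):
        # Lazy-load so the model only occupies memory when needed.
        if self.model is None:
            from chatterbox.tts import ChatterboxTTS
            self.model = ChatterboxTTS.from_pretrained(device=self.device)
        return self.model

    def unload_model(self):
        # Drop the reference and push freed memory back to the allocator.
        self.model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def generate_line(self, text: str, speaker_sample_path: str, reinit: bool = False):
        model = self.load_model()
        wav = model.generate(text, audio_prompt_path=speaker_sample_path)
        if reinit:
            # Caps peak memory at the cost of reloading the model per line.
            self.unload_model()
        return wav
```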
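
For text chunking, a simplified re-implementation of the sentence-boundary split with a 300-character cap might look like this; the original split_text_at_sentence_boundaries in gradio_app.py remains the reference, and this sketch does not further split a single sentence that already exceeds the cap.

```python
# Simplified sketch of splitting long text at sentence boundaries with a
# 300-character cap per chunk; port the gradio_app.py logic for exact parity.
import re


def split_text_at_sentence_boundaries(text: str, max_chars: int = 300) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            current = f"{current} {sentence}"
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```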